In this walkthrough we will cover the basics of fitting linear models in `sklearn` using `LinearRegression()`. We will start with a simple polynomial model, focusing on how we can use `PolynomialFeatures()` to build a general set of basis functions on which to fit a basic model.
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split  # used to split the data later on
A Fishy Data set!
In this data set we are going to look at fish from 7 different species, and try to predict weight from simple body measurements. Important stuff!
In total the data set contains 7 measurements for each of 159 fish.
df = pd.read_csv('https://raw.githubusercontent.com/satishgunjal/datasets/master/Fish.csv')
df = df.drop(['Length1', 'Length2', 'Length3'], axis=1)  # Can also use axis='columns'
df.sample(5)  # Display 5 random records
![A random sample of five rows from the fish data set](/images/courses/foundations-in-machine-learning/regression-walkthrough/img1.jpg)
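As a quick sanity check on those claims, we can print the shape of the data frame and the species counts (remember we dropped the three length columns, so only 4 of the original 7 columns remain):

print(df.shape)                      # expect (159, 4) after dropping the three length columns
print(df['Species'].value_counts())  # how many fish of each of the 7 species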
Ok, let us first look at the correlations in the data. For this we calculate and plot a correlation matrix.
df.corr(numeric_only=True)  # 'Species' is non-numeric, so restrict to numeric columns
plt.rcParams["figure.figsize"] = (6, 4)  # Custom figure size in inches
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.title('Correlation Matrix')
![Correlation matrix heatmap for the fish data set](/images/courses/foundations-in-machine-learning/regression-walkthrough/img2.jpg)
sns.pairplot(df, kind='scatter', hue='Species')
![Pairplot of the fish data set, coloured by species](/images/courses/foundations-in-machine-learning/regression-walkthrough/img3.jpg)
If we want to predict weight, height looks like a complex predictor: there appear to be two distinct responses depending on the species. Predictions using width, on the other hand, follow a much more unified relationship that is far less dependent on species.
So let's do that.
X = df['Width']
y = df['Weight']

# Hold back 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
plt.scatter(X_train, y_train, alpha=0.4, color='lightcoral', label='Training Data')
plt.scatter(X_test, y_test, alpha=0.8, color='lightblue', label='Testing Data')
plt.xlabel('Width')
plt.ylabel('Weight')
plt.legend()
plt.show()
![Training and testing data: weight against width](/images/courses/foundations-in-machine-learning/regression-walkthrough/img4.jpg)
There are clear outliers in this data set, but since we have sufficient data their influence will be minimal. We could remove them, but we won't here; a quick way of spotting them is sketched below.
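If we did want to remove them, here is a minimal sketch using an interquartile-range rule on weight (the 1.5 × IQR threshold is a common convention, not something fixed by this walkthrough):

# Flag weights more than 1.5 * IQR beyond the quartiles (a common rule of thumb)
q1, q3 = df['Weight'].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = df['Weight'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(f'{(~inliers).sum()} potential outliers out of {len(df)} fish')
# df = df[inliers]  # uncomment to actually remove them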
Building a Linear Model
Ok, so the first simple model we are going to build is a polynomial model, which looks like this:

$$ y = \mathbf{w}^\top \boldsymbol{\phi}(x) = w_0 + w_1 x + w_2 x^2 $$

So here we talk about $\boldsymbol{\phi}(x) = [1, x, x^2]^\top$ being the feature vector; in this case the features are polynomial features, which are really easy to generate in `sklearn`.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Map each width value to the polynomial features [1, x, x^2]
polynomial_features = PolynomialFeatures(degree=2)
phi_train = polynomial_features.fit_transform(np.array(X_train).reshape(-1, 1))

# Fit a linear model in the polynomial feature space
lr = LinearRegression().fit(phi_train, y_train)
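To see what `PolynomialFeatures` is actually producing, it helps to transform a couple of made-up width values (the numbers here are purely illustrative, not from the data set):

demo = np.array([[2.0], [3.0]])
print(polynomial_features.transform(demo))
# [[1. 2. 4.]
#  [1. 3. 9.]]  -> columns are [1, x, x^2]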
Ok, so now let us plot the fitted function and compare it to the data.
X_all = np.linspace(1.0, 8.0, 1000).reshape(-1, 1)
X_plot_poly = polynomial_features.transform(X_all)  # reuse the transform fitted on the training data
y_pred = lr.predict(X_plot_poly)
plt.scatter(X_train, y_train, alpha=0.4, color='lightcoral', label='Training Data')
plt.scatter(X_test, y_test, alpha=0.8, color='lightblue', label='Testing Data')
plt.plot(X_all, y_pred, '-k', alpha=0.8, label='Linear Model')
plt.xlabel('Width')
plt.ylabel('Weight')
plt.legend()
plt.show()
![Degree-2 polynomial fit against the training and testing data](/images/courses/foundations-in-machine-learning/regression-walkthrough/img5.jpg)
It isn't a good model. It fits ok on average, but we know that weight can't be negative, so we need to impose constraints. `sklearn` gives us some options.
# fit_intercept=False because the bias column is already in the polynomial features;
# positive=True forces the coefficients to be non-negative, so predictions
# can never go below zero for non-negative widths
lr_positive = LinearRegression(fit_intercept=False, positive=True).fit(phi_train, y_train)
y_pred_positive = lr_positive.predict(X_plot_poly)
plt.scatter(X_train, y_train, alpha=0.4, color='lightcoral', label='Training Data')
plt.scatter(X_test, y_test, alpha=0.8, color='lightblue', label='Testing Data')
plt.plot(X_all, y_pred, '-k', alpha=0.8, label='Linear Model')
plt.plot(X_all, y_pred_positive, '-g', alpha=0.8, label='Linear Model w. +ve constraint')
plt.xlabel('Width')
plt.ylabel('Weight')
plt.legend()
plt.show()
![Comparison of the unconstrained and positive-constrained polynomial fits](/images/courses/foundations-in-machine-learning/regression-walkthrough/img6.jpg)
Our model now at least makes physically meaningful predictions, although we can see that the performance of the model is reduced, particularly at smaller width values.
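To put a number on that, we can score both models on the held-out test set; a minimal sketch using `r2_score` from `sklearn.metrics`:

from sklearn.metrics import r2_score

# Transform the test widths with the same polynomial features as the training data
phi_test = polynomial_features.transform(np.array(X_test).reshape(-1, 1))

print('Unconstrained R^2:', r2_score(y_test, lr.predict(phi_test)))
print('Constrained R^2:  ', r2_score(y_test, lr_positive.predict(phi_test)))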
How might we do better?
- Get rid of outliers?
- Try a different form of linear model?
- Apply a transform to the data first? If so, what might you try? One idea is sketched below.
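On that last question, one option (our suggestion, not something the walkthrough prescribes) is to fit in log space: weight scales roughly with volume, i.e. with width cubed, so a log-log fit turns that power law into a straight line and guarantees positive predictions. A minimal sketch:

# Guard against any non-positive weights before taking logs
mask = y_train > 0
log_lr = LinearRegression().fit(
    np.log(np.array(X_train[mask])).reshape(-1, 1), np.log(y_train[mask])
)
y_pred_log = np.exp(log_lr.predict(np.log(X_all)))  # back-transform; always positive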