In this walkthrough we will go through he basics of fitting linear models with sklearn using `LinearRegression()`

. We will look at a simple polynomial model to get started, and focus on understanding how we can use `PolynomialFeature()`

to build a general function on which to build a basic model.

### Adding Libraries

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

### A Fishy Data set!

In this data set we are going to look at some fish data of 7 different species, where we will try and predict weight from simple measurements. Important stuff!

In total the data set contains 7 measurements of 159 fish.

```
df = pd.read_csv('https://raw.githubusercontent.com/satishgunjal/datasets/master/Fish.csv')
df = df.drop(['Length1', 'Length2', 'Length3'], axis =1) # Can also use axis = 'columns'
df.sample(5) # Display random 5 records
```

Ok Let us first look at the correlations in the data, for this we calculate and plot a correlation matrix.

```
df.corr()
plt.rcParams["figure.figsize"] = (6,4) # Custom figure size in inches
sns.heatmap(df.corr(), annot =True)
plt.title('Correlation Matrix')
```

```
sns.pairplot(df, kind = 'scatter', hue = 'Species')
```

If we want to predict weight, using height looks complex. There looks like two clear responses based on the fish type. Yet making predictions using width, looks like a much more unified relationship, much less depedent on fish type.

So let's do that.

```
X = df['Width']
y = df['Weight']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
plt.scatter(X_train,y_train, alpha = 0.4,color='lightcoral',label='Training Data')
plt.scatter(X_test,y_test, alpha = 0.8,color='lightblue',label='Testing Data')
plt.xlabel('Width')
plt.xlabel('Weight')
plt.legend()
plt.show()
```

There are clear outliers in this data set, yet since we have sufficient data their influence will be minimal. We could remove them, but we wont here.

### Building a Linear Model

Ok so the first simple model we are going to build is a polynomial model, which looks like this

So here we talk about $\boldsymbol{\phi}$ being the feature vectors, in this case they a polynomial features; really easy to do in ``sklearn`.

```
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
polynomial_features= PolynomialFeatures(degree=2)
phi_train = polynomial_features.fit_transform(np.array(X_train).reshape(-1,1))
lr = LinearRegression().fit(phi_train, y_train)
```

Ok so now let us plot out the function and compare it to the data.

```
X_all = np.linspace(1.0, 8.0, 1000).reshape(-1,1)
X_plot_poly = polynomial_features.fit_transform(X_all)
y_pred = lr.predict(X_plot_poly)
plt.scatter(X_train,y_train, alpha = 0.4,color='lightcoral',label='Training Data')
plt.scatter(X_test,y_test, alpha = 0.8,color='lightblue',label='Testing Data')
plt.plot(X_all,y_pred,'-k',alpha = 0.8,label='Linear Model')
plt.xlabel('Width')
plt.ylabel('Weight')
plt.legend()
plt.show()
```

It isn't a good model. Ok it fits on average ok, but we know that weight can't be negative. So we need to impose constraints.

`sklearn`

gives us some options.

```
lr_positive = LinearRegression(fit_intercept = False, positive = True).fit(phi_train, y_train)
y_pred_positive = lr_positive.predict(X_plot_poly)
plt.scatter(X_train,y_train, alpha = 0.4,color='lightcoral',label='Training Data')
plt.scatter(X_test,y_test, alpha = 0.8,color='lightblue',label='Testing Data')
plt.plot(X_all,y_pred,'-k',alpha = 0.8,label='Linear Model')
plt.plot(X_all,y_pred_positive,'-g',alpha = 0.8,label='Linear Model w. +ve constraint')
plt.xlabel('Width')
plt.ylabel('Weight')
plt.legend()
plt.show()
```

Our model now at least make physically meaning predictions, although we see the performance of the model is reduce, particular at smalle width values.

How might we do better?

- Get rid of outliers?
- Try a different form of linear model?
- Apply a transform of the data first? If so, then what might you try?