In this explainer we will give a general introduction to linear models.

We will talk through what a linear model is?
How it works for simple linear models, like fitting polynomials which we probably know about before.
Then we look how we can build linear model, which are more expressive.
We will show how linear models are trained, both for example feature problems and data sets, and for much larger problems.
Finally we will touch on the interpretation of bias, which comes naturally out of linear models.

Starter Question. Is the following a linear model?

f(x) = x + 2\sin(2x) + \frac{1}{2}\exp\left(-\frac{x^2}{\lambda}\right)

You should be able to confidently answer this question, and give a reason for your answer by the end of this explainer!

So in this explainer, we are very much looking at supervised learning methods. For this we will start with introducing Generalised Linear Models (or GLMs) in the context of regression.

A regression task is to predict one or more continous target variables $y$ , given a $D$ -dimesional vector $x$ of input variables.

So why are GLMs super important?

The foundations for introducing the core concepts in machine learning.
They are widely used and (can be) highly effective.
If possible to build, they are often more explainable that other models, in because with our understand of "correlation" as a central concept.
They provide a stepping stone to understand more complex machine learning models, in particular Neural Networks and Gaussian Processes.

Fitting Polynomials

So we are probably used to fitting straight lines through data. I remember doing this at school

f(x) = w_0 + w_1x

where $w_0$ has the interpretation of the intercept and $w_1$ is the gradient (or slope) of the curve.

We can extend this to high-dimensional inputs, so what we mean by this is that there are two input variables ${\bf x} = [x_1, x_2]^T$ , and so

f({\bf x}) = w_0 + w_1x_1 + w_2 x_2,

is a linear model (of linear functions) for a two dimensional input. Equally we can have models which are higher order polynomials, where we fit quadratic, cubic or high order curves through the data, so in general we have

f(x) = w_0 + w_1x + w_2x^2 + \ldots + w_Kx^K = w_0 + \sum_{k=1}^Kw_k x^k.

So the key point here is that when we talk about linear models. It doesn't mean the functions themselves are linear (although they can be). So we can also go higher dimensions (more than one input) and higher order

f({\bf x}) = w_0 + w_1x_1 + w_{2}x_2 + w_3x_1^2 + w_4x_1x_2 + w_5x_2^2

for example, but can be arbitary complex

A More "General" Linear Model

A linear model is defined by the linear combination of functions. These functions which are known as basis functions or in Machine Learning features, and can be any general nonlinear function.

In general we can write a linear model as follows

f({\bf x}) = w_0\phi_0({\bf x}) + w_1\phi_1({\bf x}) + \ldots + w_N\phi_K({\bf x}) = \sum_{i=0}^K w_i\phi_i({\bf x}),

where each of the $w_i$ 's (or weights) are scalar numbers or to be fancy we write $w_i \in \mathbb R$ . That is $w_i$ belongs to the set of real numbers. In machine learning you might also see (and will see me write most the time) such a linear model written in vector form as follows

f({\bf x}) = {\bf w}^T{\boldsymbol \phi} \quad \text{where} \quad {\boldsymbol \phi} = [\phi_0({\bf x}), \phi_0({\bf x}), \ldots, \phi_K({\bf x})]^T.

Maybe to labour the point a little, this is just a more general case, so if we think of our original model we have $f(x) = w_0 + w_1x$ , then this is simple

\phi_0(x) \equiv 1 \quad \text{and} \quad \phi_1(x) = x,

and hence

f(x) = {\bf w}^T{\boldsymbol \phi} = [w_0 \quad w_1]^T \begin{bmatrix} 1 \\ x\end{bmatrix}.

It is often convention to reserve the first feature function, i.e. $\phi_0({\bf x})$ to be the constant function. This means that $\phi_0({\bf x})\equiv 1$ , and therefore

f({\bf x}) = w_0 + \sum_{i=1}^K w_i\phi_i({\bf x}_i)

The parameter $w_0$ therefore allows us to model any fixed offset in the data, and with that is often called the bias parameter (different from bias in a statistical sense).

The basis or feature functions themselves can be pre-defined. So for example for a single input parameter example the basis could be powers of $x$ , i.e. $\phi_k(x) = x^k$ . Other options include Mexican Hat Function, where basis functions are defined on overlapping regions (think finite element shape functions) such that

Other basis functions include

\phi_j(x) = \exp\left(-\frac{(x - \mu_k)^2}{2\lambda^2}\right)

where are determined by a general point $\mu_k$ and a length scale $\lambda$ . These are often called Gaussian basis functions. Or another common option is the sigmoidal basis functions which are defined by

\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right) \quad \text{where} \quad \sigma(x) = \frac{1}{1 + \exp(x)}.

To see what these look like on a plot, let's have a look - polynomials on the left, Gaussians in the middle and sigmoid basis functions on the right.

digiLab Academy - General Linear Models - basis functions | academy.digilab.co.uk

We will also see later in the course that the basis functions don't have to be well defined formulae as we have above. They can actually be discovered from the data. Whilst outside the scope of this explainer, but will be covered else where, unsupervised machine learning models can be used. For those interested in looking ahead consider reading about Principle Component Analysis (aka PCA) and Autoencoders. More on that later.

Training / Fitting a Linear Model

Important concept now is that given a set of basis functions $\phi_i({\bf x})$ , we now wish to find the weights ${\bf w}$ which minimise the error on the training data - so we need to introduce the idea of fit!

So a very natural way to define error is for $N$ pairs of training data $({\bf x}_j, y_j)$ , we can define the Mean-square loss function for a given set of weights

\mathcal L = \frac{1}{2N}\sum_{j=1}^N \left( f_{\bf w}({\bf x}_j) - y_j \right)^2 = \frac{1}{2N}\sum_{j=1}^N \left( {\bf w}^T\boldsymbol \phi({\bf x}_j) - y_j \right)^2

Our objective is therefore to minimise this loss function.

A natural way to do this is to find the gradient of error with respect to the weight's $w_i$ , and use this information to learn how to adjust (for nonlinear case) or pick the weights to minimise the error. The gradient basically tells us how much the error will increase/decrease if we increase/decrease the value of $w_i$ . Optimisation strategies then seek to change the value of the weights to reduce the error.

\frac{\partial \mathcal L}{\partial w_i} = \frac{1}{N}\sum_{j=1}^N\left({\bf w}^T{\boldsymbol \phi}({\bf x}_j) - y_j\right)\phi_i({\bf x}_j)

When we have reached a minimum error, then the gradient with respect to each of the weights will be zero.

In general, for nonlinear model, we would use some gradient based optimisation strategy for finding this optimal set of weights which minimises the error. Here, due to the linearity of the model, the solution can be calculated directly in closed-form. We won't go into the detail here, but give the solution. Let us first define the matrix, which is a rectangulat matrix often refered to as the design matrix

\Phi = \begin{bmatrix} \phi_0({\bf x}_1) & \phi_1({\bf x}_1) & \ldots & \phi_K({\bf x}_1)\\ \phi_0({\bf x}_2) & \phi_1({\bf x}_2) & \ldots & \phi_K({\bf x}_2)\\ \vdots & \vdots & \ddots & \vdots \\ \phi_0({\bf x}_N) & \phi_1({\bf x}_N) & \ldots & \phi_K({\bf x}_N) \end{bmatrix}

Then the "optimal" weights for the model are given by

{\bf w} = \left( \Phi^T \Phi\right)^{-1} \Phi^T{\bf y}

We see that the calculation involves taking the inverse of a dense matrix i.e. $\left( \Phi^T \Phi\right)^{-1}$ . This will be computationally expensive if the number of features $K$ and the number of data points $N$ is large. The time for calculation grows quadratically with $K$ and linearly in $N$ . For large numbers of features ( $K$ ) and therefore weights in such case we seek to use optimisation schemes like gradient descent.

Role of the bias parameter $w_0$

We will see in a work sample, how we can implement this and fit a linear model below.

Before we do this. It is a good opportunity to look at the role of $w_0$ the bias parameter, remembering that we conventionally choose $\phi_0({\bf x}) \equiv 1$ . So if we look at what happens by setting the gradient of the error with respect to $w_0$ to zero

\frac{\partial e}{\partial w_0} = \frac{1}{N}\sum_{j=1}^N\Big[{\bf w}^T{\boldsymbol \phi}({\bf x}_j) - y_j\Big]\phi_0({\bf x}_j) = 0

So let us rewrite this a little noting that $\phi_0({\bf x}) \equiv 1$ , then

\frac{1}{N}\sum_{j=1}^N\left(w_0 + \sum_{k=1}^Kw_k\phi_k({\bf x}_j)- y_j\right) = w_0 - \underbrace{\frac{1}{N}\sum_{j=1}^Ny_j}_{\bar y} + \sum_{k=1}^Kw_k\underbrace{\frac{1}{N}\sum_{j=1}^N\phi_k({\bf x}_j)}_{\bar{\phi}_k}

Ok, so all this equal zero, which means

w_0 = \bar{y} - \sum_{k=1}^Kw_k\bar{\phi}_k

What we notice is that the parameter $w_0$ makes a constant shift which corrects between the averages over the training set of the target variables and the weighted sum of averages of the basis functions evaluated at input points.

Fitting Polynomials

A More "General" Linear Model

Training / Fitting a Linear Model

Role of the bias parameter w0w_0w0​

Role of the bias parameter $w_0$