6. K-Means Explainer
Introduction to Machine Learning

k-Means is a simple and popular unsupervised learning algorithm.

In this explainer, we will:

  • Explain the basic idea behind k-Means as an unsupervised clustering algorithm.
  • Give a basic overview of how a k-Means model is trained.
  • Describe an approach for picking the number of clusters $K$.

We have already introduced the idea of unsupervised learning. As a short recap: this is when we don't have labelled data; we simply have a dataset

$$X := \{{\bf x}^{(1)}, {\bf x}^{(2)}, \ldots, {\bf x}^{(N)}\}.$$

Here $N$ is the total number of samples and each sample ${\bf x}^{(j)}$ is a $D$-dimensional vector.
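To make this concrete in code, such a dataset is typically stored as an $N \times D$ array, one row per sample. A minimal sketch using numpy (the values here are made up; the shapes are the point):

```python
import numpy as np

# A toy dataset: N = 6 samples, each a D = 2 dimensional vector.
# Row X[j] plays the role of the sample x^(j).
X = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [5.0, 8.0],
    [8.0, 8.0],
    [1.0, 0.6],
    [9.0, 11.0],
])

N, D = X.shape  # N = 6 samples, D = 2 dimensions
```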

In unsupervised learning we want to understand the patterns in this data. This can (primarily) mean one of two things:

  1. Dimension Reduction - This is when we realise that there are correlations between the components that make up each ${\bf x}^{(j)}$, and hence we can seek methods to reduce the dimension of each data point with little compromise to the total information in the dataset. This is often called dimension reduction or feature extraction, and it is a fundamental part of building good machine learning models.

  2. Clustering - This is where we seek to group samples from the dataset into subsets, or clusters, with similar properties. There are various nice applications for clustering; one which we look at in this course is image segmentation.

Your first unsupervised learning algorithm is going to be k-Means, which is a clustering algorithm.

We are seeking to partition the dataset into $K$ clusters, so using a bit of set notation,

$$X = X_1 \cup X_2 \cup \ldots \cup X_K \quad \text{where} \quad X_i \cap X_j = \emptyset \ \text{ for } i \neq j.$$

This says that the union of all the subgroups is the total set, and that no two distinct clusters intersect. That is, a sample can only be assigned to a single cluster (in mathematical terms, the intersection of any two distinct clusters is the empty set).


Setting up the Mathematical Notation


For the maths we need a nice way to say which set a sample belongs to. To do this we introduce a flag $r_{ik}$, which tells us whether data point $i$ lives in cluster $k$. If data point $i$ does live in cluster $k$ then $r_{ik} = 1$; otherwise it is zero. We can write this as

$$r_{ik} = \begin{cases} 1 & \text{when } {\bf x}^{(i)} \in X_{k} \\ 0 & \text{otherwise.} \end{cases}$$

Since a single data point lives in only a single cluster, if $r_{ik^\star} = 1$ then $r_{ik} = 0$ for all other clusters $k \neq k^\star$. This is the same statement as above, where we said the intersection of any two distinct clusters is empty.

Let's look at a picture first.

[Figure: the assignment matrix $r_{ik}$ as a table with $N = 7$ rows (samples) and $K = 3$ columns (clusters)]

We can think of $r_{ik}$ as a matrix (a table of numbers) - with $N$ rows, one per sample, and $K$ columns, one per cluster. Here we look at an example with $7$ samples and $K = 3$ clusters. On the left, we see that each row has a single value $1$. The column in which the $1$ sits indicates the cluster the sample belongs to. This is shown for one example: $r_{43} = 1$ but $r_{41} = r_{42} = 0$, which tells us that sample ${\bf x}^{(4)} \in X_3$ (it belongs to the third cluster).

The picture on the right shows that we can calculate the number of samples in each cluster by adding up all the values in a column, i.e. by summing over all rows. So here, for cluster (or column) 3, the number of samples in the third cluster $N_3$ is given by

$$N_3 = r_{13} + r_{23} + \ldots + r_{73} = \sum_{i=1}^{7} r_{i3} = 3.$$

So we can pick out the members of each cluster by summing over columns.
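As a sketch of the picture above in code, here is a $7 \times 3$ one-hot assignment matrix in numpy (the particular assignments are made up for illustration); summing each column gives the cluster sizes $N_k$:

```python
import numpy as np

# One-hot assignment matrix r_ik: 7 samples (rows), 3 clusters (columns).
# Each row contains a single 1, in the column of that sample's cluster.
R = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],  # r_43 = 1: sample 4 belongs to cluster 3
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# Cluster sizes N_k: sum each column, i.e. sum over all rows.
N_k = R.sum(axis=0)
print(N_k)  # [2 2 3] -- three samples in cluster 3, matching N_3 = 3
```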

For each subset or cluster we can introduce the concept of a prototype, typical or mean vector ${\boldsymbol \mu}_k$, where

$${\boldsymbol \mu}_k = \frac{\sum_{i=1}^{N} r_{ik}\,{\bf x}^{(i)}}{\sum_{i=1}^{N} r_{ik}}.$$

Does this make sense? The numerator adds up only the samples assigned to cluster $k$ (the flag $r_{ik}$ switches all the others off), and the denominator is just $N_k$, so ${\boldsymbol \mu}_k$ is simply the average of the samples in cluster $k$. If in doubt, write it down for the simple example shown in the picture.
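As a sketch, this formula can be computed for all clusters at once with a single matrix product (dummy data; assumes no cluster is empty):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 2))           # N = 7 samples, D = 2 (dummy data)
R = np.eye(3)[[0, 1, 0, 2, 2, 1, 2]]  # one-hot assignments as in the picture

# mu_k = (sum_i r_ik x^(i)) / (sum_i r_ik), for every k at once:
# R.T @ X has shape (K, D); row k is the sum of the samples in cluster k.
mu = (R.T @ X) / R.sum(axis=0)[:, None]
print(mu.shape)  # (3, 2): one D-dimensional mean per cluster
```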

What makes a good clustering?


Now that we have written down some maths and definitions, we can define belonging to a cluster. But what makes a good clustering?

Here is the idea behind k-Means. Conceptually, we might say that the best clustering minimises (reduces as much as possible) the distance of the samples in each cluster from its mean.

So we can write down an objective function for this

$$\mathcal J = \sum_{i=1}^N\left[\sum_{k=1}^K r_{ik}\, \|{\bf x}^{(i)} - {\boldsymbol \mu}_k\|^2\right].$$

This objective function is also referred to as the Inertia. We see that we sum over all data points and all clusters.

It wouldn't really happen in practice, but if every sample were identical to its cluster's mean then the inertia would be zero.
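As a sketch, the inertia can be computed directly from this definition:

```python
import numpy as np

def inertia(X, R, mu):
    """J = sum_i sum_k r_ik * ||x^(i) - mu_k||^2 (the objective above)."""
    # Squared distance from every sample to every cluster mean: shape (N, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # r_ik picks out, for each sample, the distance to its own cluster's mean.
    return (R * sq_dists).sum()
```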

We could actually have waited to find out the correct choice for ${\boldsymbol \mu}_k$. The means come out of minimising the objective $\mathcal J$ for a fixed assignment of samples to clusters (fixed $r_{ik}$).

For this reason the algorithm is called k-Means.

Training a k-Means Model


Here we won't go into too much detail about how we might train a k-Means model, but the aim of the game is to set the values of $r_{ik}$ which minimise the inertia $\mathcal J$.

Notice that for a given ${\boldsymbol \mu}_k$ the objective $\mathcal J$ is linear with respect to $r_{ik}$. The objective is therefore minimised by assigning each data point $i$ to the cluster $k$ for which the distance from the sample ${\bf x}^{(i)}$ to the mean ${\boldsymbol \mu}_k$ is smallest.
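In code, this assignment step is just an argmin over distances. A minimal sketch:

```python
import numpy as np

def assign(X, mu):
    """Assign each sample to its nearest cluster mean; returns one-hot r_ik."""
    # Squared distance from every sample to every mean: shape (N, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    nearest = sq_dists.argmin(axis=1)    # index of the closest mean per sample
    return np.eye(mu.shape[0])[nearest]  # convert indices to one-hot rows
```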

There are lots of packages which will do k-Means for you, but if you would like a little recipe for how a simple version could be implemented, here you go (a sketch in code follows the list):

  1. We set the number of clusters $K$.

  2. We randomly assign each data point to a cluster, i.e. we assign initial values for $r_{ik}$.

  3. We loop until the objective $\mathcal J$ no longer improves sufficiently. In this loop we:
     a. First calculate the mean vectors ${\boldsymbol \mu}_k$.
     b. Loop through each data point and calculate its distance to each of the cluster means ${\boldsymbol \mu}_k$.
     c. Reassign data points to the clusters for which their distance to the mean is smallest, i.e. update $r_{ik}$.
     d. Recalculate $\mathcal J$.
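Putting the steps together, here is a minimal sketch of this recipe (not a production implementation: it assumes no cluster ever empties out, and it does no restarts; a library such as scikit-learn handles both):

```python
import numpy as np

def k_means(X, K, max_iters=100, tol=1e-6, seed=0):
    """Bare-bones k-Means following the recipe above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Step 2: randomly assign each data point to a cluster (one-hot r_ik).
    R = np.eye(K)[rng.integers(0, K, N)]

    J_old = np.inf
    for _ in range(max_iters):  # Step 3: loop until J stops improving
        # a. Calculate the mean vectors (assumes every cluster is non-empty).
        mu = (R.T @ X) / R.sum(axis=0)[:, None]
        # b. Distance from each data point to each cluster mean.
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        # c. Reassign each point to its nearest mean, i.e. update r_ik.
        R = np.eye(K)[sq_dists.argmin(axis=1)]
        # d. Recalculate the objective J (the inertia).
        J = (R * sq_dists).sum()
        if J_old - J < tol:  # stop when the improvement is negligible
            break
        J_old = J
    return R, mu, J
```

For real work, `sklearn.cluster.KMeans` implements the same idea with a smarter initialisation (k-means++) and multiple restarts.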

Extra Challenges

  • Can you make a basic implementation yourself? Try testing it against the example below.
  • What do you think the computational cost of the algorithm is, i.e. where is all the effort? How might you speed it up?
  • How do you pick $K$? More on that below.

How to pick the number of clusters $K$


In the simple example drawn above, it was clear that the number of clusters we wanted was three. In general though, for an abstract dataset it might not be clear how many clusters the data should be grouped into.

A Simple Approach to Determining $K$ - the Elbow Method

A good k-Means model is one with a low inertia (or objective function), but also a low number of clusters. This is clearly a trade-off, as more clusters will naturally reduce the error, all the way to zero when the number of clusters equals the number of data points.

So let's look at a plot of inertia against number of clusters for this problem.

[Plot: inertia against the number of clusters $K$]

The first approach for picking $K$ is called the elbow method. What does this do?

The "elbow" describes the transition in the gradient of the curve we have just plotted. What we are looking for is the point where the improvement in inertia becomes minimal relative to the increase in complexity. For our example this really is somewhere between 5 and 7 clusters, so let's go with $K = 6$.
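A sketch of how you might generate such a plot yourself, here using scikit-learn's `KMeans` (whose `inertia_` attribute is exactly the objective $\mathcal J$ above); the dataset is a made-up stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in data: a few Gaussian blobs, just to have something to cluster.
rng = np.random.default_rng(0)
centres = [(0, 0), (4, 4), (0, 5), (5, 0), (2, 2), (6, 6)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centres])

# Fit k-Means for a range of K and record the inertia for each.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

for k, J in zip(ks, inertias):
    print(k, J)  # plot inertia against K and look for the 'elbow'
```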

Conclusions


k-Means is a simple and often effective clustering algorithm. As an initial step, clustering lets us explore whether there are groups in our data, informing how we might build a supervised machine learning model, or even how we should collect labels (e.g. semi-supervised learning). Moreover, it can be used in computer vision tasks such as image segmentation and, although not shown here, compression.

A common challenge is determining the number of clusters, and this often takes some engineering judgement, in which we plot the inertia against the number of clusters and look for the "elbow".
