k-means is a simple and popular unsupervised learning algorithm.
In this explainer, we will:
- Explain the basic idea behind k-means as an unsupervised clustering algorithm.
- Give a basic overview of how a k-means model is trained.
- Describe an approach for picking the number of clusters $K$.
We have already introduced the idea of unsupervised learning, so as a short recap: this is when we don't have labelled data, we simply have a data set

$$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}.$$

Here $N$ is the total number of samples and each sample $\mathbf{x}_n$ is a $D$-dimensional vector.
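To make the notation concrete, here is a minimal sketch of such a data set in NumPy; the sizes $N = 100$ and $D = 2$ are made up purely for illustration.

```python
import numpy as np

# A toy data set: N = 100 samples, each sample a D = 2 dimensional vector.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))  # shape (N, D)

print(X.shape)  # (100, 2)
print(X[0])     # the first sample, a 2-dimensional vector
```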
In unsupervised learning we want to understand the patterns in this data. This can (primarily) mean one of two things:
- Dimension Reduction - This is when we realise that there are correlations between the components that make up each sample, and hence we can seek methods to reduce the dimension of each data point with little compromise to the total information in the dataset. This is also called feature extraction, and it is a fundamental part of building good machine learning models.
- Clustering - This is where we seek to naturally group samples from the data set into subsets, or clusters, with similar properties. There are various nice applications of clustering, one of which, image segmentation, we look at in this course.
So your first unsupervised learning algorithm is going to be k-means. This is a clustering algorithm.
With k-means we are seeking to partition the dataset $\mathcal{D}$ into $K$ clusters $\mathcal{C}_1, \ldots, \mathcal{C}_K$. Using a bit of set notation,

$$\bigcup_{k=1}^{K} \mathcal{C}_k = \mathcal{D}, \qquad \mathcal{C}_j \cap \mathcal{C}_k = \emptyset \quad \text{for } j \neq k.$$

This basically says the union of all the subgroups is the total set, and no two clusters intersect. That is, a sample can only be assigned to a single cluster (in maths, the intersection of any two clusters is the empty set).
Setting up the Mathematical Notation
For the maths we need a nice way to say which cluster a sample belongs to. To do this we introduce a flag $r_{nk}$ which tells us if data point $\mathbf{x}_n$ lives in cluster $k$: if it does then $r_{nk} = 1$, otherwise it is zero. We can write a bit of maths for this:

$$r_{nk} = \begin{cases} 1 & \text{if } \mathbf{x}_n \in \mathcal{C}_k \\ 0 & \text{otherwise.} \end{cases}$$

Since a single data point only lives in a single cluster, if $r_{nk} = 1$ then $r_{nj} = 0$ for all other clusters $j \neq k$. This is the same statement as above, where we said the intersection of any two clusters is empty.
Let's look at a picture first
We can think of $r_{nk}$ as a matrix (a table of numbers), with $N$ rows, one per sample, and $K$ columns, one for each cluster. Here we look at an example with $N$ samples and $K = 3$ clusters. On the left, we see that each row $n$ has a single value $r_{nk} = 1$. The column in which the $1$ sits indicates the cluster the sample belongs to. This is shown for one example, where $r_{n3} = 1$ tells me sample $\mathbf{x}_n$ belongs to the third cluster.

The picture on the right shows that we can calculate the number of samples in each cluster by adding up all the values in a column, i.e. by summing over all rows. So here for cluster (or column) 3, the number of samples in the third cluster is given by

$$N_3 = \sum_{n=1}^{N} r_{n3},$$

and in general $N_k = \sum_{n=1}^{N} r_{nk}$. So basically we can count the members of each cluster by summing over columns.
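As a small illustration (the numbers here are made up, not taken from the picture), here is how the assignment matrix and its column sums look in NumPy:

```python
import numpy as np

# One-hot assignment matrix R with N = 5 samples and K = 3 clusters.
R = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Every row sums to 1: each sample belongs to exactly one cluster.
assert (R.sum(axis=1) == 1).all()

# Summing each column gives the number of samples per cluster, N_k.
N_k = R.sum(axis=0)
print(N_k)  # [2 1 2]
```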
For each subset or cluster we can introduce the concept of a prototype, typical or mean vector $\boldsymbol{\mu}_k$, where

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, \mathbf{x}_n.$$
Does this make sense? If in doubt write it down for the simple example shown in the picture.
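As a sketch of the same calculation in code, again with made-up numbers, the means of all clusters can be computed in one go:

```python
import numpy as np

# Made-up example: N = 5 samples in D = 2 dimensions, K = 3 clusters.
X = np.array([[0.0, 0.0],
              [4.0, 4.0],
              [2.0, 0.0],
              [4.0, 2.0],
              [0.0, 2.0]])
R = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# mu_k = (1 / N_k) * sum_n r_nk x_n, computed for all clusters at once.
N_k = R.sum(axis=0)            # samples per cluster, shape (K,)
mu = (R.T @ X) / N_k[:, None]  # cluster means, shape (K, D)

print(mu)  # [[0. 1.], [2. 0.], [4. 3.]]
```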
What makes a good clustering?
So now we have written down some maths and definitions, we can define belonging to a cluster. But what is a good clustering?
So here is the idea behind k-means. Conceptually we might say that the best clustering minimises (reduces as much as possible) the distance of the samples in each cluster from its mean.
So we can write down an objective function for this:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2.$$

This objective function is also referred to as the inertia. We see that we sum over all data points and all clusters, but the flag $r_{nk}$ means that only the distance from each sample to its own cluster mean contributes.
It wouldn't really happen, but if every sample coincided exactly with its cluster mean then this would be zero.
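In code, the inertia can be evaluated directly from its definition; here is a sketch using the same made-up example as above:

```python
import numpy as np

# Same made-up example as before.
X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 0.0], [4.0, 2.0], [0.0, 2.0]])
R = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
mu = (R.T @ X) / R.sum(axis=0)[:, None]

# Squared distance from every sample to every cluster mean, shape (N, K).
sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)

# The flag r_nk picks out only the distance to each sample's own cluster.
J = (R * sq_dist).sum()
print(J)  # 4.0
```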
We could actually have waited to find out what the correct choice for $\boldsymbol{\mu}_k$ is: the means come out by minimising the objective for a fixed assignment of samples to clusters (fixed $r_{nk}$).
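To see this, differentiate $J$ with respect to $\boldsymbol{\mu}_k$ and set the result to zero:

$$\frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2 \sum_{n=1}^{N} r_{nk} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) = 0 \quad \Longrightarrow \quad \boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \, \mathbf{x}_n}{\sum_{n} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, \mathbf{x}_n,$$

which is exactly the mean vector we defined earlier.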
For this reason the algorithm is called k-means.
Training a k-Means Model
Here we won't go into too much detail about how we might train a k-means model, but the aim of the game is to set the values of $r_{nk}$ and $\boldsymbol{\mu}_k$ which minimise the inertia $J$.
We notice that for a given set of means $\boldsymbol{\mu}_k$, the objective $J$ is linear with respect to $r_{nk}$. The objective is then minimised by assigning each data point to the cluster whose mean it is closest to, i.e.

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise.} \end{cases}$$
There are lots of packages which will do k-means for you, but if you would like a little recipe for how a simple version might be implemented, here you go (a code sketch follows the list):
- We set the number of clusters $K$.
- We randomly assign each data point to a cluster, i.e. we assign initial values for $r_{nk}$.
- We loop until the objective $J$ no longer improves sufficiently. In this loop we:
  a. First calculate the mean vectors $\boldsymbol{\mu}_k$.
  b. Loop through each data point and calculate its distance to each of the cluster means $\boldsymbol{\mu}_k$.
  c. Reassign each data point to the cluster for which its distance to the mean is smallest, i.e. update $r_{nk}$.
  d. Recalculate $J$.
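Here is a minimal sketch of that recipe in NumPy. The function name `kmeans` and the parameters `max_iters` and `tol` are our own choices, and for simplicity it assumes no cluster ever ends up empty; a robust implementation would handle that case.

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    """Minimal k-means sketch following the recipe above.

    X: (N, D) array of samples, K: number of clusters.
    Returns the cluster labels, the mean vectors and the final inertia J.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]

    # Randomly assign each data point to a cluster (initial r_nk).
    labels = rng.integers(0, K, size=N)

    J_old = np.inf
    for _ in range(max_iters):
        # a. Calculate the mean vector of each cluster.
        #    (Assumes every cluster has at least one member.)
        mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        # b. Squared distance from every sample to every mean, shape (N, K).
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)

        # c. Reassign each point to its nearest cluster mean (update r_nk).
        labels = sq_dist.argmin(axis=1)

        # d. Recalculate J and stop once it no longer improves sufficiently.
        J = sq_dist[np.arange(N), labels].sum()
        if J_old - J < tol:
            break
        J_old = J

    return labels, mu, J
```

You could try it on a made-up blob data set, e.g. `labels, mu, J = kmeans(X, K=3)`, and check that the returned means sit in the middle of the blobs.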
Extra Challenges
- Can you make a basic implementation yourself - even testing it against the example below.
- What do you think the computational cost of the algorithm is, i.e. where is all the effort? How might you speed it up?
- How do you pick $K$? More on that below.
How to pick the number of clusters
In the simple example that I drew above, it was clear that the number of clusters we wanted was three. In general though, for an abstract data set it might not be clear how many clusters the data should be partitioned into.
A Simple Approach to Determining $K$ - the Elbow Method
A good k-means model is one with a low inertia (or objective function), but also a low number of clusters. This is clearly a trade-off, as a larger number of clusters will naturally reduce the inertia, all the way to zero when the number of clusters equals the number of data points.
So let's look at a plot of inertia against number of clusters for this problem.
The first approach for picking $K$ is called the elbow method. What does this do?

The "elbow" describes the transition in the gradient of the plot we have just made. What we are looking for is the point where the improvement in inertia becomes minimal for the increase in complexity. For our example we see this really is between 5 and 7 clusters, so we would pick $K$ somewhere in that range.
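As a sketch of how such a plot might be produced, here is one way using scikit-learn's `KMeans` (its `inertia_` attribute is the objective $J$ above). The data set below is made up, so its elbow will land wherever its blobs dictate rather than matching the plot above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# A made-up blob data set; replace X with your own (N, D) array.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in ([0, 0], [5, 0], [2, 4], [6, 4], [3, -3], [8, 2])])

# Fit a k-means model for each candidate K and record the inertia.
Ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in Ks]

plt.plot(list(Ks), inertias, marker="o")
plt.xlabel("Number of clusters $K$")
plt.ylabel("Inertia $J$")
plt.show()
```

In practice you would run this on your own data array and read the elbow off the resulting curve.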
Conclusions
k-means is a simple and often effective clustering algorithm. As an initial step, clustering lets us explore whether there are groups in our data, informing how we might build a supervised machine learning model, or even how we should collect labels (e.g. for semi-supervised learning). Moreover it can be used in computer vision tasks such as image segmentation and, although not shown here, compression.
Determining the number of clusters is often a challenge, and typically takes some engineering judgement: we plot inertia against the number of clusters and look for the "elbow".