k-means is a simple and popular unsupervised learning algorithm.
In this explainer, we will:
- Explain the basic idea behind k-means as an unsupervised clustering algorithm.
- Give a basic overview of how a k-means model is trained.
- Describe an approach for picking the number of clusters $K$.
We have already introduced the idea of unsupervised learning, so as a short recap: this is when we don't have labelled data, we simply have a data set

$$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}.$$

Here $N$ is the total number of samples and each sample $\mathbf{x}_n$ is a $D$-dimensional vector.
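To make the notation concrete, here is a minimal sketch of such a data set in NumPy; the sizes $N = 100$ and $D = 2$ are made up purely for illustration.

```python
import numpy as np

# A toy data set: N = 100 samples, each sample a D = 2 dimensional vector.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))  # shape (N, D)

print(X.shape)  # (100, 2)
print(X[0])     # the first sample, a 2-dimensional vector
```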
In unsupervised learning we want to understand the patterns in this data. This can (primarily) mean one of two things:
- Dimension Reduction - This is when we realise that there are correlations between the components that make up each sample, and hence we can seek methods to reduce the dimension of each data point with little compromise to the total information in the dataset. This is also called feature extraction, and it is a fundamental part of building good machine learning models.
- Clustering - This is where we seek to naturally group samples from the data set into subsets, or clusters, with similar properties. There are various nice applications of clustering, one of which, image segmentation, we look at in this course.
So your first unsupervised learning algorithm is going to be k-means. This is a clustering algorithm.
With k-means we are seeking to partition the dataset $\mathcal{D}$ into $K$ clusters $\mathcal{C}_1, \ldots, \mathcal{C}_K$. Using a bit of set notation,

$$\bigcup_{k=1}^{K} \mathcal{C}_k = \mathcal{D}, \qquad \mathcal{C}_j \cap \mathcal{C}_k = \emptyset \quad \text{for } j \neq k.$$

This basically says the union of all the subgroups is the total set, and no two clusters intersect. That is, a sample can only be assigned to a single cluster (in maths, the intersection of any two clusters is the empty set).
Setting up the Mathematical Notation
For the maths we need a nice way to say which cluster a sample belongs to. To do this we introduce a flag $r_{nk}$ which tells us if data point $\mathbf{x}_n$ lives in cluster $k$: if it does then $r_{nk} = 1$, otherwise it is zero. We can write a bit of maths for this:

$$r_{nk} = \begin{cases} 1 & \text{if } \mathbf{x}_n \in \mathcal{C}_k \\ 0 & \text{otherwise.} \end{cases}$$

Since a single data point only lives in a single cluster, if $r_{nk} = 1$ then $r_{nj} = 0$ for all other clusters $j \neq k$. This is the same statement as above, where we said the intersection of any two clusters is empty.
Let's look at a picture first
We can think of $r_{nk}$ as a matrix (a table of numbers), with $N$ rows, one per sample, and $K$ columns, one for each cluster. Here we look at an example with $N$ samples and $K = 3$ clusters. On the left, we see that each row $n$ has a single value $r_{nk} = 1$. The column in which the $1$ sits indicates the cluster the sample belongs to. This is shown for one example, where $r_{n3} = 1$ tells me sample $\mathbf{x}_n$ belongs to the third cluster.

The picture on the right shows that we can calculate the number of samples in each cluster by adding up all the values in a column, i.e. by summing over all rows. So here for cluster (or column) 3, the number of samples in the third cluster is given by

$$N_3 = \sum_{n=1}^{N} r_{n3},$$

and in general $N_k = \sum_{n=1}^{N} r_{nk}$. So basically we can count the members of each cluster by summing over columns.
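As a small illustration (the numbers here are made up, not taken from the picture), here is how the assignment matrix and its column sums look in NumPy:

```python
import numpy as np

# One-hot assignment matrix R with N = 5 samples and K = 3 clusters.
R = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Every row sums to 1: each sample belongs to exactly one cluster.
assert (R.sum(axis=1) == 1).all()

# Summing each column gives the number of samples per cluster, N_k.
N_k = R.sum(axis=0)
print(N_k)  # [2 1 2]
```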
For each subset or cluster we can introduce the concept of a prototype, typical or mean vector $\boldsymbol{\mu}_k$, where

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, \mathbf{x}_n.$$
Does this make sense? If in doubt write it down for the simple example shown in the picture.
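As a sketch of the same calculation in code, again with made-up numbers, the means of all clusters can be computed in one go:

```python
import numpy as np

# Made-up example: N = 5 samples in D = 2 dimensions, K = 3 clusters.
X = np.array([[0.0, 0.0],
              [4.0, 4.0],
              [2.0, 0.0],
              [4.0, 2.0],
              [0.0, 2.0]])
R = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# mu_k = (1 / N_k) * sum_n r_nk x_n, computed for all clusters at once.
N_k = R.sum(axis=0)            # samples per cluster, shape (K,)
mu = (R.T @ X) / N_k[:, None]  # cluster means, shape (K, D)

print(mu)  # [[0. 1.], [2. 0.], [4. 3.]]
```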
What makes a good clustering?
So now we have written down some maths and definitions, we can define belonging to a cluster. But what is a good clustering?
So here is the idea behind k-means. Conceptually we might say that the best clustering minimises (reduces as much as possible) the distance of the samples in each cluster from its mean.
So we can write down an objective function for this:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2.$$

This objective function is also referred to as the inertia. We see that we sum over all data points and all clusters, but the flag $r_{nk}$ means that only the distance from each sample to its own cluster mean contributes.
It wouldn't really happen, but if every sample coincided exactly with its cluster mean then this would be zero.
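In code, the inertia can be evaluated directly from its definition; here is a sketch using the same made-up example as above:

```python
import numpy as np

# Same made-up example as before.
X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 0.0], [4.0, 2.0], [0.0, 2.0]])
R = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
mu = (R.T @ X) / R.sum(axis=0)[:, None]

# Squared distance from every sample to every cluster mean, shape (N, K).
sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)

# The flag r_nk picks out only the distance to each sample's own cluster.
J = (R * sq_dist).sum()
print(J)  # 4.0
```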
We could actually have waited to find out what the correct choice for $\boldsymbol{\mu}_k$ is: the means come out by minimising the objective for a fixed assignment of samples to clusters (fixed $r_{nk}$).
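To see this, differentiate $J$ with respect to $\boldsymbol{\mu}_k$ and set the result to zero:

$$\frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2 \sum_{n=1}^{N} r_{nk} \left( \mathbf{x}_n - \boldsymbol{\mu}_k \right) = 0 \quad \Longrightarrow \quad \boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \, \mathbf{x}_n}{\sum_{n} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, \mathbf{x}_n,$$

which is exactly the mean vector we defined earlier.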
For this reason the algorithm is called k-means.
Training a k-Means Model
Here we won't go into too much detail about how we might train a k-means model, but the aim of the game is to set the values of $r_{nk}$ and $\boldsymbol{\mu}_k$ which minimise the inertia $J$.
We notice that for a given set of means $\boldsymbol{\mu}_k$, the objective $J$ is linear with respect to $r_{nk}$. The objective is then minimised by assigning each data point to the cluster whose mean it is closest to, i.e.

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise.} \end{cases}$$
There are lots of packages which will do k-means for you, but if you would like a little recipe for how a simple version might be implemented, here you go (a code sketch follows the list):
- We set the number of clusters $K$.
- We randomly assign each data point to a cluster, i.e. we assign initial values for $r_{nk}$.
- We loop until the objective $J$ no longer improves sufficiently. In this loop we:
  a. First calculate the mean vectors $\boldsymbol{\mu}_k$.
  b. Loop through each data point and calculate its distance to each of the cluster means $\boldsymbol{\mu}_k$.
  c. Reassign each data point to the cluster for which its distance to the mean is smallest, i.e. update $r_{nk}$.
  d. Recalculate $J$.
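Here is a minimal sketch of that recipe in NumPy. The function name `kmeans` and the parameters `max_iters` and `tol` are our own choices, and for simplicity it assumes no cluster ever ends up empty; a robust implementation would handle that case.

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    """Minimal k-means sketch following the recipe above.

    X: (N, D) array of samples, K: number of clusters.
    Returns the cluster labels, the mean vectors and the final inertia J.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]

    # Randomly assign each data point to a cluster (initial r_nk).
    labels = rng.integers(0, K, size=N)

    J_old = np.inf
    for _ in range(max_iters):
        # a. Calculate the mean vector of each cluster.
        #    (Assumes every cluster has at least one member.)
        mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        # b. Squared distance from every sample to every mean, shape (N, K).
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)

        # c. Reassign each point to its nearest cluster mean (update r_nk).
        labels = sq_dist.argmin(axis=1)

        # d. Recalculate J and stop once it no longer improves sufficiently.
        J = sq_dist[np.arange(N), labels].sum()
        if J_old - J < tol:
            break
        J_old = J

    return labels, mu, J
```

You could try it on a made-up blob data set, e.g. `labels, mu, J = kmeans(X, K=3)`, and check that the returned means sit in the middle of the blobs.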
Extra Challenges
- Can you make a basic implementation yourself - even testing it against the example below.
- What do you think the computational cost of the algorithm is, i.e. where is all the effort? How might you speed it up?
- How do you pick $K$? More on that below.
How to pick the number of clusters
In the simple example that I drew above, it was clear that the number of clusters we wanted was three. In general though, for an abstract data set it might not be clear how many clusters the data should be partitioned into.
A Simple Approach to Determining $K$ - the Elbow Method
A good k-means model is one with a low inertia (or objective function), but also a low number of clusters. This is clearly a trade-off, as a larger number of clusters will naturally reduce the inertia, all the way to zero when the number of clusters equals the number of data points.
So let's look at a plot of inertia against number of clusters for this problem.
The first approach for picking $K$ is called the elbow method. What does this do?

The "elbow" describes the transition in the gradient of the plot we have just made. What we are looking for is the point where the improvement in inertia becomes minimal for the increase in complexity. For our example we see this really is between 5 and 7 clusters, so we would pick $K$ somewhere in that range.
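As a sketch of how such a plot might be produced, here is one way using scikit-learn's `KMeans` (its `inertia_` attribute is the objective $J$ above). The data set below is made up, so its elbow will land wherever its blobs dictate rather than matching the plot above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# A made-up blob data set; replace X with your own (N, D) array.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in ([0, 0], [5, 0], [2, 4], [6, 4], [3, -3], [8, 2])])

# Fit a k-means model for each candidate K and record the inertia.
Ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in Ks]

plt.plot(list(Ks), inertias, marker="o")
plt.xlabel("Number of clusters $K$")
plt.ylabel("Inertia $J$")
plt.show()
```

In practice you would run this on your own data array and read the elbow off the resulting curve.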
Conclusions
k-means is a simple and often effective clustering algorithm. As an initial step, clustering lets us explore whether there are groups in our data, informing how we might build a supervised machine learning model, or even how we should collect labels (e.g. for semi-supervised learning). Moreover it can be used in computer vision tasks such as image segmentation and, although not shown here, compression.
Determining the number of clusters is often a challenge, and typically takes some engineering judgement: we plot inertia against the number of clusters and look for the "elbow".