Section 2
Machine Learning Workflow
8. Section Overview
05:12 (Preview)
9. Data Preparation, Representation and Model Selection
30:35 (Preview)
10. Loss Functions
23:12
11. Training, Optimisation & Hyper-parameter Tuning
17:37
12. Exploring If Models are Any Good - Training Curves
21:47
13. Is My Model any Good - Validation Plots
13:04 (Preview)
Section 3
General Linear Models
14. Section Overview
04:25 (Preview)
15. The Foundations of ML - Curve Fitting
21:39
16. Regression Walkthrough
23:47
17. Underfitting and Overfitting
21:50
18. Controlling ML Models - Regularisation in Practise
16:28
19. Exploring Simple Relationships - Correlations
11:28
20. Finding Nonlinear Relations - Predictive Power Score
11:09
21. Correlation & PPS Walkthrough 📂
24:27
22. From Regression to Classification - Logistic Regression
20:27
23. Logistic Regression in Wild
15:48
24. Looking through the Right Lens - Principle Component Analysis
16:46
25. Looking in the Right Direction - Exploring PCA
21:23
26. Conclusion, Certificate, and What Next?
02:41
7. Trying out Clustering with K-Means
Introduction to Machine Learning
📂 Please register or log in to download resources

Trying out Clustering with K-Means


In this walkthrough we can are going to explore K-means on a crime data set across all US states. Which states have similar crime profiles?

Import Libraries

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

Import Data Set and have a look

Let us have a look at the data.

df = pd.read_csv("data_arrest.csv", delimiter=",", index_col=0)

df.head()
academy.digilab.co.uk

Rescaling data is important when using K means. This is because the different inputs can have different scales. Hence if not rescaled some variables will have a greater influence than others.

Here we use the normal MinMaxScaler() which rescales the data from 0 to 1 for each input variable.

from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()

hat_df = crime_rates_standardized = pd.DataFrame(minmax.fit_transform(df))

Clustering with K Means

Here we now fit a K Means model, and we try out different values of K (the number of clusters). In each case we can calculate the inertia (the total distance from the data to the centroid of the cluster it is assigned).

from sklearn.cluster import KMeans

KMeans()
plt.figure(figsize=(8, 6))
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init='auto',random_state = 123).fit(hat_df)

    inertia.append(kmeans.inertia_) # criterion based on which K-means clustering works

We can then plot the 'elbow' curve and see a good choice of number of clusters

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
academy.digilab.co.uk

Looking at the elbow curve we see that K = 4 seems a good choice. So let us fit that model again, and look at the results.

model = KMeans(n_clusters = 4, init = 'k-means++', n_init = 'auto', random_state = 42)

clusters = model.fit_predict(hat_df)

scatter = plt.scatter(df['Murder'], df['Assault'], c=clusters)

plt.xlabel('Murders Rates')
plt.ylabel('Assaults Rates')
plt.show()

There seems like a clustering which divided by the amount of crime. We see a clear correlation between murder rates and assault rates (no really surprise here). The clustering is clear visible along this relationship.

Now let plot this data set on a map. We do this using plotly which you can install using

pip install plotly-express

First we need to convert state names into the abbrevation of two letters, and then add them to the data set along with the assigned cluster numbers.

import plotly.express as px # You can install this using `pip install plotly-express`

us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}

abbrev_state = []
for state in df.index:
    abbrev_state.append(us_state_to_abbrev[state])

df['state_code'] = abbrev_state
df['clusters'] = clusters

df.head()
academy.digilab.co.uk

We can now plot this map using plotly-express

fig = px.choropleth(df,
                    locations='state_code',
                    locationmode="USA-states",
                    scope="usa",
                    color='clusters',
                    color_continuous_scale="Viridis_r",
                    )
fig.show()
academy.digilab.co.uk
Next Lesson
8. Section Overview