3. Introducing: The Kernel Trick
The Kernel Trick
📑 Learning Objectives
  • Abstractly projecting data features into higher dimensions
  • Applying linear models to projections
  • The Kernel Ridge Regression formula
  • Identifying important hyperparameters

Extracting features without getting your hands dirty

Let's consider our data problem more generally. If we have a $D$-dimensional dataset $\mathbf{x} = (x_1, \ldots, x_D) \in \mathbb{R}^D$, we can stretch it into a much higher number of dimensions $H \gg D$ so that the problem becomes more easily separable. This leaves us at a deficit on two accounts:

  • Tractability: as the number of dimensions increases, fitting models involves learning more parameters, which may require vast amounts of computational resources.
  • Explainability: with a large number of dimensions $D$, it can be difficult to decide systematically how to project the features into even higher dimensions $H$.
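To make the tractability point concrete, here is a small sketch (using scikit-learn's `PolynomialFeatures` purely for illustration; the specific numbers are not from the lesson) of how quickly the projected dimension $H$ grows when we expand $D = 10$ features into all polynomial terms up to a given degree:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A small dataset with D = 10 features.
X = np.random.default_rng(0).normal(size=(5, 10))

# Explicitly projecting into all polynomial terms up to a given degree
# makes the number of dimensions H grow combinatorially.
for degree in (2, 3, 4):
    H = PolynomialFeatures(degree=degree).fit_transform(X).shape[1]
    print(f"degree {degree}: H = {H}")
```

Already at degree 4, ten features balloon into over a thousand dimensions, and the growth is combinatorial in both $D$ and the degree.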

Enter: The Kernel Trick

Let's recount the solution to the least-squares regression problem. But first of all, let's postulate a projection of our data $\boldsymbol{\phi} = \boldsymbol{\phi}(x_1, \ldots, x_D) \in \mathbb{R}^H$ into a higher-dimensional space. Our linear model from before now reads $f = \sum_{i=1}^{H} \phi_i \beta_i$ for a larger vector of parameters. The analytical solution to the problem, regularised by minimising the sum of squared parameters $\sum_i \beta_i^2$, does not depend on the projected features $\phi_i$ themselves, but only on the products between them (written as $\phi_i \times \phi_j'$ for any two projections $\boldsymbol{\phi}, \boldsymbol{\phi}'$).

The so-called kernel trick is to recognise these products as the outputs of a special type of function, called a "kernel", whose output depends only on pairs of input data features $\mathbf{x}, \mathbf{x}'$:

$$k(\mathbf{x}, \mathbf{x}') := \boldsymbol{\phi}(\mathbf{x}) \cdot \boldsymbol{\phi}(\mathbf{x}') \in \mathbb{R}.$$
[Projection Diagram: the shortcut taken by the kernel trick.]
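We can check this definition numerically. As a toy example (not taken from the lesson), take $D = 2$ with the explicit degree-2 projection $\boldsymbol{\phi}(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the corresponding kernel is simply $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}')^2$, evaluated without ever building $\boldsymbol{\phi}$:

```python
import numpy as np

# Explicit degree-2 projection for D = 2: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

# The corresponding kernel never constructs phi: k(x, x') = (x . x')^2.
def k(x, xp):
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# Both routes give the same scalar product in the projected space.
print(phi(x) @ phi(xp))   # explicit projection, then dot product
print(k(x, xp))           # kernel shortcut
```

Expanding $\phi(\mathbf{x}) \cdot \phi(\mathbf{x}') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2$ confirms the two routes agree by construction.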

Making careful choices of the kernel function $k$ avoids the need to ever explicitly write down $\boldsymbol{\phi}$, which for very large $H$ may be very inconvenient to do.
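Concretely, the solution to the regularised least-squares problem above can be stated entirely in terms of kernel evaluations. Writing $\lambda$ for the regularisation strength, $\mathbf{y}$ for the training targets, and $K$ for the $N \times N$ matrix of pairwise kernels between training points, the standard dual form of the kernel ridge regression solution is:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}, \qquad K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m), \qquad f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \, k(\mathbf{x}_n, \mathbf{x}).$$

Every quantity here involves only $k$ evaluated on pairs of data points; $\boldsymbol{\phi}$ never appears.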

The projected space is only ever treated theoretically: with the kernel trick, we can evaluate models in it via scalar products alone. In this way, we can formalise methods that work in infinite-dimensional spaces ($H = \infty$) without ever having to compute in these spaces explicitly.
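As a sketch of how this looks in practice (assuming scikit-learn's `KernelRidge`; the lesson's own walkthroughs may use different tooling), we can fit a non-linear 1-D regression with an RBF kernel, whose projected space is infinite-dimensional. The hyperparameters `alpha` (regularisation strength) and `gamma` (kernel width) are the important knobs to tune:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# A noisy 1-D non-linear regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# KernelRidge with an RBF kernel works in an infinite-dimensional
# projected space, yet only ever evaluates k(x, x') between data pairs.
# alpha controls the regularisation strength, gamma the kernel width.
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```

The specific values of `alpha` and `gamma` here are illustrative; in practice they would be chosen by cross-validation.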
