3. Introducing: The Kernel Trick
The Kernel Trick
📑 Learning Objectives
  • Abstractly projecting data features into higher dimensions
  • Applying linear models to projections
  • The Kernel Ridge Regression formula
  • Identifying important hyperparameters

Extracting features without getting your hands dirty


Let's consider our data problem more generally. If we have a $D$-dimensional dataset $\mathbf{x} = (x_1,\ldots,x_D)\in \mathbb{R}^D$, we can stretch it into a much higher number of dimensions $H \gg D$ so that the problem becomes more easily separable. This leaves us at a deficit on two counts:

  • Tractability: as the number of dimensions increases, fitting models involves learning more parameters, which may require vast amounts of computational resources.
  • Explainability: with a large number of dimensions $D$, it can be difficult to decide systematically how to project the features into even higher dimensions $H$.
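To make the tractability point concrete, here is a quick sketch (the `num_poly_features` helper is illustrative, not from the lesson) counting how many explicit degree-2 monomial features a $D$-dimensional input would generate:

```python
from itertools import combinations_with_replacement

def num_poly_features(D, degree=2):
    """Count the monomials of exactly `degree` built from D input features."""
    return sum(1 for _ in combinations_with_replacement(range(D), degree))

# The explicit feature count grows like D^2 for degree 2:
for D in (2, 10, 100, 1000):
    print(D, "->", num_poly_features(D))  # e.g. 1000 -> 500500
```

Even a modest degree-2 projection turns 1,000 input features into half a million projected ones; higher degrees blow up faster still.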

Enter: The Kernel Trick


Let's recall the solution to the least-squares regression problem. First of all, let's postulate a projection of our data $\boldsymbol{\phi} = \boldsymbol{\phi}(x_1, \ldots, x_D)\in\mathbb{R}^H$ into a higher-dimensional space. Our linear model from before now reads $f = \sum_{i=1}^{H}\phi_{i}\beta_{i}$ for a larger vector of parameters. The analytical solution to this least-squares problem, with a penalty on the squared parameters $\sum_i\beta_i^2$, does not depend on the projected features $\phi_i$ themselves, but only on the products between them (written $\phi_i\,\phi_j'$ for any two projections $\boldsymbol{\phi}, \boldsymbol{\phi}'$).

The so-called kernel trick is to recognise these products as the outputs of special types of functions, called "kernels". The outputs depend only on pairs of input data features $\mathbf{x}, \mathbf{x}'$:

$$k(\mathbf{x}, \mathbf{x}') := \boldsymbol{\phi}(\mathbf{x}) \cdot \boldsymbol{\phi}(\mathbf{x}')\in \mathbb{R}.$$
Projection Diagram

Diagram to show the shortcut taken by the kernel trick.
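The shortcut in the diagram can be checked numerically. As a sketch (the degree-2 polynomial kernel and the 2-D inputs are my choice of illustration, not the lesson's), the explicit projection $\boldsymbol{\phi}(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ and the kernel $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}\cdot\mathbf{x}')^2$ give identical inner products:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 projection of a 2-D point into R^3.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, xp):
    # Polynomial kernel: the same inner product without ever building phi.
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print(np.dot(phi(x), phi(xp)))  # explicit route: project, then take the dot product
print(k(x, xp))                 # kernel shortcut: dot product, then square
# both print 1.0
```

The kernel route never materialises the 3-dimensional projection; it works entirely with the original 2-D inputs.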

Making careful choices of the kernel function $k$ avoids the need to ever explicitly write down $\boldsymbol{\phi}$, which for very large $H$ may be very inconvenient to do.

The higher projection dimension is always treated theoretically: with the kernel trick, we can evaluate models in the projected space via scalar products alone. In this way, we can formalise the method to handle models in infinite-dimensional spaces ($H=\infty$) without ever having to compute in these spaces explicitly.
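For instance, the widely used RBF kernel corresponds to an infinite-dimensional projection, yet kernel ridge regression with it only ever computes an $N \times N$ matrix of kernel values between the $N$ training points. A minimal sketch, assuming scikit-learn is available (the dataset and hyperparameter values here are arbitrary illustrations):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# The RBF kernel's projection phi is infinite-dimensional (H = infinity),
# but fitting only touches the 80x80 matrix of pairwise kernel values.
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)
model.fit(X, y)

print(model.predict([[0.0]]))  # the fitted curve approximates sin(x) near 0
```

The `alpha` and `gamma` values are exactly the kind of hyperparameters flagged in the learning objectives: `alpha` weights the penalty on the squared parameters, and `gamma` sets the width of the RBF kernel.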

Next Lesson
4. Choosing Between Kernel Functions