Kernels and support vector machines#

The problem only depends on \(x_i\cdot x_{i'}\)#

  • As with the maximal margin classifier, the problem can be reduced to finding \(\alpha_1,\dots,\alpha_n\):

\[\begin{split} \begin{align*} &\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \color{Blue}{(x_i\cdot x_{i'})}\\ &\text{subject to }\;\; 0\leq \alpha_i \leq D \;\text{ for all }i=1,\dots,n, \\ &\quad \sum_{i}\alpha_i y_i = 0. \end{align*} \end{split}\]
  • As for the maximal margin classifier, this only depends on the training inputs through the inner products \(x_i\cdot x_{i'}\) for every pair \(i,i'\).

  • Why care?


Key fact about the support vector classifier#

  • To find the hyperplane all we need to know is the dot product between any pair of input vectors:

\[K(x_i,x_k) = (x_i\cdot x_k) = \langle x_i, x_k \rangle = \sum_{j=1}^p x_{ij}x_{kj}\]

  • The matrix \(\mathbf{K}_{ik}=K(x_i,x_k)\) is called the *kernel* or *Gram* matrix.

  • To **make predictions at \(x\)** all we need to know is \(\langle x_i, x \rangle\) for each point \(x_i\) (actually only the support vectors).

  • This opens up the possibility of (richer) *nonlinear* classifiers: using basis functions \(f\), we can instead use

\[K_f(x_i, x_k) = \langle f(x_i), f(x_k) \rangle, \qquad K_f(x, x_i) = \langle f(x), f(x_i) \rangle\]


How to create non-linear boundaries?#

Fig 9.8
  • The support vector classifier can only produce a linear boundary.



  • In logistic regression, we dealt with this problem by adding transformations of the predictors.

  • The original decision boundary is a hyperplane: $\(\log \left[ \frac{P(Y=1|X)}{P(Y=0|X)} \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0.\)$

  • With a quadratic predictor, we get a quadratic boundary: $\(\log \left[ \frac{P(Y=1|X)}{P(Y=0|X)} \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2= 0.\)$


  • With a support vector classifier we can apply a similar trick.

  • The original decision boundary is the hyperplane defined by: $\( \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0.\)$

  • If we expand the set of predictors to the 3D space \((X_1,X_2,X_1^2)\), the decision boundary becomes: $\( \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2= 0.\)$

  • This is in fact a linear boundary in the 3D space. However, we can classify a point knowing just \((X_1,X_2)\). The boundary in this projection is quadratic in \(X_1\).
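
To make this concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (neither comes from these notes): a linear support vector classifier is fit on the expanded predictors \((X_1, X_2, X_1^2)\); the fitted boundary is linear in the 3D feature space but quadratic in \(X_1\) in the original \((X_1, X_2)\) coordinates.

```python
# Sketch: linear SVC on the expanded features (X1, X2, X1^2); data are synthetic.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # original predictors (X1, X2)
y = (X[:, 1] > X[:, 0] ** 2 - 1).astype(int)     # class depends quadratically on X1

Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2])   # expand to (X1, X2, X1^2)
clf = LinearSVC(C=1.0).fit(Z, y)                 # linear boundary in the 3D feature space

b0, (b1, b2, b3) = clf.intercept_[0], clf.coef_[0]
# Back in (X1, X2): the boundary b0 + b1*X1 + b2*X2 + b3*X1^2 = 0 is quadratic in X1,
# i.e. X2 = -(b0 + b1*X1 + b3*X1^2) / b2.
```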


Use polynomials!#

  • Idea: Add polynomial terms up to degree \(d\): $\(Z = (X_1, X_1^2, \dots, X_1^d, X_2, X_2^2, \dots, X_2^d, \dots, X_p, X_p^2, \dots, X_p^d).\)$

  • Does this make the computation more expensive?

  • Recall that all we need to compute is the dot product: $\(x_i \cdot x_{k} = \langle x_i,x_k\rangle = \sum_{j=1}^p x_{ij}x_{kj}.\)$

  • With the expanded set of predictors, we need: $\(z_i \cdot z_{k} = \langle z_i,z_k\rangle = \sum_{j=1}^p\sum_{\ell=1}^d x^\ell_{ij}x^\ell_{kj}.\)$

  • Apart from computing these dot products, the amount of computation is of the same order as with the linear features. (This is not quite true for logistic regression with explicit transformations, as described above.)
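
As a quick numerical check of the last two displays, a sketch in NumPy (synthetic vectors; \(p\) and \(d\) are arbitrary choices): the expanded dot product \(z_i \cdot z_k\) is just a sum over degrees of dot products of element-wise powers of the original vectors.

```python
# Sketch: z_i . z_k for the expansion Z = (X_j^l : j = 1..p, l = 1..d); synthetic vectors.
import numpy as np

rng = np.random.default_rng(1)
p, d = 5, 3
x_i, x_k = rng.normal(size=p), rng.normal(size=p)

# Explicit expansion: concatenate element-wise powers 1..d of each vector.
z_i = np.concatenate([x_i ** l for l in range(1, d + 1)])
z_k = np.concatenate([x_k ** l for l in range(1, d + 1)])

# The same quantity via the double sum over j and l, without forming z_i and z_k.
direct = sum(np.sum((x_i ** l) * (x_k ** l)) for l in range(1, d + 1))

assert np.isclose(z_i @ z_k, direct)
```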


Kernels and the kernel trick#

  • A kernel matrix of the form \(\mathbf{K}_{ik} = K(x_i,x_k) = \langle z_i, z_k \rangle\), where \(z_i\) is some representation of \(x_i\), is always symmetric and positive semi-definite, i.e. it has no negative eigenvalues (and it is positive definite when the \(z_i\) are linearly independent).

  • Theorem: If \(\mathbf{K}\) is a symmetric positive semi-definite \(n\times n\) matrix, there exist vectors \(z_1,\dots,z_n\) in some inner-product space \(\mathbf{Z}\) such that \(\mathbf{K}_{ik} = \langle z_i,z_k\rangle\).

  • We only need \(\mathbf{K}_{ij}=K(x_i,x_j)\) to fit the classifier on the training examples, but to classify a new point we also need to be able to compute \(x \mapsto K(x, x_i)\).

  • The kernel \(K:\mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}\) should be such that for any choice of \(\{x_1, \dots, x_n\}\) the matrix \(\mathbf{K}\) is positive semidefinite.

  • Such kernel functions are associated to reproducing kernel Hilbert spaces (RKHS).

  • Support vector classifiers that use this kernel trick are called *support vector machines*.


  • With a kernel function \(K\), the problem can be reduced to the optimization:

\[\begin{split} \begin{align*} \hat \alpha\;=\;&\arg\max_\alpha\;\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i\alpha_{i'}y_i y_{i'} \mathbf{K}_{ii'} \\ &\text{subject to }\; 0\leq \alpha_i \leq D \;\text{ for all }i=1,\dots,n, \\ &\quad \sum_{i=1}^n \alpha_i y_i = 0. \end{align*} \end{split}\]
  • This is the dual problem of a different optimization problem than the one we started with.

  • Predictions can be computed just as with the original linear kernel \(K(x,y)=x \cdot y\). (Details omitted.)
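
The fact that fitting needs only \(\mathbf{K}\), and prediction only \(K(x, x_i)\) for the training points, is visible in software. A minimal sketch, assuming scikit-learn and synthetic data (the linear kernel here is just for illustration; `C` plays the role of \(D\) above):

```python
# Sketch: fit an SVM from a precomputed Gram matrix only; data are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(100, 4)), rng.normal(size=(20, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

K_train = X_train @ X_train.T     # n x n Gram matrix K_ii' = <x_i, x_i'>
K_test = X_test @ X_train.T       # K(x, x_i) for each new x and each training x_i

svm = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
pred = svm.predict(K_test)        # prediction uses only inner products with training points
```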


Kernels: implicit basis expansion#

Using basis expansions:#

  • Find a mapping \(\Phi\) which expands the original set of predictors \(X_1,\dots,X_p\). For example, $\(\Phi(X) = (X_1,X_2,X_1^2)\)$

  • For each pair of samples, compute: $\(\mathbf{K}_{ik} = \langle\Phi(x_i),\Phi(x_k)\rangle.\)$

  • Predictions use \(\langle\Phi(x), \Phi(x_i) \rangle, 1 \leq i \leq n\).


Using a kernel#

  • Prove that a function \(R(\cdot,\cdot)\) is positive definite (there are many known examples). For example, the radial basis (Gaussian) kernel: $\(R(x_i,x_k) = e^{-\frac{\gamma}{2}\| x_i-x_k\|^2_2}.\)$

  • For each pair of samples, compute: $\(\mathbf{K}_{ik} = R(x_i,x_k).\)$

  • Often much easier!

  • Predictions use \(R(x, x_i), 1 \leq i \leq n\).
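
For instance, a minimal sketch of this second route (NumPy; synthetic data; \(\gamma\) chosen arbitrarily, with the \(\gamma/2\) scaling written above):

```python
# Sketch: build the Gram matrix K_ik = R(x_i, x_k) = exp(-(gamma/2) * ||x_i - x_k||^2)
# directly from the kernel, with no explicit feature expansion; data are synthetic.
import numpy as np

def rbf_gram(X, gamma=1.0):
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-0.5 * gamma * np.maximum(sq_dists, 0.0))

X = np.random.default_rng(3).normal(size=(50, 6))
K = rbf_gram(X, gamma=0.5)
# K can now be handed to any solver that only needs the Gram matrix (e.g. the dual above).
```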


A polynomial kernel with \(d=2\):#

\[\mathbf{K}_{ik} = R(x_i,x_k) = \left( 1+ \langle x_i,x_k\rangle\right)^2\]

This is equivalent to the expansion:

\[\begin{split} \begin{align*} \Phi(x) =& (1, \sqrt{2}x_1,\dots,\sqrt{2}x_p, \\ & x_1^2,\dots,x_p^2, \\ & \sqrt{2}x_1x_2,\sqrt{2}x_1x_3, \dots, \sqrt{2}x_{p-1}x_p) \end{align*} \end{split}\]
  • For a single \(x\), computing \(K(x,x_k)\) directly is \(O(p)\).

  • For a single \(x\), computing the kernel using the explicit expansion is \(O(p^2)\).
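
A quick numerical check of this equivalence (NumPy; synthetic vectors; the explicit map below includes the constant term coming from expanding \((1+\langle x, y\rangle)^2\)):

```python
# Sketch: verify that (1 + <x, y>)^2 equals the dot product of the explicit
# degree-2 expansions Phi(x) and Phi(y); vectors are synthetic.
import numpy as np
from itertools import combinations

def phi(x):
    p = len(x)
    cross = [np.sqrt(2) * x[j] * x[k] for j, k in combinations(range(p), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(4)
x, y = rng.normal(size=8), rng.normal(size=8)

kernel_value = (1.0 + x @ y) ** 2      # O(p) work
expanded_value = phi(x) @ phi(y)       # O(p^2) work
assert np.isclose(kernel_value, expanded_value)
```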


Finding kernel functions#

  • Proving that a symmetric function \(R(\cdot,\cdot)\) is positive definite (PD) is not always easy.

  • However, we can easily define PD kernels by combining those we are familiar with:

  • Sums, scalings \((>0)\), and products of PD kernels are PD.

  • Intuitively, a kernel \(K(x_i,x_k)\) defines a similarity between the samples \(x_i\) and \(x_k\).

  • This intuition can guide our choice in different problems.
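
The closure properties above can be sanity-checked numerically (this is not a proof): the sketch below, on synthetic data with arbitrarily chosen linear and RBF kernels, confirms that the sum, a positive scaling, and the element-wise (Schur) product of two Gram matrices have no eigenvalues below numerical tolerance.

```python
# Sketch (not a proof): sums, positive scalings, and element-wise products of
# PSD Gram matrices are again PSD; kernels and data here are arbitrary choices.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))

K_lin = X @ X.T                                              # linear kernel
sq = np.sum(X ** 2, axis=1)
K_rbf = np.exp(-(sq[:, None] + sq[None, :] - 2 * K_lin))     # an RBF kernel

def min_eig(K):
    return np.min(np.linalg.eigvalsh(K))

for K in (K_lin + K_rbf, 2.0 * K_lin, K_lin * K_rbf):        # sum, scaling, Schur product
    assert min_eig(K) > -1e-8
```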


Fig 9.9: Left: polynomial kernel (degree 3); Right: radial basis kernel.


Common kernels#

  • The polynomial kernel:

\[K(x_i,x_k) = \left( 1+ \langle x_i,x_k\rangle\right)^d\]
  • The radial basis kernel:

\[K(x_i,x_k) = \exp\Big(-\gamma\sum_{j=1}^p (x_{ij}-x_{kj})^2\Big)\]
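
Both kernels are available as built-in options in common software; a minimal sketch with scikit-learn (synthetic data; `degree`, `gamma`, and `C` are arbitrary placeholder values):

```python
# Sketch: SVMs with the polynomial and radial basis kernels; data are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # non-linear class boundary

poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)  # (1 + <x,x'>)^3
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)                         # exp(-gamma ||x-x'||^2)

print(poly_svm.score(X, y), rbf_svm.score(X, y))     # training accuracy of each fit
```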

Summary so far#

  • As noted above, a support vector machine is a support vector classifier applied to an expanded set of predictors, e.g. $\(\Phi: (X_1,X_2) \to (X_1,X_2,X_1X_2, X_1^2,X_2^2).\)$

  • We expand the vector of predictors for each sample \(x_i\) and then perform the algorithm.

  • We only need to know the dot products:

\[ \langle \Phi(x_i),\Phi(x_k) \rangle \equiv K(x_i,x_k)\]

for every pair of samples \((x_i,x_k)\) as well as how to compute \(\langle \Phi(x), \Phi(x_i) \rangle\) for new \(x\)’s.

  • Often, the dot product:

\[\langle \Phi(x_i),\Phi(x_k) \rangle \equiv K(x_i, x_k) \]

is a simple function \(f(x_i,x_k)\) of the original vectors, even when the mapping \(\Phi\) significantly expands the space of predictors.


Examples#

  • Example 1: Polynomial kernel

    • \[K(x_i,x_k) = (1+ \langle x_i,x_k \rangle)^2.\]
    • With two predictors, this corresponds to the mapping:

    \[\Phi: (X_1,X_2) \to (1, \sqrt{2}X_1,\sqrt{2}X_2,\sqrt{2}X_1X_2, X_1^2,X_2^2).\]
  • Example 2: RBF kernel

    • \(K(x_i,x_k) = \exp(-\gamma d( x_i,x_k )^2)\) where \(d\) is the Euclidean distance between \(x_i\) and \(x_k\).

    • In this case, the mapping \(\Phi\) is an expansion into an infinite number of transformations!

    • We can apply the method even if we don’t know what these transformations explicitly are.


Kernels for non-standard data types#

  • We can define families of kernels (with tuning parameters), which capture similarity between non-standard kinds of data:

    1. Text, strings

    2. Images

    3. Graphs

    4. Histograms

  • Sometimes we know the mapping \(\Phi\), but there are algorithms that are fast for computing \(K(x_i,x_k)\) without doing the expansion explicitly.

  • Other times, the expansion \(\Phi\) is infinite-dimensional or simply not known.


Stringdot kernel#

Suppose we want to compare two strings in a finite alphabet:

\[x_1 = ACCTATGCCATA, \qquad x_2 = AGCTAAGCATAC\]
  • Stringdot kernel: For each word \(u\) of length \(p\), we define a feature:

\[\Phi_u(x_i) = \text{# of times that }u\text{ appears in }x_i\]
  • A naive algorithm would require looping over each sequence for every subsequence \(u\) of length \(p\).

  • This would be \(O(n^2)\) steps, where \(n\) is the length of the sequences.
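
A minimal sketch of this feature map and the resulting kernel in plain Python (the strings come from the example above; the word length \(p=3\) is an arbitrary choice, and the counting below is the naive approach, not one of the faster algorithms):

```python
# Sketch: Phi_u(x) = number of times word u of length p appears in x,
# and the kernel K(x1, x2) = sum_u Phi_u(x1) * Phi_u(x2), counted by brute force.
from collections import Counter

def spectrum_features(x, p):
    return Counter(x[i:i + p] for i in range(len(x) - p + 1))

def string_kernel(x1, x2, p):
    phi1, phi2 = spectrum_features(x1, p), spectrum_features(x2, p)
    return sum(count * phi2[u] for u, count in phi1.items())

x1, x2 = "ACCTATGCCATA", "AGCTAAGCATAC"
print(string_kernel(x1, x2, p=3))
```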


Gapweight kernel#

Suppose we want to compare two strings in a finite alphabet:

\[x_1 = ACCTATGCCATA, \qquad x_2 = AGCTAAGCATAC\]

  • Gap weight kernel: For each word \(u\) of length \(p\), we define a feature:

\[\Phi_u(x_i) = \sum_{v: u \subset v \subset x_i} \lambda^{len(v)} \]

with \(0<\lambda\leq 1\).

  • The number of features can be huge! However, this can be computed in \(\mathcal O(Mp\log n )\) steps where \(M\) is the number of matches.


Applying SVMs with more than 2 classes#

  • SVMs don’t generalize nicely to the case of more than 2 classes.

  • Two main approaches:

    1. One vs. one: Construct one SVM for each of the \(\binom{K}{2}\) pairs of classes, where \(K\) is the number of classes. Apply all SVMs to a test observation and classify it to the class that wins the most one-on-one comparisons.

    2. One vs. all: For each class \(k\), construct an SVM with coefficients \(\beta^{(k)}\), coding class \(k\) as \(1\) and all other classes as \(-1\). Assign a test observation \(x\) to the class \(k^*\) for which the signed distance from \(x\) to the hyperplane defined by \(\beta^{(k^*)}\) is largest (the distance is negative if the observation is on the wrong side).
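
Both strategies have off-the-shelf implementations; a minimal sketch with scikit-learn (synthetic 3-class data; kernel and `C` are placeholder choices):

```python
# Sketch: one-vs-one and one-vs-rest SVMs on a 3-class problem; data are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(7)
centers = np.repeat(np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]), 100, axis=0)
X = rng.normal(size=(300, 2)) + centers
y = np.repeat([0, 1, 2], 100)

ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)   # one SVM per pair of classes
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)  # one SVM per class vs. the rest
print(ovo.score(X, y), ovr.score(X, y))
```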


Relationship of SVM to logistic regression#

Recall the Lagrange form of the problem.

\[\begin{split} \begin{align*} &\min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 + D\sum_{i=1}^n \epsilon_i\\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq (1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \quad \text{for all }i=1,\dots,n. \end{align*} \end{split}\]

  • Set \(D=1/\lambda\) and minimize over \(\epsilon_i\) explicitly.

  • If \(1 - y_i(\beta_0 + w \cdot x_i) \leq 0\), we can take \(\hat{\epsilon}_i=0\). Otherwise, we take $\(\hat{\epsilon}_i=1 - y_i(\beta_0 + w \cdot x_i).\)$

  • Or, $\(\hat{\epsilon}_i = \max(1 - y_i(\beta_0+w \cdot x_i), 0).\)$


  • Plugging this into the objective (and replacing \(w\) with \(\beta\)) yields

\[\begin{split} \begin{align*} &\min_{\beta_0,\beta}\;\; \sum_{i=1}^n \max\left(1 - y_i(\beta_0 + \sum_{j=1}^p \beta_j x_{ij}), 0\right) + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2 \\ \end{align*} \end{split}\]
  • The loss here, the *hinge loss* \(\max(1 - y_i(\beta_0 + \sum_{j=1}^p \beta_j x_{ij}), 0)\), is a function of the margin \(y_i(\beta_0 + \sum_{j=1}^p \beta_j x_{ij})\); the second term is a ridge penalty.

  • The loss for logistic regression is also a function of \(y_i(\beta_0 + \sum_{j=1}^p \beta_j x_{ij})\).

  • Large \(\lambda\) \(\iff\) small \(D\) \(\iff\) large \(C\).
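
To preview the comparison in Fig 9.12 numerically, a small sketch of the two losses as functions of the margin \(m = y_i(\beta_0 + \sum_j \beta_j x_{ij})\) (NumPy; the grid of margin values is arbitrary):

```python
# Sketch: hinge (SVM) loss and logistic loss as functions of the margin m = y * f(x).
import numpy as np

m = np.linspace(-3, 3, 7)              # a few margin values for illustration
hinge = np.maximum(1 - m, 0)           # SVM loss: max(1 - y f(x), 0)
logistic = np.log(1 + np.exp(-m))      # logistic regression loss (y coded as +/-1)

for row in zip(m, hinge, logistic):
    print("margin %+.1f   hinge %.3f   logistic %.3f" % row)
```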


Comparing the losses#

Fig 9.12

The kernel trick can be applied beyond SVMs#

Kernels and dot products:

  • Associated to a positive definite \(R\) is a dot product. For \(x\) in the feature space \(\mathbb{R}^p\), define \(R_x:\mathbb{R}^p \rightarrow \mathbb{R}\) by

\[ R_x(x_0) = R(x,x_0) \]
  • The kernel defines a dot product on linear combinations of the \(R_x\)’s for different \(x\)’s:

\[\left \langle \sum_j c_j R_{x_j}, \sum_i d_i R_{y_i} \right \rangle_R = \sum_{i,j} c_j d_i R(x_j,y_i) \]

and hence a length

\[ \left\|\sum_j c_j R_{x_j} \right\|^2_R = \sum_{i,j} c_i c_j R(x_i, x_j) \]

Kernel regression:

  • For tuning parameter \(\lambda\) define

\[ \hat{f}_{\lambda} = \text{argmin}_f \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|^2_K \]
  • Remarkably, it is known that

\[ \hat{f} = \sum_{i=1}^n \hat{\alpha}_i K_{X_i}. \]

for some coefficients \(\hat{\alpha}\) that depend on the data \((X_1,Y_1), \dots, (X_n,Y_n)\) (this is the representer theorem).


  • Problem reduces to finding

\[ \hat{\alpha} = \text{argmin}_{\alpha} \sum_{i=1}^n \left(Y_i - \sum_{j=1}^n \alpha_j K(X_i, X_j) \right)^2 + \lambda \sum_{l, r=1}^n \alpha_l \alpha_r K(X_l, X_r) \]
  • Finding \(\hat{\alpha}\) is just like ridge regression!

  • Similar phenomenon for logistic regression…

  • Just like smoothing splines, we solved a problem over a big space of functions!

  • Smoothing splines are a special case of the kernel trick…

  • Like support vector machines, optimization is over an \(n\)-dimensional vector, not a \(p\)-dimensional vector as in linear regression…
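
A minimal sketch of this reduction (NumPy; synthetic one-dimensional data; an RBF kernel and the closed form \(\hat{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} Y\), which solves the ridge-like problem above for \(\lambda > 0\)):

```python
# Sketch: kernel ridge regression via f_hat(x) = sum_i alpha_hat_i K(x, X_i),
# with alpha_hat = (K + lambda I)^(-1) Y; data, kernel, and lambda are arbitrary.
import numpy as np

rng = np.random.default_rng(8)
X = np.sort(rng.uniform(-3, 3, size=80))
Y = np.sin(X) + 0.2 * rng.normal(size=80)

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

lam = 0.1
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # just like ridge regression

X_new = np.linspace(-3, 3, 5)
f_hat = rbf(X_new, X) @ alpha                          # predictions at new points
```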


Kernel PCA:

  • Suppose we want to do PCA with an expanded set of predictors, defined by the mapping \(\Phi\).

  • First principal component is

\[ \hat{f}_1 = \text{argmax}_{f: \|f\|_K \leq 1} \widehat{\text{Var}}(f(X)). \]
  • Even if \(\Phi\) expands the predictors to a very high dimensional space, we can do PCA!

  • The cost only depends on the number of observations \(n\).

  • Long story short: the kernel trick lets us turn linear methods into nonlinear ones through implicit featurization.
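
To make the kernel PCA point concrete, a minimal sketch (NumPy; synthetic data; this follows the standard recipe of centering the Gram matrix and taking its top eigenvectors, with an arbitrarily chosen RBF kernel):

```python
# Sketch: kernel PCA from the Gram matrix alone, without computing Phi explicitly.
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 3))

sq = np.sum(X ** 2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # RBF Gram matrix

# Center the kernel matrix (equivalent to centering the implicit features Phi(x_i)).
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

# Top eigenvector of the centered Gram matrix gives the first principal component scores.
eigvals, eigvecs = np.linalg.eigh(Kc)
top = np.argmax(eigvals)
scores_pc1 = eigvecs[:, top] * np.sqrt(max(eigvals[top], 0.0))
```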