Support vector classifier#

  • Problem: It is not always possible to separate the points using a hyperplane.

  • Support vector classifier: a relaxation of the maximal margin classifier.

  • Allows some points to be on the wrong side of the margin, or even of the hyperplane, by introducing a slack variable \(\epsilon_i\) for each observation.


Fig 9.6

  • A new optimization problem:

\[\begin{split} \begin{align*} &\max_{\beta_0,\beta,\epsilon}\;\; M\\ &\text{subject to } \sum_{j=1}^p \beta_j^2 = 1,\\ &y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip}) \geq M(1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \;\text{for all }i=1,\dots,n, \quad \sum_{i=1}^n \epsilon_i \leq C. \end{align*} \end{split}\]
  • The quantity \(y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip})\) is the signed distance of \(x_i\) from the hyperplane (positive when \(x_i\) is on the correct side), thanks to the constraint \(\sum_{j=1}^p \beta_j^2 = 1\).

  • \(M\) is the width of the margin on either side of the hyperplane.

  • \(\epsilon=(\epsilon_1,\dots,\epsilon_n)\) are called slack variables

  • \(C\) is called the budget: it bounds the total slack \(\sum_i \epsilon_i\) across all observations (the sketch after this list recovers the individual slacks from a fitted classifier).
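
In practice the slacks do not need to be computed by hand; they can be read off a fitted linear classifier. The sketch below is an illustration only, using scikit-learn's `SVC` with a linear kernel on an assumed synthetic dataset: the slack of observation \(i\) is \(\epsilon_i = \max(0,\, 1 - y_i f(x_i))\), where \(f\) is the fitted decision function. (This matches the reformulated problem further down, where the margin is rescaled to 1; scikit-learn's `C` argument is a penalty, not the budget above.)

```python
# Illustrative sketch: recover the slack values eps_i from a fitted linear SVC.
# The dataset and parameter values are assumptions, not taken from these notes.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1                               # recode labels as -1 / +1

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # sklearn's C is a penalty (see the reformulation below)
f = clf.decision_function(X)                  # fitted f(x_i) = beta_0 + w . x_i
eps = np.maximum(0.0, 1.0 - y * f)            # slack: 0 for points outside the margin

print(f"observations with nonzero slack : {np.sum(eps > 1e-8)}")
print(f"observations past the hyperplane: {np.sum(y * f < 0)}")
print(f"total slack (bounded by the budget in the original form): {eps.sum():.2f}")
```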


Tuning the budget, \(C\) (high to low)#

Fig 9.7

If the budget is too low, we tend to overfit#

Fig 9.5
  • Maximal margin classifier, \(C=0\).

  • Adding one observation can dramatically change the classifier (a small sensitivity sketch follows).
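
A minimal way to see this sensitivity numerically (an illustration on assumed synthetic data, not a reproduction of Fig 9.5): fit a nearly hard-margin classifier, mimicking the \(C=0\) budget with a very large penalty, and a soft-margin one, then add a single observation near the other class and compare how much the fitted coefficients move.

```python
# Illustrative sketch: a near-hard-margin fit (huge penalty ~ budget C = 0) reacts
# strongly to one added observation, while a softer fit barely moves.
# Data and parameter values are assumptions for illustration only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=1)
y = 2 * y01 - 1

# One extra +1 observation placed between the two classes, closer to the -1 cloud
x_new = X[y == 1].mean(axis=0) * 0.4 + X[y == -1].mean(axis=0) * 0.6
X2 = np.vstack([X, x_new])
y2 = np.append(y, 1)

for penalty, label in [(1e6, "near hard margin"), (0.1, "soft margin")]:
    w_before = SVC(kernel="linear", C=penalty).fit(X, y).coef_.ravel()
    w_after = SVC(kernel="linear", C=penalty).fit(X2, y2).coef_.ravel()
    shift = np.linalg.norm(w_after - w_before) / np.linalg.norm(w_before)
    print(f"{label:16s}: relative change in coefficients = {shift:.2f}")
```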


Finding the support vector classifier#

We can reformulate the problem by defining a vector \(w=(w_1,\dots,w_p) = \beta/M\) (and rescaling \(\beta_0\) by \(1/M\) as well). Dividing the margin constraint by \(M\) turns \(y_i(\beta_0+\beta\cdot x_i) \geq M(1-\epsilon_i)\) into \(y_i(\beta_0+w\cdot x_i) \geq 1-\epsilon_i\), and since \(\|w\| = \|\beta\|/M = 1/M\), maximizing \(M\) is the same as minimizing \(\tfrac{1}{2}\|w\|^2\). Replacing the budget constraint on \(\sum_i \epsilon_i\) by a penalty in the objective gives:

\[\begin{split} \begin{align*} &\min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 + D\sum_{i=1}^n \epsilon_i\\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq (1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \quad \text{for all }i=1,\dots,n. \end{align*} \end{split}\]

The penalty \(D\geq 0\) plays a role similar to the budget \(C\), but is inversely related to it: a large \(D\) penalizes slack heavily (a small budget), while a small \(D\) tolerates many violations (a large budget).
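
The reformulated problem is a convex quadratic program, so it can be handed to a generic solver exactly as written. The sketch below is illustrative only: it uses cvxpy (an assumed tool choice) and an assumed synthetic dataset, with an arbitrary value of \(D\).

```python
# Illustrative sketch: solve the reformulated soft-margin problem with a generic
# convex solver (cvxpy). D and the data are assumptions for illustration only.
import cvxpy as cp
import numpy as np
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1
n, p = X.shape
D = 1.0                                     # penalty on total slack (inversely related to the budget)

w = cp.Variable(p)
beta0 = cp.Variable()
eps = cp.Variable(n, nonneg=True)           # slack variables, eps_i >= 0

objective = cp.Minimize(0.5 * cp.sum_squares(w) + D * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + beta0) >= 1 - eps]
cp.Problem(objective, constraints).solve()

print("w     :", np.round(w.value, 3))
print(f"beta0 : {beta0.value:.3f}")
print(f"total slack: {eps.value.sum():.3f}")
```

Setting \(D\) equal to scikit-learn's `C` argument should reproduce `SVC(kernel="linear", C=D)` up to solver tolerance, which is one way to see that the library's `C` is this penalty rather than the budget above.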



Introducing KKT (Karush-Kuhn-Tucker) multipliers \(\alpha_i\geq 0\) and \(\mu_i\geq 0\) for the two sets of constraints, this is equivalent to the saddle-point problem:


\[\begin{split} \begin{align*} &\max_{\alpha,\mu}\; \min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1+\epsilon_i] + \sum_{i=1}^n (D-\mu_i)\epsilon_i\\ &\text{subject to } \;\;\alpha_i \geq 0, \mu_i\geq 0 , \quad \text{ for all }i=1,\dots,n. \end{align*} \end{split}\]
  • Setting the derivatives of the Lagrangian with respect to \(w\), \(\beta_0\), and \(\epsilon\) to 0, we obtain:

\[\hat w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n\alpha_i y_i = 0, \qquad \mu_i=D-\alpha_i.\]

  • Furthermore, the KKT complementary-slackness conditions imply that \(\alpha_i>0\) only if \(y_i(\beta_0 +w\cdot x_i)\leq 1\), that is, only if \(x_i\) lies on the margin or on its wrong side. These observations are the support vectors (checked numerically in the sketch below).
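
The stationarity and complementary-slackness conditions can be checked on a fitted model. In scikit-learn's `SVC`, `dual_coef_` stores \(\alpha_i y_i\) for the support vectors, so \(\hat w=\sum_i \alpha_i y_i x_i\) can be rebuilt by hand and compared with `coef_`. A minimal sketch on an assumed synthetic dataset:

```python
# Illustrative sketch: verify w_hat = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0
# from the dual coefficients of a fitted linear SVC. Data are an assumption.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_.ravel()            # alpha_i * y_i for the support vectors (alpha_i = 0 elsewhere)
w_hat = alpha_y @ clf.support_vectors_      # w_hat = sum_i alpha_i y_i x_i

print("w_hat matches coef_        :", np.allclose(w_hat, clf.coef_.ravel()))
print("sum_i alpha_i y_i is 0     :", np.isclose(alpha_y.sum(), 0.0))
print("number of support vectors  :", clf.support_.size, "of", len(y))

# Support vectors are the points on the margin or on its wrong side:
f = clf.decision_function(X)
print("all support vectors satisfy y_i f(x_i) <= 1:",
      bool(np.all(y[clf.support_] * f[clf.support_] <= 1 + 1e-6)))
```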


Support vectors#

Fig 9.7