Support vector classifier#

  • Problem: It is not always possible to separate the points using a hyperplane.

  • Support vector classifier: a relaxation of the maximal margin classifier.

  • Allows some points to be on the wrong side of the margin, or even of the hyperplane, by introducing a slack variable \(\epsilon_i\) for each observation.


Fig 9.6

  • A new optimization problem:

\[\begin{split} \begin{align*} &\max_{\beta_0,\beta,\epsilon}\;\; M\\ &\text{subject to } \sum_{j=1}^p \beta_j^2 = 1,\\ &y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip}) \geq M(1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \;\text{for all }i=1,\dots,n, \quad \sum_{i=1}^n \epsilon_i \leq C. \end{align*} \end{split}\]
  • The quantity \(y_i(\beta_0 +\beta_1x_{i1}+\dots+\beta_p x_{ip})\) is the signed distance of \(x_i\) from the hyperplane (positive when \(x_i\) is on the correct side), thanks to the constraint \(\sum_{j=1}^p \beta_j^2 = 1\).

  • \(M\) is the width of the margin on either side of the hyperplane.

  • \(\epsilon=(\epsilon_1,\dots,\epsilon_n)\) are called slack variables

  • \(C\) is called the budget: it bounds the total slack \(\sum_i \epsilon_i\) across all observations (the sketch after this list recovers the individual slacks from a fitted classifier).
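
In practice the slacks do not need to be computed by hand; they can be read off a fitted linear classifier. The sketch below is an illustration only, using scikit-learn's `SVC` with a linear kernel on an assumed synthetic dataset: the slack of observation \(i\) is \(\epsilon_i = \max(0,\, 1 - y_i f(x_i))\), where \(f\) is the fitted decision function. (This matches the reformulated problem further down, where the margin is rescaled to 1; scikit-learn's `C` argument is a penalty, not the budget above.)

```python
# Illustrative sketch: recover the slack values eps_i from a fitted linear SVC.
# The dataset and parameter values are assumptions, not taken from these notes.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1                               # recode labels as -1 / +1

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # sklearn's C is a penalty (see the reformulation below)
f = clf.decision_function(X)                  # fitted f(x_i) = beta_0 + w . x_i
eps = np.maximum(0.0, 1.0 - y * f)            # slack: 0 for points outside the margin

print(f"observations with nonzero slack : {np.sum(eps > 1e-8)}")
print(f"observations past the hyperplane: {np.sum(y * f < 0)}")
print(f"total slack (bounded by the budget in the original form): {eps.sum():.2f}")
```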


Tuning the budget, \(C\) (high to low)#

Fig 9.7

If the budget is too low, we tend to overfit#

Fig 9.5
  • Maximal margin classifier, \(C=0\).

  • Adding one observation can dramatically change the classifier (a small sensitivity sketch follows).
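
A minimal way to see this sensitivity numerically (an illustration on assumed synthetic data, not a reproduction of Fig 9.5): fit a nearly hard-margin classifier, mimicking the \(C=0\) budget with a very large penalty, and a soft-margin one, then add a single observation near the other class and compare how much the fitted coefficients move.

```python
# Illustrative sketch: a near-hard-margin fit (huge penalty ~ budget C = 0) reacts
# strongly to one added observation, while a softer fit barely moves.
# Data and parameter values are assumptions for illustration only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=1)
y = 2 * y01 - 1

# One extra +1 observation placed between the two classes, closer to the -1 cloud
x_new = X[y == 1].mean(axis=0) * 0.4 + X[y == -1].mean(axis=0) * 0.6
X2 = np.vstack([X, x_new])
y2 = np.append(y, 1)

for penalty, label in [(1e6, "near hard margin"), (0.1, "soft margin")]:
    w_before = SVC(kernel="linear", C=penalty).fit(X, y).coef_.ravel()
    w_after = SVC(kernel="linear", C=penalty).fit(X2, y2).coef_.ravel()
    shift = np.linalg.norm(w_after - w_before) / np.linalg.norm(w_before)
    print(f"{label:16s}: relative change in coefficients = {shift:.2f}")
```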


Finding the support vector classifier#

We can reformulate the problem by defining a vector \(w=(w_1,\dots,w_p) = \beta/M\) (and rescaling \(\beta_0\) by \(1/M\) as well). Dividing the margin constraint by \(M\) turns \(y_i(\beta_0+\beta\cdot x_i) \geq M(1-\epsilon_i)\) into \(y_i(\beta_0+w\cdot x_i) \geq 1-\epsilon_i\), and since \(\|w\| = \|\beta\|/M = 1/M\), maximizing \(M\) is the same as minimizing \(\tfrac{1}{2}\|w\|^2\). Replacing the budget constraint on \(\sum_i \epsilon_i\) by a penalty in the objective gives:

\[\begin{split} \begin{align*} &\min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 + D\sum_{i=1}^n \epsilon_i\\ &\text{subject to } \\ &y_i(\beta_0 +w\cdot x_i) \geq (1-\epsilon_i) \quad \text{ for all }i=1,\dots,n,\\ &\epsilon_i \geq 0 \quad \text{for all }i=1,\dots,n. \end{align*} \end{split}\]

The penalty \(D\geq 0\) plays a role similar to the budget \(C\), but is inversely related to it: a large \(D\) penalizes slack heavily (a small budget), while a small \(D\) tolerates many violations (a large budget).
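
The reformulated problem is a convex quadratic program, so it can be handed to a generic solver exactly as written. The sketch below is illustrative only: it uses cvxpy (an assumed tool choice) and an assumed synthetic dataset, with an arbitrary value of \(D\).

```python
# Illustrative sketch: solve the reformulated soft-margin problem with a generic
# convex solver (cvxpy). D and the data are assumptions for illustration only.
import cvxpy as cp
import numpy as np
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1
n, p = X.shape
D = 1.0                                     # penalty on total slack (inversely related to the budget)

w = cp.Variable(p)
beta0 = cp.Variable()
eps = cp.Variable(n, nonneg=True)           # slack variables, eps_i >= 0

objective = cp.Minimize(0.5 * cp.sum_squares(w) + D * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + beta0) >= 1 - eps]
cp.Problem(objective, constraints).solve()

print("w     :", np.round(w.value, 3))
print(f"beta0 : {beta0.value:.3f}")
print(f"total slack: {eps.value.sum():.3f}")
```

Setting \(D\) equal to scikit-learn's `C` argument should reproduce `SVC(kernel="linear", C=D)` up to solver tolerance, which is one way to see that the library's `C` is this penalty rather than the budget above.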



Introducing KKT (Karush-Kuhn-Tucker) multipliers \(\alpha_i\geq 0\) and \(\mu_i\geq 0\) for the two sets of constraints, this is equivalent to the saddle-point problem:


\[\begin{split} \begin{align*} &\max_{\alpha,\mu}\; \min_{\beta_0,w,\epsilon}\;\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(\beta_0 +w\cdot x_i)-1+\epsilon_i] + \sum_{i=1}^n (D-\mu_i)\epsilon_i\\ &\text{subject to } \;\;\alpha_i \geq 0, \mu_i\geq 0 , \quad \text{ for all }i=1,\dots,n. \end{align*} \end{split}\]
  • Setting the derivatives of the Lagrangian with respect to \(w\), \(\beta_0\), and \(\epsilon\) to 0, we obtain:

\[\hat w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n\alpha_i y_i = 0, \qquad \mu_i=D-\alpha_i.\]

  • Furthermore, the KKT complementary-slackness conditions imply that \(\alpha_i>0\) only if \(y_i(\beta_0 +w\cdot x_i)\leq 1\), that is, only if \(x_i\) lies on the margin or on its wrong side. These observations are the support vectors (checked numerically in the sketch below).
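
The stationarity and complementary-slackness conditions can be checked on a fitted model. In scikit-learn's `SVC`, `dual_coef_` stores \(\alpha_i y_i\) for the support vectors, so \(\hat w=\sum_i \alpha_i y_i x_i\) can be rebuilt by hand and compared with `coef_`. A minimal sketch on an assumed synthetic dataset:

```python
# Illustrative sketch: verify w_hat = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0
# from the dual coefficients of a fitted linear SVC. Data are an assumption.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y01 - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_.ravel()            # alpha_i * y_i for the support vectors (alpha_i = 0 elsewhere)
w_hat = alpha_y @ clf.support_vectors_      # w_hat = sum_i alpha_i y_i x_i

print("w_hat matches coef_        :", np.allclose(w_hat, clf.coef_.ravel()))
print("sum_i alpha_i y_i is 0     :", np.isclose(alpha_y.sum(), 0.0))
print("number of support vectors  :", clf.support_.size, "of", len(y))

# Support vectors are the points on the margin or on its wrong side:
f = clf.decision_function(X)
print("all support vectors satisfy y_i f(x_i) <= 1:",
      bool(np.all(y[clf.support_] * f[clf.support_] <= 1 + 1e-6)))
```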


Support vectors#

Fig 9.7