Linear Discriminant Analysis (LDA)#

  • Strategy: Instead of estimating \(P(Y\mid X)\) directly, we could estimate:

    1. \(\hat P(X \mid Y)\): Given the response, what is the distribution of the inputs?

    2. \(\hat P(Y)\): How likely is each of the categories?

  • Then, we use Bayes rule to obtain the estimate:

\[\begin{split} \begin{aligned} \hat P(Y = k \mid X = x) &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\hat P(X=x)} \\ &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\sum_{j=1}^K\hat P(X = x \mid Y=j) \hat P(Y=j)} \end{aligned} \end{split}\]

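As a numerical illustration of this Bayes-rule step, here is a minimal sketch in Python; the class-conditional density and prior values below are hypothetical, chosen only to show the normalization.

```python
import numpy as np

# Hypothetical estimates for K = 3 classes at a fixed input x.
f_hat = np.array([0.05, 0.20, 0.10])     # \hat P(X = x | Y = k), k = 1, 2, 3
pi_hat = np.array([0.5, 0.3, 0.2])       # \hat P(Y = k)

numerator = f_hat * pi_hat               # \hat P(X = x | Y = k) \hat P(Y = k)
posterior = numerator / numerator.sum()  # divide by \hat P(X = x) = sum over classes
print(posterior)                         # \hat P(Y = k | X = x) for each class
```
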
LDA: multivariate normal with equal covariance#

  • LDA is the special case of the above strategy when \(P(X \mid Y=k) = N(\mu_k, \mathbf\Sigma)\).

  • That is, within each class the features have multivariate normal distribution with center depending on the class and common covariance \(\mathbf\Sigma\).

  • The probabilities \(P(Y=k)\) are estimated by the fraction of training samples of class \(k\).

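For a concrete sense of the model, here is a minimal simulation sketch using scikit-learn's `LinearDiscriminantAnalysis` (an assumption; any LDA implementation would do). Two classes are drawn from normals with different centers and a common covariance, matching the LDA assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes: different centers mu_k, common (identity) covariance.
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[2, 2], scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.priors_)               # estimated P(Y = k): class fractions
print(lda.means_)                # estimated class centers mu_k
print(lda.predict_proba(X[:3]))  # estimated P(Y = k | X = x)
```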

Decision boundaries#

Fig. 17 Density contours and decision boundaries for LDA with three classes (Fig 4.6).#


LDA has (piecewise) linear decision boundaries#

Suppose that:

  1. We know \(P(Y=k) = \pi_k\) exactly.

  2. \(P(X=x \mid Y=k)\) is Multivariate Normal with density:

\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  • Above, \(\mu_k\) is the mean of the inputs for category \(k\), and \(\mathbf\Sigma\) is the covariance matrix (common to all categories).

Then, what is the Bayes classifier?


  • By Bayes rule, the probability of category \(k\), given the input \(x\), is:

\[P(Y=k \mid X=x) = \frac{f_k(x) \pi_k}{P(X=x)}\]
  • The denominator does not depend on the response \(k\), so we can absorb it into a constant \(C\):

\[P(Y=k \mid X=x) = C \times f_k(x) \pi_k\]
  • Now, expanding \(f_k(x)\):

\[P(Y=k \mid X=x) = \frac{C\pi_k}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
  • Let’s absorb everything that does not depend on \(k\) into a constant \(C'\):

\[P(Y=k \mid X=x) = C'\pi_k e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]

  • Take the logarithm of both sides:

\[\log P(Y=k \mid X=x) = \color{red}{\log C'} + \color{blue}{\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}.\]

  • The term \(\color{red}{\log C'}\) is the same for every category \(k\), so it does not affect which class comes out on top.

  • We want to find the maximum of this expression over \(k\).


  • The goal is to maximize the following over \(k\):

\[\begin{split} \begin{aligned} &\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)\\ =&\log \pi_k - \frac{1}{2}\left[x^T \mathbf{\Sigma}^{-1}x + \mu_k^T \mathbf{\Sigma}^{-1}\mu_k\right] + x^T \mathbf{\Sigma}^{-1}\mu_k \\ =& C'' + \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k \end{aligned} \end{split}\]

  • Here \(C'' = -\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x\) does not depend on \(k\), so it can be dropped.

  • We define the objectives (called discriminant functions):

\[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k\]

  • At an input \(x\), we predict the response \(k\) with the highest \(\delta_k(x)\) (a small numerical sketch follows).

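A minimal sketch of these discriminant functions in code; the means, priors, and covariance below are hypothetical, not estimated from data.

```python
import numpy as np

def delta(x, mu_k, Sigma_inv, pi_k):
    # delta_k(x) = log pi_k - (1/2) mu_k' Sigma^{-1} mu_k + x' Sigma^{-1} mu_k
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

# Hypothetical parameters: K = 2 classes, p = 2 predictors.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
pis = [0.6, 0.4]
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 1.0]]))

x = np.array([1.2, 0.8])
scores = [delta(x, mu, Sigma_inv, pi) for mu, pi in zip(mus, pis)]
print(np.argmax(scores))  # predict the class k with the largest delta_k(x)
```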

Decision boundaries#

  • What are the decision boundaries? They are the sets of points \(x\) at which two classes do equally well (i.e. the discriminant functions of the two classes agree at \(x\)):

\[\begin{split} \begin{aligned} \delta_k(x) &= \delta_\ell(x) \\ \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k & = \log \pi_\ell - \frac{1}{2}\mu_\ell^T \mathbf{\Sigma}^{-1}\mu_\ell + x^T \mathbf{\Sigma}^{-1}\mu_\ell \end{aligned} \end{split}\]
  • This is a linear equation in \(x\), so the boundary between classes \(k\) and \(\ell\) is a hyperplane (a line when \(p=2\)); see the sketch below.

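To see the linearity concretely, the difference \(\delta_k(x) - \delta_\ell(x)\) has the form \(a^T x + b\). A minimal sketch with hypothetical parameters:

```python
import numpy as np

Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 1.0]]))
mu_k, mu_l = np.array([0.0, 0.0]), np.array([2.0, 2.0])
pi_k, pi_l = 0.6, 0.4

# delta_k(x) - delta_l(x) = a @ x + b, so the boundary {x : a @ x + b = 0}
# is a line here (a hyperplane in general).
a = Sigma_inv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l)
print(a, b)
```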

Decision boundaries revisited#

Fig. 18 Density contours and decision boundaries for LDA with three classes (Fig 4.6).#


Estimating \(\pi_k\)#

\[\hat \pi_k = \frac{\#\{i\;;\;y_i=k\}}{n}\]
  • In English: the fraction of training samples of class \(k\).

Estimating the parameters of \(f_k(x)\)#

Estimate the center of each class \(\mu_k\):#

\[\hat\mu_k = \frac{1}{\#\{i\;;\;y_i=k\}}\sum_{i\;;\; y_i=k} x_i\]
  • Estimate the common covariance matrix \(\mathbf{\Sigma}\):

  • One predictor (\(p=1\)):

\[\hat \sigma^2 = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)^2.\]
  • Many predictors (\(p>1\)): Compute the vectors of deviations \((x_1 -\hat \mu_{y_1}),(x_2 -\hat \mu_{y_2}),\dots,(x_n -\hat \mu_{y_n})\) and use their sample covariance, dividing by \(n-K\) as above, as the unbiased estimate of \(\mathbf\Sigma\) (see the sketch below).

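Putting the estimates together, here is a minimal from-scratch sketch, assuming `X` is an \((n, p)\) array of inputs and `y` an array of class labels.

```python
import numpy as np

def estimate_lda_parameters(X, y):
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])         # class fractions
    mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])  # class centers
    # Pooled covariance: deviations of each x_i from its own class center,
    # divided by n - K for an unbiased estimate.
    dev = X - mu_hat[np.searchsorted(classes, y)]
    Sigma_hat = dev.T @ dev / (n - K)
    return classes, pi_hat, mu_hat, Sigma_hat
```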

Final decision rule#

  • For an input \(x\), predict the class with the largest:

\[\hat\delta_k(x) = \log \hat\pi_k - \frac{1}{2}\hat\mu_k^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k + x^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k\]
  • The decision boundaries are defined by \(\left\{x: \hat\delta_k(x) = \hat\delta_{\ell}(x) \right\}\) for \(1 \leq k < \ell \leq K\).
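
As a final sketch, the decision rule with the estimated quantities, reusing the hypothetical `estimate_lda_parameters` helper from the previous sketch:

```python
import numpy as np

def lda_predict(X_new, classes, pi_hat, mu_hat, Sigma_hat):
    Sigma_inv = np.linalg.inv(Sigma_hat)
    # Matrix of discriminant scores: rows are inputs, columns are classes.
    scores = (X_new @ Sigma_inv @ mu_hat.T
              - 0.5 * np.sum((mu_hat @ Sigma_inv) * mu_hat, axis=1)
              + np.log(pi_hat))
    # Predict the class with the largest \hat\delta_k(x) for each input.
    return classes[np.argmax(scores, axis=1)]
```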