Principal Components Analysis#

Some facts#

  1. This is arguably the most widely used unsupervised learning procedure.

  2. Invented by Karl Pearson (1901).

  3. Developed by Harold Hotelling (1933). \(\leftarrow\) Stanford pride!

  4. What does it do? It provides a way to visualize high-dimensional data by summarizing the most important information.


What is PCA good for?#

All-pairs scatterplot: pairs(USArrests)

Biplot for the first 2 principal components: Fig 10.1
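As a rough sketch in R (assuming the `USArrests` data set that ships with R), compare the all-pairs scatterplot with a plot of the first two principal component scores:

```r
# Six pairwise scatterplots of the four USArrests variables
pairs(USArrests)

# PCA on the scaled variables; the first two score vectors summarize
# most of the structure in a single two-dimensional plot
pca <- prcomp(USArrests, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2],
     xlab = "First principal component", ylab = "Second principal component")
```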

What is the first principal component?#

It is the line that passes closest to the cloud of samples, in terms of total squared Euclidean distance.

Fig 6.14
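A minimal numerical check of this interpretation, using `USArrests` as an illustrative data set (the helper `dist_to_line` below is defined only for this sketch):

```r
X <- scale(USArrests)                    # center (and scale) the variables
phi1 <- prcomp(X)$rotation[, 1]          # first principal component direction

# Total squared Euclidean distance from the samples to the line spanned by u
dist_to_line <- function(u) {
  u <- u / sqrt(sum(u^2))                # normalize to a unit vector
  proj <- X %*% u %*% t(u)               # projection of each sample onto the line
  sum((X - proj)^2)
}

dist_to_line(phi1)                       # the first PC achieves the minimum ...
dist_to_line(c(1, 0, 0, 0))              # ... any other direction does worse
```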

What does this look like with 3 variables?#

The first two principal components span the plane that lies closest to the data.

Fig 10.2a

Fig 10.2b


A second interpretation#

Among all possible directions, the projection of the data onto the first principal component is the one with the highest variance.

Fig 6.15
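The same kind of sketch for this interpretation: on `USArrests`, the variance of the projected samples is largest along the first loading vector (`proj_var` is a helper defined only for this check):

```r
X <- scale(USArrests)
phi1 <- prcomp(X)$rotation[, 1]

proj_var <- function(u) {
  u <- u / sqrt(sum(u^2))                # unit-length direction
  var(as.numeric(X %*% u))               # variance of the projections onto u
}

proj_var(phi1)                           # maximal over all unit directions
proj_var(c(1, 1, 1, 1))                  # any other direction gives a smaller value
```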

How do we say this in math?#

  • Let \(\mathbf{X}\) be a data matrix with \(n\) samples, and \(p\) variables.

  • From each variable, we subtract the mean of the column; i.e. we center the variables.

  • To find the first principal component \(\phi_1 = (\phi_{11},\dots,\phi_{p1})\), we solve the following optimization

\[\begin{split} \begin{aligned} \max_{\phi_{11},\dots,\phi_{p1}} \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p x_{ij}\phi_{j1} \right)^2 \right\} \\ \text{ subject to} \;\;\sum_{j=1}^p \phi_{j1}^2 =1. \end{aligned} \end{split}\]
  • The quantity

\[ z_{i1} = \sum_{j=1}^p x_{ij} \phi_{j1} \]

is the \(i\)th entry of the vector \(Z = X\phi_1\), i.e. the projection of the \(i\)th sample onto \(\phi_1\).

  • Averaging the squared entries of \(Z\) therefore computes the variance of the \(n\) samples projected onto \(\phi_1\), which is exactly the objective above. (This is a variance because we have centered the columns.)

  • Having found the maximizer \(\hat{\phi}_1\), the first principal component score vector is

\[ Z_1 = X\hat{\phi}_1 \]
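A sketch of these quantities in R, assuming `USArrests`; `prcomp` returns the maximizer \(\hat{\phi}_1\) as the first column of its `rotation` matrix.

```r
X <- scale(USArrests)                    # centered (and scaled) data matrix
phi1 <- prcomp(X)$rotation[, 1]          # the maximizer of the problem above

sum(phi1^2)                              # the constraint: equals 1
Z1 <- X %*% phi1                         # first principal component scores
mean(Z1^2)                               # the maximized objective, (1/n) * sum of z_i1^2
prcomp(X)$sdev[1]^2                      # same up to the 1/n vs 1/(n-1) convention
```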

Second principal component#

  • To find the second principal component \(\phi_2 = (\phi_{12},\dots,\phi_{p2})\), we solve the following optimization

\[\begin{split} \begin{aligned} \max_{\phi_{12},\dots,\phi_{p2}} \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j2}x_{ij} \right)^2 \right\} \\ \text{ subject to} \;\;\sum_{j=1}^p \phi_{j2}^2 =1 \;\;\text{ and }\;\; \sum_{j=1}^p \hat{\phi}_{j1}\phi_{j2} = 0. \end{aligned} \end{split}\]
  • The second constraint \(\hat{\phi}_1'\phi_2=0\) forces the first and second principal components to be orthogonal.

  • Having found the optimal \(\hat{\phi}_2\), the second principal component score is

\[ Z_2 = X\hat{\phi}_2 \]
  • It turns out that the constraint \(\hat{\phi}_1'\phi_2=0\) implies that the scores \(Z_1=(z_{11},\dots,z_{n1})\) and \(Z_2=(z_{12},\dots,z_{n2})\) are uncorrelated.
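Both constraints, and the resulting lack of correlation, can be checked numerically; a sketch on `USArrests`:

```r
pca <- prcomp(scale(USArrests))
phi1 <- pca$rotation[, 1]
phi2 <- pca$rotation[, 2]

sum(phi1 * phi2)             # orthogonality constraint: zero up to rounding error
cor(pca$x[, 1], pca$x[, 2])  # the score vectors Z1 and Z2 are uncorrelated
```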


Solving the optimization#

This optimization problem is fundamental in linear algebra. Its solution is given by either:

  1. The singular value decomposition (SVD) of (the centered) \(\mathbf{X}\): \(\mathbf{X} = \mathbf{U\Sigma\Phi}^T\), where the \(i\)th column of \(\mathbf{\Phi}\) is the \(i\)th principal component \(\hat{\phi}_i\), and the \(i\)th column of \(\mathbf{U\Sigma}\) is the \(i\)th vector of scores \((z_{1i},\dots,z_{ni})\).

  2. The eigendecomposition of \(\mathbf{X}^T\mathbf{X}\):

\[\mathbf{X}^T\mathbf{X} = \mathbf{\Phi\Sigma}^2\mathbf{\Phi}^T\]
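A sketch of the two equivalent computations in R (with `prcomp` for comparison), using `USArrests` as the centered data matrix; note that principal components are only determined up to a sign flip.

```r
X <- scale(USArrests)

# 1. SVD of X: the columns of v are the principal components,
#    and u %*% diag(d) holds the score vectors
s <- svd(X)
s$v[, 1]

# 2. Eigendecomposition of X'X: same directions, possibly with flipped signs
e <- eigen(t(X) %*% X)
e$vectors[, 1]

# 3. prcomp performs the SVD computation for us
prcomp(X)$rotation[, 1]
```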

Biplot#

Fig 10.1
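A plot along the lines of Fig 10.1 can be produced with R's `biplot`: the points are the samples at their first two scores, and the arrows are the loadings of the original variables.

```r
pca <- prcomp(USArrests, scale. = TRUE)
biplot(pca, scale = 0)   # scale = 0 plots scores and loadings on their own scales
```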

Scaling the variables#

  • Most of the time, we don’t care about the absolute numerical value of a variable.

  • Typically, we care about the value relative to the spread observed in the sample.

  • Before PCA, in addition to centering each variable, we also typically multiply it by a constant so that its variance equals 1.


Example: scaled vs. unscaled PCA#

Fig 10.3
  • In special cases, we have variables measured in the same unit; e.g. gene expression levels for different genes.

  • In such cases, we care about the absolute scale of the variables, and we can perform PCA without scaling.
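A sketch of the comparison behind Fig 10.3 on `USArrests`: without scaling, the variable with the largest raw variance dominates the first loading vector.

```r
prcomp(USArrests, scale. = TRUE)$rotation[, 1]   # scaled: all variables contribute
prcomp(USArrests, scale. = FALSE)$rotation[, 1]  # unscaled: Assault dominates
apply(USArrests, 2, var)                         # raw variances explain why
```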


How many principal components are enough?#

  • We said 2 principal components capture most of the relevant information. But how can we tell?

The proportion of variance explained#

  • We can think of the top principal components as directions in space in which the data vary the most.

  • The \(i\)th score vector \((z_{1i},\dots,z_{ni})\) can be interpreted as a new variable. The variance of this variable decreases as we take \(i\) from 1 to \(p\).

  • The total variance of the score vectors is the same as the total variance of the original variables:

\[\sum_{i=1}^p \frac{1}{n}\sum_{j=1}^n z_{ji}^2 = \sum_{k=1}^p \text{Var}(x_{k}).\]
  • We can quantify how much of the variance is captured by the first \(m\) principal components/score variables.
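In R these quantities come directly from the squared standard deviations reported by `prcomp`; a sketch on `USArrests` with scaled variables:

```r
pca <- prcomp(USArrests, scale. = TRUE)

pca$sdev^2                          # variance of each score vector
sum(pca$sdev^2)                     # total variance: 4, the number of scaled variables
pve <- pca$sdev^2 / sum(pca$sdev^2)
pve                                 # proportion of variance explained by each component
cumsum(pve)                         # cumulative proportion for the first m components
```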


Scree plot#

Fig 10.4
  • The variance of the \(m\)th score variable is:

\[ \frac{1}{n}Z_m^TZ_m = \frac{1}{n}\sum_{i=1}^n z_{im}^2 =\frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^p \hat{\phi}_{jm}x_{ij}\right)^2 \]
  • The scree plot displays these variances: we see that the top 2 components explain almost 90% of the total variance.
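A scree plot along the lines of Fig 10.4 can be sketched from the proportions of variance explained:

```r
pca <- prcomp(USArrests, scale. = TRUE)
pve <- pca$sdev^2 / sum(pca$sdev^2)

par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained", ylim = c(0, 1))
plot(cumsum(pve), type = "b", xlab = "Principal component",
     ylab = "Cumulative proportion explained", ylim = c(0, 1))
```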