CS 229 - Machine Learning

# Unsupervised Learning cheatsheet Star

## Introduction to Unsupervised Learning

Motivation The goal of unsupervised learning is to find hidden patterns in unlabeled data $\{x^{(1)},...,x^{(m)}\}$.

Jensen's inequality Let $f$ be a convex function and $X$ a random variable. We have the following inequality:

$\boxed{E[f(X)]\geqslant f(E[X])}$

## Clustering

### Expectation-Maximization

Latent variables Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted $z$. Here are the most common settings where there are latent variables:

 Setting Latent variable $z$ $x|z$ Comments Mixture of $k$ Gaussians $\textrm{Multinomial}(\phi)$ $\mathcal{N}(\mu_j,\Sigma_j)$ $\mu_j\in\mathbb{R}^n, \phi\in\mathbb{R}^k$ Factor analysis $\mathcal{N}(0,I)$ $\mathcal{N}(\mu+\Lambda z,\psi)$ $\mu_j\in\mathbb{R}^n$

Algorithm The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter $\theta$ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:

• E-step: Evaluate the posterior probability $Q_{i}(z^{(i)})$ that each data point $x^{(i)}$ came from a particular cluster $z^{(i)}$ as follows:
$\boxed{Q_i(z^{(i)})=P(z^{(i)}|x^{(i)};\theta)}$
• M-step: Use the posterior probabilities $Q_i(z^{(i)})$ as cluster specific weights on data points $x^{(i)}$ to separately re-estimate each cluster model as follows:
$\boxed{\theta_i=\underset{\theta}{\textrm{argmax }}\sum_i\int_{z^{(i)}}Q_i(z^{(i)})\log\left(\frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right)dz^{(i)}}$ ### $k$-means clustering

We note $c^{(i)}$ the cluster of data point $i$ and $\mu_j$ the center of cluster $j$.

Algorithm After randomly initializing the cluster centroids $\mu_1,\mu_2,...,\mu_k\in\mathbb{R}^n$, the $k$-means algorithm repeats the following step until convergence:

$\boxed{c^{(i)}=\underset{j}{\textrm{arg min}}||x^{(i)}-\mu_j||^2}\quad\textrm{and}\quad\boxed{\mu_j=\frac{\displaystyle\sum_{i=1}^m1_{\{c^{(i)}=j\}}x^{(i)}}{\displaystyle\sum_{i=1}^m1_{\{c^{(i)}=j\}}}}$ Distortion function In order to see if the algorithm converges, we look at the distortion function defined as follows:

$\boxed{J(c,\mu)=\sum_{i=1}^m||x^{(i)}-\mu_{c^{(i)}}||^2}$

### Hierarchical clustering

Algorithm It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.

Types There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:

 Ward linkage Average linkage Complete linkage Minimize within cluster distance Minimize average distance between cluster pairs Minimize maximum distance of between cluster pairs

### Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.

Silhouette coefficient By noting $a$ and $b$ the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient $s$ for a single sample is defined as follows:

$\boxed{s=\frac{b-a}{\max(a,b)}}$

Calinski-Harabaz index By noting $k$ the number of clusters, $B_k$ and $W_k$ the between and within-clustering dispersion matrices respectively defined as

$B_k=\sum_{j=1}^kn_{c^{(i)}}(\mu_{c^{(i)}}-\mu)(\mu_{c^{(i)}}-\mu)^T,\quad\quad W_k=\sum_{i=1}^m(x^{(i)}-\mu_{c^{(i)}})(x^{(i)}-\mu_{c^{(i)}})^T$
the Calinski-Harabaz index $s(k)$ indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:

$\boxed{s(k)=\frac{\textrm{Tr}(B_k)}{\textrm{Tr}(W_k)}\times\frac{N-k}{k-1}}$

## Dimension reduction

### Principal component analysis

It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.

Eigenvalue, eigenvector Given a matrix $A\in\mathbb{R}^{n\times n}$, $\lambda$ is said to be an eigenvalue of $A$ if there exists a vector $z\in\mathbb{R}^n\backslash\{0\}$, called eigenvector, such that we have:

$\boxed{Az=\lambda z}$

Spectral theorem Let $A\in\mathbb{R}^{n\times n}$. If $A$ is symmetric, then $A$ is diagonalizable by a real orthogonal matrix $U\in\mathbb{R}^{n\times n}$. By noting $\Lambda=\textrm{diag}(\lambda_1,...,\lambda_n)$, we have:

$\boxed{\exists\Lambda\textrm{ diagonal},\quad A=U\Lambda U^T}$

Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix $A$.

Algorithm The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on $k$ dimensions by maximizing the variance of the data as follows:

• Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.
$\boxed{x_j^{(i)}\leftarrow\frac{x_j^{(i)}-\mu_j}{\sigma_j}}\quad\textrm{where}\quad\boxed{\mu_j = \frac{1}{m}\sum_{i=1}^mx_j^{(i)}}\quad\textrm{and}\quad\boxed{\sigma_j^2=\frac{1}{m}\sum_{i=1}^m(x_j^{(i)}-\mu_j)^2}$
• Step 2: Compute $\displaystyle\Sigma=\frac{1}{m}\sum_{i=1}^mx^{(i)}{x^{(i)}}^T\in\mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.
• Step 3: Compute $u_1, ..., u_k\in\mathbb{R}^n$ the $k$ orthogonal principal eigenvectors of $\Sigma$, i.e. the orthogonal eigenvectors of the $k$ largest eigenvalues.
• Step 4: Project the data on $\textrm{span}_\mathbb{R}(u_1,...,u_k)$.

This procedure maximizes the variance among all $k$-dimensional spaces. ### Independent component analysis

It is a technique meant to find the underlying generating sources.

Assumptions We assume that our data $x$ has been generated by the $n$-dimensional source vector $s=(s_1,...,s_n)$, where $s_i$ are independent random variables, via a mixing and non-singular matrix $A$ as follows:

$\boxed{x=As}$

The goal is to find the unmixing matrix $W=A^{-1}$.

Bell and Sejnowski ICA algorithm This algorithm finds the unmixing matrix $W$ by following the steps below:

• Write the probability of $x=As=W^{-1}s$ as:
$p(x)=\prod_{i=1}^np_s(w_i^Tx)\cdot|W|$
• Write the log likelihood given our training data $\{x^{(i)}, i\in[\![1,m]\!]\}$ and by noting $g$ the sigmoid function as:
$l(W)=\sum_{i=1}^m\left(\sum_{j=1}^n\log\Big(g'(w_j^Tx^{(i)})\Big)+\log|W|\right)$
Therefore, the stochastic gradient ascent learning rule is such that for each training example $x^{(i)}$, we update $W$ as follows:
$\boxed{W\longleftarrow W+\alpha\left(\begin{pmatrix}1-2g(w_1^Tx^{(i)})\\1-2g(w_2^Tx^{(i)})\\\vdots\\1-2g(w_n^Tx^{(i)})\end{pmatrix}{x^{(i)}}^T+(W^T)^{-1}\right)}$