Clustering#

  • As in classification, we assign a class to each sample in the data matrix.

  • However, the class is not an output variable; we only use the input variables, since we are in an unsupervised setting.

  • Clustering is an unsupervised procedure, whose goal is to find homogeneous subgroups among the observations.

  • We will discuss two algorithms (covered in more depth later): \(K\)-means and hierarchical clustering.


\(K\)-means clustering#

  1. \(K\) is the number of clusters and must be fixed in advance.

  2. The goal of this method is to make the samples within each cluster as similar as possible, i.e. to minimize the total within-cluster variation:

\[\min_{C_1,\dots,C_K} \sum_{\ell=1}^K W(C_\ell) \quad;\quad W(C_\ell) = \frac{1}{|C_\ell|}\sum_{i,j\in C_\ell} \text{Distance}^2(x_{i},x_{j}).\]
  • This is a partitional method: the samples are divided directly into \(K\) clusters, rather than by building a hierarchy.


\(K\)-means clustering#

Fig 10.5

\(K\)-means clustering algorithm#

  • Step 1: Assign each sample to a cluster from \(1\) to \(K\) arbitrarily, e.g. at random.

  • Step 2: Iterate the following two steps until the cluster assignments stop changing:

  • Step 2a: Find the centroid of each cluster \(\hat{C}_{\ell}\), i.e. the componentwise average \(\overline x_{\ell}\) of all the samples in the cluster:

\[\overline{x}_{\ell j} = \frac{1}{|\hat{C}_\ell|}\sum_{i\in \hat{C}_\ell} x_{ij} \quad \text{for }j=1,\dots,p.\]
  • Step 2b: Reassign each sample to the nearest centroid:

\[ \hat{C}_{\ell} = \left\{\, i: \text{Distance}(x_i, \overline{x}_{\ell}) \leq \text{Distance}(x_i, \overline{x}_k) \ \text{ for all } 1 \leq k \leq K \,\right\} \]
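
Below is a minimal NumPy sketch of this procedure (Lloyd's algorithm), assuming Euclidean distance; the function name, random seed, and data are illustrative rather than taken from any library.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Bare-bones K-means: random initial assignment, then alternate steps 2a and 2b.

    Assumes Euclidean distance and that no cluster becomes empty along the way
    (a real implementation would handle empty clusters)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=X.shape[0])           # step 1: arbitrary assignment
    for _ in range(max_iter):
        # Step 2a: centroid = componentwise mean of each current cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

X = np.random.default_rng(1).normal(size=(50, 2))        # illustrative data
labels, centroids = kmeans(X, K=3)
```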

\(K\)-means clustering algorithm#

Fig 10.6

Properties of \(K\)-means clustering#

  • The algorithm always converges to a local minimum of

\[\min_{C_1,\dots,C_K} \sum_{\ell=1}^K W(C_\ell)\quad;\quad W(C_\ell) = \frac{1}{|C_\ell|}\sum_{i,j\in C_\ell} \text{Distance}^2(x_{i},x_{j}).\]
  • Why?

\[ \frac{1}{|C_\ell|}\sum_{i,j\in C_\ell} \text{Distance}^2(x_{i},x_{j}) = 2\sum_{i\in C_\ell} \text{Distance}^2(x_{i},\overline x_{\ell}) \]

The right-hand side can only decrease at each iteration (step 2a minimizes it over the centroids, step 2b over the assignments), so the objective never increases and the algorithm converges; a quick numerical check of this identity is given below.

  • Each initialization could yield a different local minimum.
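
The identity above can be checked numerically; here is a small sketch with Euclidean distance on one randomly generated cluster (the data are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(8, 3))                  # one cluster: 8 samples in 3 dimensions

# Left-hand side: average of all pairwise squared distances within the cluster
diffs = C[:, None, :] - C[None, :, :]
lhs = (diffs ** 2).sum(axis=2).sum() / len(C)

# Right-hand side: twice the sum of squared distances to the cluster centroid
centroid = C.mean(axis=0)
rhs = 2 * ((C - centroid) ** 2).sum()

print(np.isclose(lhs, rhs))                  # True
```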


Example: \(K\)-means output with different initializations#

Fig 10.7

In practice, we start from many random initializations and choose the output which minimizes the objective function.
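
With scikit-learn, for example, the `n_init` argument of `KMeans` does exactly this: the algorithm is run from several random starts and the solution with the smallest objective value is kept (the data matrix `X` below is illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))       # illustrative data

# 20 random initializations; the best run (lowest within-cluster
# sum of squares) is returned.
km = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)

print(km.inertia_)       # objective value of the best run
print(km.labels_[:10])   # cluster assignments of the first 10 samples
```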


Hierarchical clustering#

Fig 10.11

Many algorithms for hierarchical clustering are agglomerative: they start with each sample in its own cluster and repeatedly merge the two closest clusters.


Hierarchical clustering#

Fig 10.10
  • The output of the algorithm is a dendrogram.

  • We must be careful about how we interpret the dendrogram.

Hierarchical clustering#

Fig 10.9
  • The number of clusters is not fixed.

  • Hierarchical clustering is not always appropriate.

  • E.g. Market segmentation for consumers of 3 different nationalities.

    • Natural 2 clusters: gender

    • Natural 3 clusters: nationality

  • These clusterings are not nested, so no single hierarchy (dendrogram) can represent both.


Distance between clusters#

  • At each step, we link the 2 clusters that are deemed “closest” to each other.

  • Hierarchical clustering algorithms are classified according to the notion of distance between clusters that they use, known as the linkage.


Distance between clusters#

Complete linkage
  • Complete linkage: The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster.


Distance between clusters#

Single linkage
  • Single linkage: The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster.

  • Suffers from the chaining phenomenon (illustrated in Fig 10.12 below).


Distance between clusters#

Average linkage
  • Average linkage: The distance between 2 clusters is the average of all pairwise distances.


Fig 10.12
  • Chaining: many merges simply add a single point to a cluster…
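
A short SciPy sketch comparing the three linkages on the same (simulated) data; `method` selects the linkage, and `fcluster` cuts the resulting dendrogram into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))        # illustrative data

for method in ("complete", "single", "average"):
    Z = linkage(X, method=method)                        # one row per merge
    labels = fcluster(Z, t=3, criterion="maxclust")      # cut into 3 clusters
    print(method, np.bincount(labels)[1:])               # cluster sizes
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram (needs matplotlib)
```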


Clustering is riddled with questions and choices#

  1. Is clustering appropriate? i.e. could a sample belong to more than one cluster?

    • If so, consider mixture models, soft clustering, or topic models.

  2. How many clusters are appropriate?

    • There are formal methods based on the gap statistic, mixture models, etc.

  3. Are the clusters robust?

    • Run the clustering on different random subsets of the data. Is the structure preserved?

    • Try different clustering algorithms. Are the conclusions consistent?

  4. Most important: temper your conclusions.


Clustering is riddled with questions and choices#

  1. Should we scale the variables before clustering? Scaling changes the notion of distance (see the sketch below).

    • Variables with larger variance have a larger effect on the Euclidean distance between two samples.

  2. Does Euclidean distance capture dissimilarity between samples?
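
A small sketch of why scaling matters (the two variables and their scales are invented for illustration): after standardizing each variable, both contribute comparably to the Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two variables on very different scales, e.g. income (dollars) and age (years)
X = np.column_stack([rng.normal(50_000, 15_000, size=5),
                     rng.normal(40, 12, size=5)])

print(pdist(X).round(1))            # distances dominated by the high-variance variable

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variable
print(pdist(X_scaled).round(2))     # both variables now contribute
```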


Choosing an appropriate distance#

Fig 10.13
  • Example: Suppose that we want to cluster customers at a store for market segmentation.

    • Samples are customers

    • Each variable corresponds to a specific product and measures the number of items bought by the customer during a year.


  1. Euclidean distance would group together all the customers who purchase few items (orange and purple in Fig 10.13).

  2. Perhaps we instead want to group customers who purchase similar sets of items (orange and teal).

  3. In this case, the correlation distance may be a more appropriate measure of dissimilarity between samples; a small sketch follows below.
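
As a sketch of this idea (the purchase counts below are simulated), SciPy's `pdist` supports a correlation-based dissimilarity, which can be fed directly into hierarchical clustering.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Simulated purchase counts: 6 customers (rows) x 20 products (columns)
X = np.random.default_rng(0).poisson(2, size=(6, 20))

# Correlation distance: 1 minus the correlation between two purchase profiles
D = pdist(X, metric="correlation")
print(squareform(D).round(2))                    # pairwise dissimilarity matrix

# Hierarchical clustering on the precomputed dissimilarities
Z = linkage(D, method="average")
print(fcluster(Z, t=2, criterion="maxclust"))    # split the customers into 2 clusters
```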