Classification trees
Classification trees work much like regression trees; both are examples of decision trees.
We predict the response by majority vote, i.e. we pick the most common class in each region.
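As a minimal sketch of the majority vote (the region assignments and class labels below are made-up toy values, not from the text):

```python
import numpy as np

# Toy data: region assignment and class label for each training sample.
regions = np.array([0, 0, 0, 1, 1, 1, 1])
classes = np.array(["A", "B", "A", "B", "B", "C", "B"])

# Majority vote: the predicted class in each region is its most common class.
for m in np.unique(regions):
    labels, counts = np.unique(classes[regions == m], return_counts=True)
    print(f"Region {m}: predict {labels[np.argmax(counts)]}")
# Region 0: predict A
# Region 1: predict B
```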
Instead of trying to minimize the RSS:
\[\cancel{\sum_{m=1}^{|T|} \sum_{x_i\in R_m} (y_i-\bar y_{R_m})^2}\]
we minimize a classification loss function.
Classification losses
The 0-1 loss, which after dividing by the number of samples \(n\) gives the misclassification rate:
\[\sum_{m=1}^{|T|} \sum_{x_i\in R_m} \mathbf{1}(y_i \neq \hat y_{R_m})\]
Let \(\hat p_{mk}\) be the proportion of class \(k\) within \(R_m\), and let \(q_m\) be the proportion of samples that fall in \(R_m\). The Gini index is:
\[\sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}),\]
The cross-entropy:
\[- \sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}\log(\hat p_{mk}).\]
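A short sketch evaluating the three losses for a hypothetical partition; the per-region proportions `p_hat[m, k]` and region weights `q[m]` below are assumed toy values, not from the text:

```python
import numpy as np

# Hypothetical per-region class proportions p_hat[m, k] and region
# weights q[m] (share of samples in region m); toy values for illustration.
p_hat = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.5, 0.3]])
q = np.array([0.6, 0.4])

# Misclassification rate: fraction of samples not in the majority class.
misclass = np.sum(q * (1 - p_hat.max(axis=1)))

# Gini index: sum_k p(1 - p) in each region, weighted by region size.
gini = np.sum(q * np.sum(p_hat * (1 - p_hat), axis=1))

# Cross-entropy: -sum_k p log p, with 0 log 0 taken to be 0.
plogp = p_hat * np.log(np.where(p_hat > 0, p_hat, 1.0))
cross_entropy = -np.sum(q * np.sum(plogp, axis=1))

print(misclass, gini, cross_entropy)
```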
Comments
The Gini index and cross-entropy are better measures of the purity of a region than the misclassification rate: both are small when the region contains predominantly one class.
Motivation for the Gini index: if, instead of predicting the most likely class in a region, we predict a class drawn at random from the distribution \((\hat p_{m1},\hat p_{m2},\dots,\hat p_{mK})\), then the Gini index is the expected misclassification rate.
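Spelling out that expectation for a single region \(R_m\): the true class is \(k\) with probability \(\hat p_{mk}\), and the random prediction misses it with probability \(1-\hat p_{mk}\), so the expected misclassification rate is
\[\sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}),\]
which is exactly the Gini contribution of \(R_m\).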
It is typical to use the Gini index or cross-entropy for growing the tree, while using the misclassification rate when pruning the tree.
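A minimal sketch of that workflow, assuming scikit-learn's `DecisionTreeClassifier` and the iris data purely for illustration: the tree is grown with `criterion="gini"` (or `"entropy"` for cross-entropy), and the amount of cost-complexity pruning is then chosen by held-out accuracy, i.e. by the misclassification rate:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree using the Gini index as the split criterion.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)

# Cost-complexity pruning: among the candidate pruning levels, keep the one
# with the lowest held-out misclassification rate (highest accuracy).
path = tree.cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(criterion="gini", ccp_alpha=a, random_state=0)
     .fit(X_train, y_train) for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))
```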