CS 229 - Machine Learning

# Machine Learning tips and tricks cheatsheet

## Classification metrics

In the context of binary classification, here are the main metrics to track in order to assess the performance of the model.

Confusion matrix The confusion matrix gives a more complete picture when assessing the performance of a model. It is defined as follows:

| | Predicted class + | Predicted class - |
|---|---|---|
| Actual class + | TP: True Positives | FN: False Negatives (Type II error) |
| Actual class - | FP: False Positives (Type I error) | TN: True Negatives |
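The four cells can be counted directly from a list of actual labels and a list of predicted labels. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample data are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # correctly predicted positive
        elif actual == positive:
            fn += 1  # missed positive (Type II error)
        elif predicted == positive:
            fp += 1  # false alarm (Type I error)
        else:
            tn += 1  # correctly predicted negative
    return tp, fn, fp, tn


# Illustrative data, not taken from any real model.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```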

Main metrics The following metrics are commonly used to assess the performance of classification models:

| Metric | Formula | Interpretation |
|---|---|---|
| Accuracy | $\displaystyle\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}$ | Overall performance of model |
| Precision | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}$ | How accurate the positive predictions are |
| Recall, Sensitivity | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ | Coverage of actual positive sample |
| Specificity | $\displaystyle\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}$ | Coverage of actual negative sample |
| F1 score | $\displaystyle\frac{2\textrm{TP}}{2\textrm{TP}+\textrm{FP}+\textrm{FN}}$ | Hybrid metric useful for unbalanced classes |
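As a minimal sketch, the formulas in the table translate directly into code once the four confusion-matrix counts are known. The function name `classification_metrics` and the counts used in the example are assumptions made for illustration; the sketch does not guard against empty denominators.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the main classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard for degenerate cases).
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Illustrative counts, not taken from any real model.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 85/100 and recall is 40/50, while F1 (80/95) sits between precision (40/45) and recall.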

ROC The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

| Metric | Formula | Equivalent |
|---|---|---|
| True Positive Rate (TPR) | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ | Recall, sensitivity |
| False Positive Rate (FPR) | $\displaystyle\frac{\textrm{FP}}{\textrm{TN}+\textrm{FP}}$ | 1-specificity |
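The threshold sweep can be sketched as follows, assuming the model outputs a score per example and predicts the positive class when the score is at or above the threshold. The helper names `roc_points` and `trapezoid_area` and the sample scores are hypothetical; the area under the resulting points, computed here with the trapezoidal rule, is the quantity known as AUC.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    # A threshold above every score predicts no positives: FPR = TPR = 0.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative labels and scores, not taken from any real model.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
auc = trapezoid_area(curve)
print(curve[0], curve[-1], round(auc, 4))
```

In this example 8 of the 9 (positive, negative) pairs are ranked correctly, which matches the computed area of 8/9.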

AUC The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:

[Figure: ROC curve and the area under it (AUC)]

## Regression metrics

Basic metrics Given a regression model $f$, the following metrics are commonly used to assess the performance of the model:

| Total sum of squares | Explained sum of squares | Residual sum of squares |
|---|---|---|
| $\displaystyle\textrm{SS}_{\textrm{tot}}=\sum_{i=1}^m(y_i-\overline{y})^2$ | $\displaystyle\textrm{SS}_{\textrm{reg}}=\sum_{i=1}^m(f(x_i)-\overline{y})^2$ | $\displaystyle\textrm{SS}_{\textrm{res}}=\sum_{i=1}^m(y_i-f(x_i))^2$ |

Coefficient of determination The coefficient of determination, often noted $R^2$ or $r^2$, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:

$\boxed{R^2=1-\frac{\textrm{SS}_\textrm{res}}{\textrm{SS}_\textrm{tot}}}$
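A minimal sketch of these definitions, assuming the predictions $f(x_i)$ have already been computed and are passed in as a list. The function names and the sample values are illustrative only.

```python
def sums_of_squares(y, predictions):
    """Return (SS_tot, SS_reg, SS_res) for observed values and predictions."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = sum((fi - y_bar) ** 2 for fi in predictions)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, predictions))
    return ss_tot, ss_reg, ss_res


def r_squared(y, predictions):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_tot, _, ss_res = sums_of_squares(y, predictions)
    return 1 - ss_res / ss_tot


# Illustrative data: predictions are the means of the two halves of y.
y = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 1.5, 3.5, 3.5]
print(sums_of_squares(y, predictions))  # (5.0, 4.0, 1.0)
print(round(r_squared(y, predictions), 3))  # 0.8
```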

Main metrics The following metrics are commonly used to assess the performance of regression models while taking into account the number of variables $n$ that they use:

| Mallows's $C_p$ | AIC | BIC | Adjusted $R^2$ |
|---|---|---|---|
| $\displaystyle\frac{\textrm{SS}_{\textrm{res}}+2(n+1)\widehat{\sigma}^2}{m}$ | $\displaystyle2\Big[(n+2)-\log(L)\Big]$ | $\displaystyle\log(m)(n+2)-2\log(L)$ | $\displaystyle1-\frac{(1-R^2)(m-1)}{m-n-1}$ |

where $L$ is the likelihood and $\widehat{\sigma}^2$ is an estimate of the variance associated with each response.
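The four criteria can be written as small functions that follow the formulas above term by term. This is a sketch under stated assumptions: the log-likelihood $\log(L)$ and the variance estimate $\widehat{\sigma}^2$ are taken as inputs rather than computed, and the function names and numeric inputs are hypothetical.

```python
import math


def mallows_cp(ss_res, sigma2_hat, n, m):
    """(SS_res + 2 (n + 1) sigma_hat^2) / m."""
    return (ss_res + 2 * (n + 1) * sigma2_hat) / m


def aic(log_likelihood, n):
    """2 [(n + 2) - log(L)]."""
    return 2 * ((n + 2) - log_likelihood)


def bic(log_likelihood, n, m):
    """log(m) (n + 2) - 2 log(L)."""
    return math.log(m) * (n + 2) - 2 * log_likelihood


def adjusted_r_squared(r2, n, m):
    """1 - (1 - R^2)(m - 1) / (m - n - 1)."""
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)


# Illustrative inputs: m = 50 examples, n = 3 variables.
print(mallows_cp(ss_res=100.0, sigma2_hat=2.0, n=3, m=50))
print(aic(log_likelihood=-120.0, n=3))
print(bic(log_likelihood=-120.0, n=3, m=50))
print(adjusted_r_squared(r2=0.8, n=3, m=50))
```

For a fixed fit, each criterion penalizes a larger $n$; the adjusted $R^2$ is always at most $R^2$ when $n\geqslant 0$ and $m>n+1$.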

## Model selection

Vocabulary When selecting a model, we distinguish three different parts of the data, as follows:

| Set | Role | Notes |
|---|---|---|
| Training set | Model is trained | Usually 80% of the dataset |
| Validation set | Model is assessed | Usually 20% of the dataset; also called hold-out or development set |
| Testing set | Model gives predictions | Unseen data |
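One way to produce these three parts is sketched below. It assumes a test set is first held out (the 20% `test_fraction` is an assumption, since only the 80%/20% training/validation proportions are given above), and that the remaining data is then split 80%/20%. The function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(samples, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle, hold out a test set, then split the rest into training/validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    n_validation = int(len(remaining) * validation_fraction)
    validation = remaining[:n_validation]
    training = remaining[n_validation:]
    return training, validation, test


training, validation, test = split_dataset(range(100))
print(len(training), len(validation), len(test))  # 64 16 20
```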

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:

[Figure: training, validation, and testing sets]

Cross-validation Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

| Method | Procedure | Notes |
|---|---|---|
| $k$-fold | Training on $k-1$ folds and assessment on the remaining one | Generally $k=5$ or $10$ |
| Leave-$p$-out | Training on $n-p$ observations and assessment on the $p$ remaining ones | Case $p=1$ is called leave-one-out |
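The $k$-fold scheme can be sketched as below. The helper names, the trivial "predict the training mean" model, and the data are all hypothetical and chosen only to keep the example self-contained; in practice the data is usually shuffled before being cut into folds.

```python
def k_fold_indices(m, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, m))
        yield train, validation
        start += size


def cross_validation_error(xs, ys, fit, error, k=5):
    """Average the validation error over the k folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            error(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx])
        )
    return sum(fold_errors) / k


# Hypothetical model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mean_squared_error(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0 * x for x in xs]
cv_error = cross_validation_error(xs, ys, fit_mean, mean_squared_error, k=5)
print(cv_error)  # 51.0
```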

The most commonly used method is called $k$-fold cross-validation. It splits the training data into $k$ folds, validates the model on one fold while training it on the $k-1$ other folds, and repeats this $k$ times so that each fold is used once for validation. The error is then averaged over the $k$ folds and is named the cross-validation error.

Regularization The regularization procedure aims to prevent the model from overfitting the data and thus deals with high variance issues. The following table sums up the commonly used regularization techniques:

| LASSO | Ridge | Elastic Net |
|---|---|---|
| Shrinks coefficients to 0; good for variable selection | Makes coefficients smaller | Tradeoff between variable selection and small coefficients |
| $...+\lambda\lVert\theta\rVert_1$, $\lambda\in\mathbb{R}$ | $...+\lambda\lVert\theta\rVert_2^2$, $\lambda\in\mathbb{R}$ | $...+\lambda\Big[(1-\alpha)\lVert\theta\rVert_1+\alpha\lVert\theta\rVert_2^2\Big]$, $\lambda\in\mathbb{R},\alpha\in[0,1]$ |
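The penalty terms added to the loss can be sketched as follows. Only the penalty is computed here; the loss it is added to (the "..." in the formulas) is left out. The function names and the coefficient vector are illustrative assumptions.

```python
def l1_penalty(theta, lam):
    """LASSO penalty: lambda * ||theta||_1."""
    return lam * sum(abs(t) for t in theta)


def l2_penalty(theta, lam):
    """Ridge penalty: lambda * ||theta||_2^2."""
    return lam * sum(t * t for t in theta)


def elastic_net_penalty(theta, lam, alpha):
    """Elastic Net penalty: lambda * [(1 - alpha) ||theta||_1 + alpha ||theta||_2^2]."""
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return lam * ((1 - alpha) * l1 + alpha * l2)


# Illustrative coefficients.
theta = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(theta, lam=0.1))
print(l2_penalty(theta, lam=0.1))
print(elastic_net_penalty(theta, lam=0.1, alpha=0.5))
```

With this parametrization, $\alpha=0$ recovers the LASSO penalty and $\alpha=1$ recovers the Ridge penalty.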

## Diagnostics

Bias The bias of a model is the difference between the model's expected prediction and the correct value that we try to predict for given data points.

Variance The variance of a model is the variability of the model prediction for given data points.

Bias/variance tradeoff The simpler the model, the higher the bias, and the more complex the model, the higher the variance.
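The tradeoff can be illustrated with a small simulation. Everything in the sketch below is an assumption made for illustration: the true function $x^2$, the noise level, the query point $x_0=0.9$, and the two models (a very simple one that predicts the mean of the training targets, and a more flexible one that predicts the target of the nearest training point). Bias and variance are estimated by refitting each model on many simulated training sets.

```python
import random
import statistics


def true_function(x):
    return x * x


def simulate_predictions(predict, x0, n_datasets=500, n_points=20, noise=0.3, seed=0):
    """Predict at x0 after fitting on many independently simulated training sets."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_datasets):
        xs = [rng.random() for _ in range(n_points)]
        ys = [true_function(x) + rng.gauss(0.0, noise) for x in xs]
        predictions.append(predict(xs, ys, x0))
    return predictions


def mean_model(xs, ys, x0):
    # Simple model: ignores x0 and predicts the average training target.
    return sum(ys) / len(ys)


def nearest_neighbor_model(xs, ys, x0):
    # Flexible model: copies the target of the closest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_and_variance(predictions, target):
    bias = statistics.mean(predictions) - target
    variance = statistics.pvariance(predictions)
    return bias, variance


x0 = 0.9
target = true_function(x0)
simple_bias, simple_var = bias_and_variance(simulate_predictions(mean_model, x0), target)
flexible_bias, flexible_var = bias_and_variance(
    simulate_predictions(nearest_neighbor_model, x0), target
)
print(f"simple model:   bias={simple_bias:.3f}, variance={simple_var:.3f}")
print(f"flexible model: bias={flexible_bias:.3f}, variance={flexible_var:.3f}")
```

Under these assumptions the simple model shows the larger bias and the flexible model the larger variance, which is the pattern the tradeoff describes.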

| | Underfitting | Just right | Overfitting |
|---|---|---|---|
| Symptoms | High training error; training error close to test error; high bias | Training error slightly lower than test error | Very low training error; training error much lower than test error; high variance |
| Possible remedies | Complexify model; add more features; train longer | | Perform regularization; get more data |

[Figures: regression, classification, and deep learning illustrations of underfitting, a good fit, and overfitting]

Error analysis Error analysis consists of analyzing the root cause of the difference in performance between the current and the perfect models.

Ablative analysis Ablative analysis consists of analyzing the root cause of the difference in performance between the current and the baseline models.