Validation

Thinking about the true loss function is important

Most of the regression methods we’ve studied aim to minimize the RSS, while classification methods aim to minimize the 0-1 loss.
In classification, we often care about certain kinds of error more than others; in that case, the natural loss function is not the 0-1 loss.
Even if we use a method which minimizes a certain kind of training error, we can tune it to optimize our true loss function.
Example: in the Default study, we could choose the classification threshold that brings the false negative rate below an acceptable level.
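As a rough illustration of tuning a threshold to a loss other than the 0-1 loss, the sketch below picks the largest threshold whose false negative rate stays under a target. The data are synthetic (not the Default data), the scores merely stand in for a fitted classifier's predicted probabilities, and the helper names `false_negative_rate` and `largest_threshold_below` are hypothetical.

```python
import numpy as np

def false_negative_rate(y_true, p_hat, threshold):
    """Fraction of true positives (y_true == 1) that are predicted negative."""
    y_true = np.asarray(y_true)
    pred_positive = np.asarray(p_hat) >= threshold
    positives = y_true == 1
    return np.mean(~pred_positive[positives])

def largest_threshold_below(y_true, p_hat, max_fnr, grid=None):
    """Largest threshold on the grid whose false negative rate is at most max_fnr."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    # Threshold 0 classifies everything as positive (FNR = 0),
    # so the list is non-empty whenever 0 is on the grid.
    ok = [t for t in grid if false_negative_rate(y_true, p_hat, t) <= max_fnr]
    return max(ok)

# Synthetic stand-in for predicted probabilities: positives tend to score higher.
rng = np.random.default_rng(0)
n = 2000
y = rng.binomial(1, 0.1, size=n)
p_hat = np.clip(rng.normal(0.2 + 0.4 * y, 0.15), 0.0, 1.0)

fnr_default = false_negative_rate(y, p_hat, 0.5)   # usual 0.5 cutoff
t_star = largest_threshold_below(y, p_hat, max_fnr=0.05)
fnr_tuned = false_negative_rate(y, p_hat, t_star)

print(f"FNR at 0.5: {fnr_default:.3f}")
print(f"chosen threshold: {t_star:.2f}, FNR: {fnr_tuned:.3f}")
```

Lowering the threshold reduces false negatives at the cost of more false positives, so in practice the threshold would be chosen on held-out data with both error rates in view.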

How to choose a supervised method that minimizes the test error

In addition to choosing among methods, we may need to tune the parameters of each method, for example:
- \(k\) in \(k\)-nearest neighbors.
- The number of variables to include in forward or backward selection.
- The order of a polynomial in polynomial regression.

Validation set approach

Use of a validation set is one way to approximate the test error:

Divide the data into two parts: a training set and a validation set.
Train each model on the training set.
Compute the error of each model on the held-out validation set.
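The three steps above can be sketched in a few lines. This is a minimal illustration on synthetic data (a quadratic trend plus noise), assuming a 50/50 random split and polynomial fits via `numpy.polyfit`; the helper `validation_mse` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data standing in for a real data set: a curved trend plus noise.
n = 400
x = rng.uniform(-2, 2, size=n)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=n)

# Step 1: divide the data into two parts at random.
perm = rng.permutation(n)
train, val = perm[: n // 2], perm[n // 2 :]

def validation_mse(degree):
    # Step 2: train the model (a polynomial of the given degree) on one part.
    coefs = np.polyfit(x[train], y[train], deg=degree)
    # Step 3: compute the error on the remaining validation data.
    resid = y[val] - np.polyval(coefs, x[val])
    return np.mean(resid**2)

degrees = range(1, 7)
val_errors = {d: validation_mse(d) for d in degrees}
best_degree = min(val_errors, key=val_errors.get)

for d, err in val_errors.items():
    print(f"degree {d}: validation MSE = {err:.3f}")
print("degree with lowest validation error:", best_degree)
```

The validation error here plays the role of an estimate of the test error: the models never see the validation half during fitting.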

Fig. 29 (Fig 5.1) Schematic of the validation set approach.

Example: choosing the order of a polynomial

Fig. 30 (Fig 5.2) Left: validation error as a function of degree. Right: validation error curves for multiple splits into training and validation sets.

Polynomial regression to estimate mpg from horsepower in the Auto data.
Problem: Every split yields a different estimate of the error.
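To see this variability concretely, the sketch below repeats the validation set approach over several random splits and records the validation error curve for each. It uses synthetic data rather than the Auto data, and the helper `validation_curve` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (not the Auto data): a quadratic trend plus noise.
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=n)

degrees = list(range(1, 7))

def validation_curve(split_rng):
    """Validation MSE for each degree, using one random 50/50 split."""
    perm = split_rng.permutation(n)
    train, val = perm[: n // 2], perm[n // 2 :]
    curve = []
    for d in degrees:
        coefs = np.polyfit(x[train], y[train], deg=d)
        resid = y[val] - np.polyval(coefs, x[val])
        curve.append(np.mean(resid**2))
    return np.array(curve)

n_splits = 10
curves = np.array([validation_curve(rng) for _ in range(n_splits)])  # shape (10, 6)

# Range of the error estimate across splits, for each degree.
spread = curves.max(axis=0) - curves.min(axis=0)
for d, s in zip(degrees, spread):
    print(f"degree {d}: validation MSE varies by {s:.3f} across splits")
```

Each row of `curves` corresponds to one curve in a plot like the right-hand panel of the figure: the overall shape is similar from split to split, but the estimated error at any given degree shifts.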