Validation

Thinking about the true loss function is important

  • Most of the regression methods we’ve studied aim to minimize the RSS, while classification methods aim to minimize the 0-1 loss.

  • In classification, we often care about certain kinds of error more than others; i.e. the natural loss function is not the 0-1 loss.

  • Even if we use a method which minimizes a certain kind of training error, we can tune it to optimize our true loss function.

  • Example: in the default study, we could choose the classification threshold that brings the false negative rate below an acceptable level.
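The threshold search in the example above can be sketched as follows. This is a minimal illustration, not the procedure used in the default study: the labels and predicted probabilities are made-up toy values, and the helper names `fnr_at_threshold` and `largest_threshold_meeting_fnr` are hypothetical. In practice the threshold would be chosen on held-out data rather than on the training set.

```python
import numpy as np

def fnr_at_threshold(y_true, prob, threshold):
    """False negative rate when we predict positive iff prob >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(prob) >= threshold
    n_positive = y_true.sum()
    if n_positive == 0:
        raise ValueError("need at least one positive example")
    false_negatives = np.sum(y_true & ~pred)
    return false_negatives / n_positive

def largest_threshold_meeting_fnr(y_true, prob, max_fnr):
    """Largest observed probability usable as a threshold with FNR <= max_fnr."""
    # The FNR can only grow as the threshold grows, so scan from high to low
    # and stop at the first threshold that satisfies the constraint.
    for t in np.unique(prob)[::-1]:
        if fnr_at_threshold(y_true, prob, t) <= max_fnr:
            return float(t)
    return None  # only reached if max_fnr is negative

# Toy data (assumed for illustration): 1 = default, 0 = no default.
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
prob = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90])

default_fnr = fnr_at_threshold(y, prob, 0.5)          # usual 0.5 cutoff
tuned = largest_threshold_meeting_fnr(y, prob, 0.25)  # allow at most 25% FNR
print(default_fnr, tuned)
```

Lowering the threshold reduces false negatives at the cost of more false positives, which is the trade-off the true loss function is meant to capture.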


How to choose a supervised method that minimizes the test error

  • In addition to choosing among methods, tune the parameters of each method, for example:

    • \(k\) in \(k\)-nearest neighbors.

    • The number of variables to include in forward or backward selection.

    • The order of a polynomial in polynomial regression.


Validation set approach

Using a validation set is one way to estimate the test error:

  • Divide the data into two parts.

  • Train each model on one part (the training set).

  • Compute the error on the remaining validation data.
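The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the data are synthetic (a noisy quadratic), the model is a polynomial fit with `np.polyfit`, the error is mean squared error, and the helper name `validation_mse` and the 50/50 split are choices made here, not something the notes prescribe.

```python
import numpy as np

def validation_mse(x, y, degree, train_fraction=0.5, seed=0):
    """Fit a polynomial on a random training part; return MSE on the rest."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))            # step 1: divide the data
    n_train = int(train_fraction * len(x))
    train, val = perm[:n_train], perm[n_train:]
    coefs = np.polyfit(x[train], y[train], degree)  # step 2: train
    residuals = y[val] - np.polyval(coefs, x[val])  # step 3: validate
    return float(np.mean(residuals ** 2))

# Synthetic data (assumed for illustration): quadratic signal plus noise.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + data_rng.normal(scale=0.5, size=200)

mse_linear = validation_mse(x, y, degree=1)
mse_quadratic = validation_mse(x, y, degree=2)
print(mse_linear, mse_quadratic)
```

Because both models are scored on the same held-out half, the comparison reflects error on data not used for fitting, which is what the validation set is meant to approximate.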

Fig 5.1

Fig. 29 Schematic of validation set approach.


Example: choosing order of polynomial

Fig 5.2

Fig. 30 Left: validation error as a function of degree. Right: multiple splits into validation and training.

  • Polynomial regression to estimate mpg from horsepower in the Auto data.

  • Problem: Every split yields a different estimate of the error.
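The variability across splits can be reproduced with a small experiment. The sketch below does not use the Auto data; it assumes synthetic data with a quadratic trend, a 50/50 split, and degrees 1 through 6, and the helper name `split_mse` is hypothetical. It repeats the validation split with different random seeds and records the validation error for each degree.

```python
import numpy as np

def split_mse(x, y, degree, seed):
    """Validation MSE of a degree-`degree` polynomial for one random 50/50 split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    half = len(x) // 2
    train, val = perm[:half], perm[half:]
    coefs = np.polyfit(x[train], y[train], degree)
    return float(np.mean((y[val] - np.polyval(coefs, x[val])) ** 2))

# Synthetic stand-in for the mpg/horsepower data (assumed for illustration).
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + data_rng.normal(scale=0.5, size=200)

degrees = list(range(1, 7))
seeds = list(range(10))

# errors[i, j] = validation MSE for split seeds[i] and degree degrees[j].
errors = np.array([[split_mse(x, y, d, s) for d in degrees] for s in seeds])

spread = errors.max(axis=0) - errors.min(axis=0)  # disagreement between splits
mean_error = errors.mean(axis=0)                  # average over splits
best_degree = degrees[int(np.argmin(mean_error))]

for d, m, s in zip(degrees, mean_error, spread):
    print(f"degree {d}: mean MSE {m:.3f}, range across splits {s:.3f}")
print("degree with lowest average validation MSE:", best_degree)
```

Each row of `errors` corresponds to one curve in the right panel of the figure: the curves share a general shape, but their levels, and sometimes their minimizers, change from split to split.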