Validation#
Thinking about the true loss function is important
Most of the regression methods we’ve studied aim to minimize the RSS, while classification methods aim to minimize the 0-1 loss.
In classification, we often care about certain kinds of error more than others; i.e. the natural loss function is not the 0-1 loss.
Even if we use a method which minimizes a certain kind of training error, we can tune it to optimize our true loss function.
Example: in the default study, we could find the threshold that brings the false negative rate below an acceptable level.
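As a minimal sketch of this kind of tuning, the snippet below scans candidate thresholds on predicted probabilities and keeps the largest one whose false negative rate stays within a target. The probabilities, labels, function names, and the 25% target are all hypothetical values chosen for illustration; they are not taken from the default study.

```python
def false_negative_rate(probs, labels, threshold):
    """Fraction of true positives (label 1) that are predicted negative."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    # A case is predicted positive when its probability is >= threshold.
    missed = sum(1 for p in positives if p < threshold)
    return missed / len(positives)


def largest_threshold_with_fnr_at_most(probs, labels, max_fnr):
    """Largest candidate threshold whose false negative rate is <= max_fnr."""
    # Scan from the highest threshold down; lowering it can only reduce the FNR.
    for t in sorted(set(probs), reverse=True):
        if false_negative_rate(probs, labels, t) <= max_fnr:
            return t
    return 0.0  # predicting every case as positive gives FNR = 0


# Hypothetical predicted probabilities of default and true labels (1 = default).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

threshold = largest_threshold_with_fnr_at_most(probs, labels, max_fnr=0.25)
print("FNR at the usual 0.5 cutoff:", false_negative_rate(probs, labels, 0.5))
print("chosen threshold:", threshold)
```

The largest admissible threshold is kept because lowering the threshold further would only add false positives. In practice the threshold would be chosen on data not used to fit the classifier.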
How to choose a supervised method that minimizes the test error#
In addition to choosing among methods, tune the parameters of each method, for example:
\(k\) in \(k\)-nearest neighbors.
The number of variables to include in forward or backward selection.
The order of a polynomial in polynomial regression.
Validation set approach#
Use of a validation set is one way to approximate the test error:
Divide the data into two parts.
Train each model with one part.
Compute the error on the remaining validation data.
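The three steps above can be sketched as follows, here used to pick \(k\) for \(k\)-nearest neighbors regression. This is an illustration under stated assumptions: the data are synthetic, the split is 50/50, the error is mean squared error, and `knn_predict` is a hypothetical helper written for this example rather than a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a smooth signal plus noise.
n = 200
x = rng.uniform(0, 3, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)


def knn_predict(x_train, y_train, x_new, k):
    """Predict each new point as the mean response of its k nearest neighbors."""
    dists = np.abs(x_new[:, None] - x_train[None, :])  # shape (n_new, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]          # indices of k closest
    return y_train[nearest].mean(axis=1)


# 1. Divide the data into two parts at random.
perm = rng.permutation(n)
train, val = perm[: n // 2], perm[n // 2 :]

# 2. Train each model with one part (for k-NN, "training" is storing that part).
# 3. Compute the error on the remaining validation data.
candidate_ks = [1, 5, 10, 25, 50, 100]
val_mse = {}
for k in candidate_ks:
    preds = knn_predict(x[train], y[train], x[val], k)
    val_mse[k] = float(np.mean((y[val] - preds) ** 2))

best_k = min(val_mse, key=val_mse.get)
print("validation MSE by k:", val_mse)
print("selected k:", best_k)
```

The validation points play no role in forming the predictions, so the resulting errors approximate test error rather than training error.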
Example: choosing order of polynomial#
Polynomial regression to estimate mpg from horsepower in the Auto data.
Problem: Every split yields a different estimate of the error.
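To make the variability across splits concrete, the sketch below repeats the validation set approach over 10 random splits for polynomial degrees 1 through 5. It does not load the real Auto data: the `horsepower` and `mpg` arrays are a synthetic stand-in with an assumed curved relationship, and the number of splits and the range of degrees are likewise choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the Auto data (an assumption, not the real data set):
# a decreasing, curved relationship between horsepower and mpg, plus noise.
n = 392
horsepower = rng.uniform(50, 200, size=n)
mpg = 60 - 0.45 * horsepower + 0.0012 * horsepower**2 + rng.normal(scale=4, size=n)

# Standardize the predictor so higher-order fits stay well conditioned.
z = (horsepower - horsepower.mean()) / horsepower.std()


def validation_mse(seed, degree):
    """Validation-set MSE of a polynomial of the given degree for one split."""
    split_rng = np.random.default_rng(seed)
    perm = split_rng.permutation(n)
    train, val = perm[: n // 2], perm[n // 2 :]
    coefs = np.polyfit(z[train], mpg[train], deg=degree)
    return float(np.mean((mpg[val] - np.polyval(coefs, z[val])) ** 2))


degrees = range(1, 6)
# Rows: 10 different random splits; columns: polynomial degrees 1..5.
errors = np.array([[validation_mse(seed, d) for d in degrees] for seed in range(10)])

print("degree chosen by each split:", errors.argmin(axis=1) + 1)
print("range of the degree-2 estimate:", errors[:, 1].min(), errors[:, 1].max())
```

Each row of `errors` is one complete run of the validation set approach. Comparing rows shows how much the error estimate, and possibly the selected degree, depends on which observations happened to land in the validation half.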