\(K\)-fold cross-validation#

Algorithm 5.3: \(K\)-fold CV#

  • Split the data into \(K\) subsets or folds.

  • For every \(i=1,\dots,K\):

    • train the model on every fold except the \(i\)th fold,

    • compute the test error on the \(i\)th fold.

  • Average the test errors.
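
A minimal sketch of this procedure in Python, assuming a NumPy feature matrix `X`, a response vector `y`, and a scikit-learn-style estimator (the simulated data and the choice \(K=5\) below are placeholders for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def k_fold_cv(X, y, model, K=5, seed=0):
    """K-fold CV estimate of the test MSE, following the algorithm above."""
    n = len(y)
    rng = np.random.default_rng(seed)
    # Randomly assign each observation to one of the K folds.
    fold_id = rng.permutation(np.arange(n) % K)
    errors = []
    for i in range(K):
        train, test = fold_id != i, fold_id == i
        # Train the model on every fold except the i-th ...
        model.fit(X[train], y[train])
        # ... and compute the test error on the i-th fold.
        errors.append(mean_squared_error(y[test], model.predict(X[test])))
    # Average the K test errors.
    return np.mean(errors)

# Illustrative use on simulated linear data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.standard_normal(100)
print(k_fold_cv(X, y, LinearRegression(), K=5))
```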


Schematic for \(K\)-fold CV#

Fig 5.5

Fig. 32 Schematic of the \(K\)-fold CV approach.#


LOOCV vs. \(K\)-fold cross-validation#

Fig 5.4

Fig. 33 Comparison of LOOCV and \(K\)-fold CV.#


Comments#

  • The \(K\)-fold CV estimate depends (somewhat) on the particular random split into folds.

  • In \(K\)-fold CV, each model is trained on less data than in LOOCV (a fraction \((K-1)/K\) of the sample rather than \(n-1\) observations). This introduces bias into the estimate of the test error.

  • In LOOCV, the \(n\) training sets highly resemble each other, so the resulting error estimates are strongly correlated. This increases the variance of the test error estimate.

  • \(n\)-fold CV is equivalent to LOOCV.
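
As a quick check of the last point, a small sketch (assuming scikit-learn is available) comparing the folds produced by `KFold` with `n_splits = n` to those of `LeaveOneOut` on a toy array:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # toy data with n = 6 observations

# With n_splits = n (and no shuffling), each fold holds exactly one
# observation, so K-fold CV reduces to leave-one-out CV.
kfold_tests = [test for _, test in KFold(n_splits=len(X)).split(X)]
loo_tests = [test for _, test in LeaveOneOut().split(X)]

assert all(np.array_equal(a, b) for a, b in zip(kfold_tests, loo_tests))
```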


Choosing an optimal model#

Fig 5.6

Fig. 34 Comparison of the LOOCV and \(K\)-fold CV estimates with the true test MSE.#

Even if the error estimates are off, choosing the model with the minimum cross-validation error often leads to a method with near-minimum test error.
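
For instance, a sketch of this model-selection step, assuming simulated one-dimensional data and polynomial regression fits of degree 1 to 10 (the data-generating function and the use of 10-fold CV are illustrative, not taken from the figure):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated data: a smooth signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
X = x.reshape(-1, 1)

# 10-fold CV estimate of the test MSE for each polynomial degree.
degrees = list(range(1, 11))
cv_mse = [
    -cross_val_score(
        make_pipeline(PolynomialFeatures(d), LinearRegression()),
        X, y, cv=10, scoring="neg_mean_squared_error",
    ).mean()
    for d in degrees
]

# Pick the degree with the minimum CV error.
best_degree = degrees[int(np.argmin(cv_mse))]
print(best_degree, min(cv_mse))
```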


Fig 5.7

In a classification problem, things look similar.

  • Estimated decision boundaries from logistic regression with polynomial predictors of increasing degree.

  • The Bayes decision boundary, shown for comparison.


Choosing an optimal model#

Fig 5.8

The error curves for this classification problem tell a similar story.

  • The cubic model has the lowest test error.

  • The quartic model has the lowest CV error.

  • The two error curves nevertheless look similar.


The one standard error (1SE) rule of thumb#

Fig 7.3 (ESL)

  • Setting: subset size chosen by forward stepwise selection.

  • Curves: 10-fold cross-validation error and the true test error.

  • 1-SE rule of thumb:

    • A number of models with \(10\le p\le 15\) have almost the same CV error.

    • The vertical bars represent one standard error of the CV error estimate, computed from the 10 fold errors.

    • Choose the simplest model whose CV error is no more than one standard error above the model with the lowest CV error.
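
A small sketch of the rule, assuming the per-fold CV errors of each candidate model have already been collected in a matrix (the helper name `one_se_rule` and the input layout are illustrative):

```python
import numpy as np

def one_se_rule(fold_errors, complexity):
    """Index of the model chosen by the one-standard-error rule.

    fold_errors: array of shape (n_models, n_folds), the CV error of each
                 candidate model on each fold.
    complexity:  array of shape (n_models,), each model's complexity
                 (e.g. number of predictors p); lower means simpler.
    """
    fold_errors = np.asarray(fold_errors)
    mean_err = fold_errors.mean(axis=1)
    # Standard error of each model's mean CV error, from its K fold errors.
    se = fold_errors.std(axis=1, ddof=1) / np.sqrt(fold_errors.shape[1])
    best = np.argmin(mean_err)
    threshold = mean_err[best] + se[best]
    # Models whose CV error is within one SE of the best CV error ...
    eligible = np.where(mean_err <= threshold)[0]
    # ... among which we return the simplest one.
    return eligible[np.argmin(np.asarray(complexity)[eligible])]
```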


The wrong way to do cross validation#

  • Reading: Section 7.10.2 of The Elements of Statistical Learning.

  • We want to classify 200 individuals according to whether they have cancer or not.

  • We want to fit a logistic regression using 1000 measurements of gene expression as candidate predictors.

  • Proposed strategy:

    1. Using all the data, select the 20 most significant genes using \(z\)-tests.

    2. Estimate the test error of logistic regression with these 20 predictors via 10-fold cross validation.


  • To see how that works, let’s use the following simulated data:

    1. Each gene expression is standard normal and independent of all others.

    2. The response (cancer or not) is sampled from a coin flip — no correlation to any of the “genes”.

  • Q: What should the misclassification rate be for any classification method using these predictors?

  • A: Roughly 50%.


  • We run this simulation, and obtain a CV error rate of 3%!

  • Why?

    • We only have 200 individuals but 1000 candidate variables, so some variables will appear correlated with the response purely by chance.

    • Because the variable selection used all of the data, the selected variables retain this spurious correlation with the response in every subset or fold of the cross-validation, including the held-out folds.
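
A sketch of this simulation (illustrative only: a two-sample \(t\)-statistic stands in for the \(z\)-tests, and scikit-learn's `cross_val_score` performs the 10-fold CV):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 200, 1000
X = rng.standard_normal((n, p))   # independent "gene expressions"
y = rng.integers(0, 2, size=n)    # coin-flip response, unrelated to X

# WRONG: select the 20 most significant genes using ALL of the data ...
t = ttest_ind(X[y == 1], X[y == 0]).statistic
top20 = np.argsort(np.abs(t))[-20:]

# ... and then cross-validate only the final model fit.
acc = cross_val_score(LogisticRegression(max_iter=1000),
                      X[:, top20], y, cv=10).mean()
print(f"wrong-way CV error: {1 - acc:.1%}")  # far too optimistic
```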


The right way to do cross validation#

  1. Divide the data into 10 folds.

  2. For \(i=1,\dots,10\):

    1. Using every fold except fold \(i\), perform the variable selection and fit the model with the selected variables.

    2. Compute the error on fold \(i\).

  3. Average the 10 test errors obtained.

  • In our simulation, this produces an error estimate close to 50% (see the sketch below).

  • Moral of the story: Every aspect of the learning method that involves using the data — variable selection, for example — must be cross-validated.
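
A sketch of the correct procedure on the same simulated setup, with the variable selection repeated inside every fold (again using a \(t\)-statistic as a stand-in for the \(z\)-tests):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
n, p = 200, 1000
X = rng.standard_normal((n, p))   # same simulated setup as above
y = rng.integers(0, 2, size=n)

errors = []
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train, test in folds.split(X, y):
    # Step 2a: variable selection using the training folds only.
    t = ttest_ind(X[train][y[train] == 1], X[train][y[train] == 0]).statistic
    top20 = np.argsort(np.abs(t))[-20:]
    model = LogisticRegression(max_iter=1000).fit(X[train][:, top20], y[train])
    # Step 2b: error on the held-out fold.
    errors.append(np.mean(model.predict(X[test][:, top20]) != y[test]))

# Step 3: average the 10 test errors; this should come out close to 50%.
print(f"right-way CV error: {np.mean(errors):.1%}")
```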