Resampling

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Validation

Thinking about the true loss function is important

How to choose a supervised method that minimizes the test error

Validation set approach

Use of a validation set is one way to approximate the test error:

Schematic of validation set approach.

Example: choosing order of polynomial

Left: validation error as a function of degree. Right: multiple splits into validation and training.
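A minimal sketch of this example, assuming the Auto data (mpg vs. horsepower) from the ISLR2 package — the dataset is an assumption, not named on the slide:

# Validation set approach: estimate the test MSE for polynomial degrees 1-10,
# training on one random half and validating on the other.
library(ISLR2)
set.seed(1)
train = sample(nrow(Auto), nrow(Auto) / 2)
val.mse = sapply(1:10, function(d) {
  fit = lm(mpg ~ poly(horsepower, d), data=Auto, subset=train)
  mean((Auto$mpg - predict(fit, Auto))[-train]^2)  # MSE on the held-out half
})
which.min(val.mse)  # degree with the smallest validation error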

Leave one out cross-validation (LOOCV)

Regression

\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n (y_i - \color{Red}{\hat y_i^{(-i)}})^2,\]

where \(\hat y_i^{(-i)}\) is the prediction for observation \(i\) from the model fit on the other \(n-1\) observations.

Schematic for LOOCV

Schematic of the leave-one-out cross-validation (LOOCV) approach.
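A sketch, again assuming the Auto example: boot::cv.glm computes the LOOCV statistic when K is left at its default of \(n\) (a Gaussian glm is the same fit as lm):

library(boot)
library(ISLR2)
glm.fit = glm(mpg ~ horsepower, data=Auto)  # gaussian family: identical to lm
cv.err = cv.glm(Auto, glm.fit)              # K defaults to n, i.e. leave-one-out
cv.err$delta[1]                             # raw LOOCV estimate of the test MSE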

Classification

\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(y_i \neq \color{Red}{\hat y_i^{(-i)}})\]

Shortcut for linear regression

\[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i-\hat y_i}{1-h_{ii}}\right)^2,\]

where \(\hat y_i\) is the fitted value from the full least-squares fit and \(h_{ii}\) is the leverage of observation \(i\) (the \(i\)th diagonal entry of the hat matrix).
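This identity means LOOCV for least squares costs a single fit. A sketch under the same Auto assumption, with hatvalues() giving the leverages:

fit = lm(mpg ~ horsepower, data=Auto)
h = hatvalues(fit)                   # leverages h_ii
mean((residuals(fit) / (1 - h))^2)   # equals CV_(n) computed by n refits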

\(K\)-fold cross-validation

Algorithm 5.3: \(K\)-fold CV

Schematic for \(K\)-fold CV

Schematic of the \(K\)-fold CV approach.
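A sketch of 10-fold CV over polynomial degrees, with the same assumed Auto example; cv.glm with K = 10 performs the fold splitting internally:

set.seed(2)
cv10 = sapply(1:10, function(d) {
  fit = glm(mpg ~ poly(horsepower, d), data=Auto)
  cv.glm(Auto, fit, K=10)$delta[1]   # 10-fold CV estimate of the test MSE
})
which.min(cv10)                      # degree chosen by 10-fold CV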

LOOCV vs. \(K\)-fold cross-validation

Comparison of LOOCV and \(K\)-fold CV.

Comments

Choosing an optimal model

Comparison of LOOCV and \(K\)-fold CV estimates with the true test MSE.

Even if the error estimates are off, choosing the model with the minimum cross-validation error (10-fold, in orange) often yields a method whose test error is close to the minimum.

The picture is similar in classification problems, with the misclassification rate in place of the MSE.

Choosing an optimal model

The one standard error (1SE) rule of thumb

Among the models whose CV error is within one standard error of the minimum, choose the simplest one.

The wrong way to do cross-validation

If the variable selection step sees all of the data before the folds are formed, the held-out folds are no longer truly held out, and the CV error is biased downward.

The right way to do cross-validation

  1. Divide the data into 10 folds.

  2. For \(i=1,\dots,10\):

    1. Using every fold except \(i\), perform the variable selection and fit the model with the selected variables.
    2. Compute the error on fold \(i\).

  3. Average the 10 test errors obtained (see the sketch below).
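A sketch of the procedure; dat, the response y, and the screening function select.vars() are hypothetical stand-ins for whatever selection rule is being validated:

set.seed(3)
folds = sample(rep(1:10, length.out = nrow(dat)))  # random fold labels
errs = sapply(1:10, function(i) {
  train = dat[folds != i, ]
  test  = dat[folds == i, ]
  vars  = select.vars(train)             # hypothetical: selection sees training folds only
  fit   = lm(reformulate(vars, "y"), data=train)
  mean((test$y - predict(fit, test))^2)  # error on held-out fold i
})
mean(errs)                               # CV estimate of the test error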

Bootstrap

Cross-validation vs. the Bootstrap

Bootstrap

Brad Efron

Standard errors in linear regression from a sample of size \(n\)

# Load the Advertising data and regress sales on the TV advertising budget
Advertising = read.csv('https://www.statlearning.com/s/Advertising.csv')
M.sales = lm(sales ~ TV, data=Advertising)
summary(M.sales)  # reports the classical standard errors of the coefficients
## 
## Call:
## lm(formula = sales ~ TV, data = Advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Classical way to compute Standard Errors

Limitations of the classical approach
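The bootstrap (developed below) offers an alternative. A sketch, reusing the Advertising fit above: resample the rows with replacement, refit, and take the standard deviation of the coefficients across resamples; boot::boot handles the resampling.

library(boot)
boot.fn = function(data, index)
  coef(lm(sales ~ TV, data=data, subset=index))  # refit on one resample
set.seed(4)
boot(Advertising, boot.fn, R=1000)  # compare "std. error" to summary() above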

Example: Investing in two assets

We invest a fraction \(\alpha\) of our funds in asset \(X\) and the rest in asset \(Y\), for a return of

\[\alpha X + (1-\alpha) Y.\]

The variance of this return is minimized by

\[\alpha = \frac{\sigma_Y^2 - \text{Cov}(X,Y)}{\sigma_X^2 + \sigma_Y^2 -2\text{Cov}(X,Y)}.\]

In practice, we plug in sample estimates of the variances and covariance:

\[\widehat \alpha = \frac{\widehat \sigma_Y^2 - \widehat{ \text{Cov}}(X,Y)}{\widehat \sigma_X^2 + \widehat \sigma_Y^2 -2\widehat{ \text{Cov}}(X,Y)}.\]

Resampling the data from the true distribution

Computing the standard error of \(\widehat \alpha\)

From each simulated dataset

\[(x_1^{(s)}, y_1^{(s)}), \dots, (x_n^{(s)}, y_n^{(s)}), \qquad s = 1, 2, \dots,\]

we can compute a value of the estimate, yielding \(\widehat \alpha^{(1)},\widehat \alpha^{(2)},\dots\); the standard deviation of these values approximates the standard error of \(\widehat \alpha\).
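A sketch of this ideal procedure, assuming \((X,Y)\) bivariate normal with \(\sigma_X^2 = 1\), \(\sigma_Y^2 = 1.25\), and \(\text{Cov}(X,Y) = 0.5\) (assumed values, for which \(\alpha = 0.6\)):

library(MASS)
Sigma = matrix(c(1, 0.5, 0.5, 1.25), 2, 2)      # assumed true covariance of (X, Y)
alpha.hat = replicate(1000, {
  s = mvrnorm(100, mu=c(0, 0), Sigma=Sigma)     # one fresh dataset of n = 100
  (var(s[, 2]) - cov(s[, 1], s[, 2])) /
    (var(s[, 1]) + var(s[, 2]) - 2 * cov(s[, 1], s[, 2]))
})
sd(alpha.hat)                                   # SD across simulations: the ideal SE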

In reality, we only have \(n\) samples

A single panel of Fig 5.9

The bootstrap substitutes the empirical distribution of the sample for the true distribution:

\[\widehat P(X,Y) = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i,y_i)}.\]

Sampling from \(\widehat P\) amounts to drawing \(n\) observations from the original data with replacement.

A schematic of the Bootstrap

A single dataset

Comparing Bootstrap sampling to sampling from the true distribution
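A sketch of the bootstrap estimate of \(\text{SE}(\widehat \alpha)\), assuming the Portfolio data (columns X and Y) from the ISLR2 package: each replicate resamples the \(n\) rows with replacement and recomputes \(\widehat \alpha\).

library(boot)
library(ISLR2)
alpha.fn = function(data, index) {
  X = data$X[index]; Y = data$Y[index]           # one bootstrap resample of rows
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
set.seed(5)
boot(Portfolio, alpha.fn, R=1000)                # reports the bootstrap SE of alpha-hat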