Thinking about the true loss function is important
Most of the regression methods we’ve studied aim to minimize the RSS, while classification methods aim to minimize the 0-1 loss.
In classification, we often care about certain kinds of error more than others; i.e. the natural loss function is not the 0-1 loss.
Even if we use a method which minimizes a certain kind of training error, we can tune it to optimize our true loss function.
Example: in the Default study, we could find the threshold that brings the false negative rate below an acceptable level.
Use of a validation set is one way to approximate the test error:
Schematic of validation set approach.
Left: validation error as a function of degree. Right: multiple splits into validation and training.
Polynomial regression to estimate mpg from horsepower in the Auto data.
Problem: Every split yields a different estimate of the error.
\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n (y_i - \color{Red}{\hat y_i^{(-i)}})^2\]
Schematic of the leave-one-out cross-validation (LOOCV) approach.
\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(y_i \neq \color{Red}{\hat y_i^{(-i)}})\]
Computing \(\text{CV}_{(n)}\) can be computationally expensive, since it involves fitting the model \(n\) times.
For linear regression, there is a shortcut:
\[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i-\hat y_i}{1-h_{ii}}\right)^2\]
Above, \(h_{ii}\) is the leverage statistic.
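A minimal sketch of this shortcut in R, using the built-in mtcars data as a stand-in (the variable names are illustrative, not from the slides):

fit = lm(mpg ~ hp, data = mtcars)
h = hatvalues(fit)                                   # leverage statistics h_ii
cv_shortcut = mean(((mtcars$mpg - fitted(fit)) / (1 - h))^2)
# brute-force LOOCV for comparison: refit the model n times, leaving out one observation each time
n = nrow(mtcars)
cv_brute = mean(sapply(1:n, function(i) {
  fit_i = lm(mpg ~ hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))
c(cv_shortcut, cv_brute)                             # the two agree

The two values agree exactly, since the shortcut is an algebraic identity for least squares fits.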
Approximate versions sometimes used for logistic regression…
Split the data into \(K\) subsets or folds.
For every \(i=1,\dots,K\): fit the model on all the folds except the \(i\)-th, and compute the test error on the held-out fold \(i\).
Average the test errors.
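A hedged sketch of this procedure in R (again using mtcars purely for illustration; the fold assignment and the quadratic model are assumptions, not from the slides):

set.seed(1)
K = 10
n = nrow(mtcars)
folds = sample(rep(1:K, length.out = n))             # randomly assign each observation to one of K folds
cv_errors = sapply(1:K, function(k) {
  train = mtcars[folds != k, ]
  test  = mtcars[folds == k, ]
  fit = lm(mpg ~ poly(hp, 2), data = train)          # fit on all folds except the k-th
  mean((test$mpg - predict(fit, newdata = test))^2)  # test error on fold k
})
mean(cv_errors)                                      # K-fold CV estimate of the test MSE

Different random fold assignments give slightly different estimates, which is the split-dependence noted in the comparison that follows.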
Schematic of the \(K\)-fold CV approach.
Comparison of LOOCV and \(K\)-fold CV.
\(K\)-fold CV depends on the chosen split (somewhat).
In \(K\)-fold CV, we train the model on less data than LOOCV does. This introduces some bias into the estimates of the test error.
In LOOCV, the training sets highly resemble each other. This increases the variance of the test error estimate.
\(n\)-fold CV is equivalent to LOOCV.
Comparison of LOOCV and \(K\)-fold CV to test MSE.
Even if the error estimates are off, choosing the model with the minimum cross-validation error (10-fold, in orange) often leads to a method with near-minimum test error.
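To make this model-selection step concrete, here is a small sketch in the spirit of the ISLR labs, using boot::cv.glm and mtcars as a stand-in dataset (degree range and data are assumptions for illustration):

library(boot)
set.seed(1)
cv_err = rep(0, 5)
for (d in 1:5) {
  fit = glm(mpg ~ poly(hp, d), data = mtcars)        # polynomial regression of degree d
  cv_err[d] = cv.glm(mtcars, fit, K = 10)$delta[1]   # 10-fold CV estimate of its test MSE
}
which.min(cv_err)                                    # degree chosen by cross-validation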
In a classification problem, things look similar.
Dashed lines: logistic regression decision boundaries with polynomial predictors of increasing degree, together with the Bayes decision boundary.
Cubic model has best test error.
Quartic has best CV.
Curves look similar.
Q: Why doesn’t training error keep decreasing?
Forward stepwise selection (we’ll see in more detail shortly)
10-fold cross-validation error vs. true test error.
1-SE rule of thumb: choose the most parsimonious model whose CV error is within one standard error of the minimum CV error.
Reading: Section 7.10.2 of The Elements of Statistical Learning.
We want to classify 200 individuals according to whether they have cancer or not.
We use logistic regression onto 1000 measurements of gene expression.
Proposed strategy: using all 200 individuals, select the genes most strongly associated with the cancer labels; then fit the logistic regression on those genes and estimate its error rate with 10-fold cross-validation.
We run this simulation, in which the labels are generated independently of the gene expressions (so no classifier can truly do better than 50% error), and obtain a CV error rate of 3%!
Why?
Divide the data into 10 folds.
For \(i=1,\dots,10\): using only the data outside fold \(i\), select the genes and fit the logistic regression; then compute the error on fold \(i\).
In our simulation, this produces an error estimate of close to 50%.
Moral of the story: Every aspect of the learning method that involves using the data — variable selection, for example — must be cross-validated.
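A sketch of a simulation along these lines (the number of screened genes, the classifier details, and all variable names are assumptions; the qualitative gap between the two estimates is the point):

set.seed(1)
n = 200; p = 1000
x = matrix(rnorm(n * p), n, p)                       # gene expression with no real signal
y = rbinom(n, 1, 0.5)                                # labels independent of x: no method beats 50% error

top_genes = function(x, y, m = 20) {                 # screen the m genes most correlated with the labels
  order(abs(cor(x, y)), decreasing = TRUE)[1:m]
}
folds = sample(rep(1:10, length.out = n))

# Wrong: screen the genes on ALL the data, then cross-validate only the model fitting
sel = top_genes(x, y)
err_wrong = mean(sapply(1:10, function(k) {
  fit = glm(y[folds != k] ~ x[folds != k, sel], family = binomial)
  pred = cbind(1, x[folds == k, sel]) %*% coef(fit) > 0
  mean(pred != y[folds == k])
}))

# Right: redo the screening inside each fold, using only the training data
err_right = mean(sapply(1:10, function(k) {
  sel_k = top_genes(x[folds != k, ], y[folds != k])
  fit = glm(y[folds != k] ~ x[folds != k, sel_k], family = binomial)
  pred = cbind(1, x[folds == k, sel_k]) %*% coef(fit) > 0
  mean(pred != y[folds == k])
}))
c(err_wrong, err_right)                              # err_wrong is optimistically low; err_right is near 0.5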
Cross-validation: provides estimates of the (test) error
The Bootstrap: provides the (standard) error of estimates
Brad Efron
One of the most important techniques in all of Statistics.
Computer intensive method.
Popularized by Brad Efron \(\leftarrow\) Stanford pride!
Advertising = read.csv('https://www.statlearning.com/s/Advertising.csv')
M.sales = lm(sales ~ TV, data=Advertising)
summary(M.sales)
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.032594   0.457843   15.36   <2e-16 ***
## TV           0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
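As a preview of how the bootstrap answers this kind of question (a hedged sketch; the resampling mechanics are spelled out later in this section), one can resample rows of Advertising and recompute the TV coefficient:

set.seed(1)
B = 1000
boot_coefs = replicate(B, {
  idx = sample(nrow(Advertising), replace = TRUE)    # draw n rows with replacement
  coef(lm(sales ~ TV, data = Advertising[idx, ]))["TV"]
})
sd(boot_coefs)                                       # bootstrap SE of the TV coefficient; compare with 0.002691 above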
Example: estimating the population variance \(\sigma^2\) from a sample \(x_1,x_2,\dots,x_n\):
Unbiased estimate of \(\sigma^2\): \[\hat \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\overline x)^2.\]
What is the Standard Error of \(\hat \sigma^2\)?
Suppose that \(X\) and \(Y\) are the returns of two assets.
These returns are observed every day: \((x_1,y_1),\dots,(x_n,y_n)\).
We have a fixed amount of money to invest and we will invest a fraction \(\alpha\) on \(X\) and a fraction \((1-\alpha)\) on \(Y\).
Therefore, our return will be
\[\alpha X + (1-\alpha) Y.\]
Our goal will be to minimize the variance of our return as a function of \(\alpha\).
One can show that the optimal \(\alpha\) is:
\[\alpha = \frac{\sigma_Y^2 - \text{Cov}(X,Y)}{\sigma_X^2 + \sigma_Y^2 -2\text{Cov}(X,Y)}.\]
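One quick way to see this (a standard calculation, sketched here for completeness): the variance of the return is
\[\text{Var}\big(\alpha X + (1-\alpha) Y\big) = \alpha^2\sigma_X^2 + (1-\alpha)^2\sigma_Y^2 + 2\alpha(1-\alpha)\,\text{Cov}(X,Y),\]
and setting its derivative with respect to \(\alpha\) equal to zero and solving yields the expression above.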
We estimate \(\alpha\) by plugging in the sample variances and covariance:
\[\widehat \alpha = \frac{\widehat \sigma_Y^2 - \widehat{ \text{Cov}}(X,Y)}{\widehat \sigma_X^2 + \widehat \sigma_Y^2 -2\widehat{ \text{Cov}}(X,Y)}.\]
Suppose we compute the estimate \(\widehat\alpha = 0.6\) using the samples \((x_1,y_1),\dots,(x_n,y_n)\).
How sure can we be of this value? (A somewhat vague question.)
If we had sampled the observations on a different set of 100 days, would we get a wildly different \(\widehat \alpha\)? (A more precise question.)
In this thought experiment, we know the actual joint distribution \(P(X,Y)\), so we can resample the \(n\) observations to our hearts’ content.
True distribution of \(\widehat{\alpha}\)
We will use \(S\) samples to estimate the standard error of \(\widehat{\alpha}\).
For each sample of the data, \(1 \leq s \leq S\),
\[(x_1^{(s)},y_1^{(s)}),\dots,(x_n^{(s)},y_n^{(s)}),\]
we can compute a value of the estimate, yielding \(\widehat \alpha^{(1)},\widehat \alpha^{(2)},\dots,\widehat \alpha^{(S)}\).
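From these \(S\) values, a natural estimate of the standard error (a standard formula, stated here for completeness) is
\[\widehat{\text{SE}}(\widehat\alpha) = \sqrt{\frac{1}{S-1}\sum_{s=1}^{S}\left(\widehat\alpha^{(s)} - \frac{1}{S}\sum_{s'=1}^{S}\widehat\alpha^{(s')}\right)^2}.\]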
A single panel of Fig 5.9
In practice we do not know \(P(X,Y)\) and cannot draw fresh samples from it. The bootstrap idea: sample instead from the empirical distribution of the data,
\[\widehat P(X,Y) = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i,y_i)}.\]
Equivalently, resample the data by drawing \(n\) samples with replacement from the actual observations.
Why it works: variances computed under the empirical distribution are good approximations of variances computed under the true distribution (in many cases).
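A minimal sketch of this resampling in R, with simulated returns standing in for the real \((x_i, y_i)\) data (the data-generating step is an assumption for illustration):

set.seed(1)
n = 100
returns = data.frame(X = rnorm(n), Y = rnorm(n))     # placeholder returns; any observed (x_i, y_i) pairs work

alpha_hat = function(d) {                            # plug-in estimate of alpha from a dataset d
  (var(d$Y) - cov(d$X, d$Y)) / (var(d$X) + var(d$Y) - 2 * cov(d$X, d$Y))
}

B = 1000
alpha_star = replicate(B, {
  d_star = returns[sample(n, n, replace = TRUE), ]   # resample the rows with replacement
  alpha_hat(d_star)
})
sd(alpha_star)                                       # bootstrap estimate of SE(alpha-hat)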
A single dataset
Left panel is population distribution of \(\widehat{\alpha}\) – centered (approximately) around the true \(\alpha\).
Middle panel is bootstrap distribution of \(\widehat{\alpha}\) – centered (approximately) around observed \(\widehat{\alpha}\).