Stepwise selection methods#

Best subset selection has two problems:

  1. It is often very expensive computationally. We have to fit \(2^p\) models!

  2. Even for a fixed \(k\), searching over so many possibilities increases our chances of overfitting: the selected model has high variance.

In order to mitigate these problems, we can restrict our search space for the best model. This reduces the variance of the selected model at the expense of an increase in bias.


Forward selection#

Algorithm (6.2 of ISLR)#

  1. Let \({\cal M}_0\) denote the null model, which contains no predictors.

  2. For \(k=0, \dots, p-1\):

    1. Consider all \(p-k\) models that augment the predictors in \({\cal M}_k\) with one additional predictor.

    2. Choose the best among these \(p-k\) models, and call it \({\cal M}_{k+1}\). Here best is defined as having the smallest RSS or the largest \(R^2\).

  3. Select a single best model from among \({\cal M}_0, \dots, {\cal M}_p\) using cross-validated prediction error, AIC, BIC or adjusted \(R^2\).
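
The loop in step 2 can be written out directly with repeated calls to lm. Below is a minimal sketch, assuming the Credit data from the ISLR package (used in the next subsection) with response Balance; forward_path is an illustrative helper, not a library function.

library(ISLR)   # for the Credit data used in the examples below

# Illustrative helper: greedy forward selection by RSS, following Algorithm 6.2.
# `data` is a data frame, `response` is the name of the response column.
forward_path = function(data, response) {
  candidates = setdiff(names(data), response)
  selected = c()
  path = list()
  for (k in seq_along(candidates)) {
    remaining = setdiff(candidates, selected)
    rss = sapply(remaining, function(v) {
      f = reformulate(c(selected, v), response=response)
      sum(resid(lm(f, data=data))^2)
    })
    selected = c(selected, names(which.min(rss)))  # add the variable that lowers RSS most
    path[[k]] = selected                           # this is M_k
  }
  path
}

fwd = forward_path(Credit, "Balance")
fwd[[4]]   # predictors in the model of size 4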


Forward selection in R using step#

library(ISLR) # where `Credit` lives
M.full = lm(Balance ~ ., data=Credit)  # full model: all predictors
M.null = lm(Balance ~ 1, data=Credit)  # null model: intercept only
step(M.null, scope=list(upper=M.full), direction='forward', trace=0)
  • The first four variables to enter are Rating, Income, Student, Limit (the order of entry can be inspected as shown below).

  • This differs from the best subset of size 4.
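
To see the order of entry explicitly, keep the object returned by step: it carries an anova component recording the stepwise path.

M.fwd = step(M.null, scope=list(upper=M.full), direction='forward', trace=0)
M.fwd$anova   # one row per step, showing which variable was added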


Backward selection#

Algorithm (6.3 of ISLR)#

  1. Let \({\cal M}_p\) denote the full model, which contains all \(p\) predictors.

  2. For \(k=p,p-1,\dots,1\):

    1. Consider all \(k\) models that contain all but one of the predictors in \({\cal M}_k\) for a total of \(k-1\) predictors.

    2. Choose the best among these \(k\) models, and call it \({\cal M}_{k-1}\). Here best is defined as having the smallest RSS or the largest \(R^2\).

  3. Select a single best model from among \({\cal M}_0, \dots, {\cal M}_p\) using cross-validated prediction error, AIC, BIC or adjusted \(R^2\).
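
A matching hand-rolled sketch of this algorithm, again using RSS as in the forward-selection sketch above; backward_path is an illustrative helper, not a library function. At each step it drops the variable whose removal increases the RSS the least.

# Illustrative helper: greedy backward elimination by RSS, following Algorithm 6.3.
backward_path = function(data, response) {
  selected = setdiff(names(data), response)
  path = list(selected)                  # M_p: the full model
  while (length(selected) > 1) {
    rss = sapply(selected, function(v) {
      f = reformulate(setdiff(selected, v), response=response)
      sum(resid(lm(f, data=data))^2)
    })
    selected = setdiff(selected, names(which.min(rss)))  # drop the least useful variable
    path[[length(path) + 1]] = selected
  }
  path                                   # M_p, M_{p-1}, ..., M_1
}

bwd = backward_path(Credit, "Balance")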


Backward selection in R using step#

step(M.full, scope=list(lower=M.null), direction='backward', trace=0)

Forward selection with BIC#

step(M.null, scope=list(upper=M.full), direction='forward', trace=0, k=log(nrow(Credit))) # k = log(n) gives BIC; the default k = 2 gives AIC
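
The k argument is the penalty per parameter: at each step, step chooses the model minimizing (up to an additive constant)

\[-2 \log L({\cal M}) + k \cdot d({\cal M}),\]

where \(d({\cal M})\) is the number of estimated parameters in \({\cal M}\). Thus \(k=2\) corresponds to AIC and \(k=\log n\) to BIC.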

Forward vs. backward selection#

  • You cannot apply backward selection when \(p>n\), since the full model cannot be fit.

  • Although we might like them to, they need not produce the same sequence of models.

  • Example: let \(X_1,X_2 \sim \mathcal{N}(0,\sigma^2)\) be independent and set

\[\begin{split} \begin{aligned} X_3 &= X_1 + 3 X_2 \\ Y &= X_1 + 2X_2 + \epsilon \end{aligned} \end{split}\]
  • Regress \(Y\) onto \(X_1,X_2,X_3\). Since \(X_3\) is the single predictor most highly correlated with \(Y\), forward selection picks it first:

    • Forward: \(\{X_3\} \to \mathbf{\{X_3,X_2\}} \text{ or } \mathbf{\{X_3,X_1\}} \to \{X_3,X_2,X_1\}\)

    • Backward: \(\{X_1,X_2,X_3\} \to \mathbf{ \{X_1,X_2\}} \to \{X_2\}\)


Forward vs. backward selection: simulation#

n = 5000
X1 = rnorm(n)
X2 = rnorm(n)
X3 = X1 + 3 * X2           # X3 is an exact linear combination of X1 and X2
Y = X1 + 2 * X2 + rnorm(n) # true model uses only X1 and X2
D = data.frame(X1, X2, X3, Y)

Forward#

step(lm(Y ~ 1, data=D), scope=list(upper=~ X1 + X2 + X3), direction='forward')

Backward#

step(lm(Y ~ X1 + X2 + X3, data=D), scope=list(lower=~ 1), direction='backward')

Other stepwise selection strategies#

  • Mixed stepwise selection (direction='both'): do forward selection, but at every step remove any variables that are no longer necessary (see the example after this list).

  • Forward stagewise selection: Roughly speaking, don’t add in the variable fully at each step…

  • \(\dots\)
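
For example, mixed stepwise selection on the Credit data (reusing M.null and M.full from above) starts from the null model and is allowed to both add and drop a variable at every step:

step(M.null, scope=list(lower=M.null, upper=M.full), direction='both', trace=0)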