Bootstrapping a linear regression

STATS 191

2024-04-01

Bootstrapping linear regression

Our multiple linear regression model:

\[ Y|X = X\beta + \epsilon \]

  • We’ve talked about checking assumptions.

  • What to do if the assumptions don’t hold?

  • We will use the bootstrap!

Random \(X\): pairs bootstrap

  • Suppose we think of the pairs \((X_i, Y_i)\) coming from some distribution \(F\) – this is a distribution for both the features and the outcome.

  • In our usual model, \(\beta\) is clearly defined. What is \(\beta\) without this assumption?

Population least squares

  • For our distribution \(F\), we can define

\[ E_F[\pmb{X}\pmb{X}^T], \qquad E_F[\pmb{X} \cdot \pmb{Y}] \]

where \((\pmb{X}, \pmb{Y}) \sim F\) leading to

\[ \beta(F) = \left(E_F[\pmb{X}\pmb{X}^T]\right)^{-1} E_F[\pmb{X} \cdot \pmb{Y}]. \]
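A minimal R sketch of this plug-in formula, with the expectations replaced by sample averages (assuming a design matrix X and response vector Y already exist):

XtX_bar = crossprod(X) / nrow(X)       # estimates E_F[X X^T]
XtY_bar = crossprod(X, Y) / nrow(X)    # estimates E_F[X * Y]
beta_plugin = solve(XtX_bar, XtY_bar)  # the 1/n factors cancel: same as coef(lm(Y ~ X - 1))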

Asymptotics

  • In fact, our least squares estimator is \(\beta(\hat{F}_n)\) where \(\hat{F}_n\) is the empirical distribution of our sample of \(n\) observations from \(F\).

  • As we take a larger and larger sample,

\[ \beta(\hat{F}_n) \to \beta(F) \]

and

\[ n^{1/2}(\beta(\hat{F}_n) - \beta(F)) \to N(0, \Sigma(F)) \]

for some covariance matrix \(\Sigma=\Sigma(F)\) depending only on \(F\).

  • Recall the variance of the OLS estimator (with \(X\) fixed): \[ (X^TX)^{-1} \text{Var}(X^T(Y-X\beta)) (X^TX)^{-1}. \]

  • With \(X\) random and \(n\) large this is approximately \[ \frac{1}{n} \left(E_F[\pmb{X}\pmb{X}^T] \right)^{-1} \text{Var}_F(\pmb{X} \cdot (\pmb{Y} - \pmb{X} \beta(F))) \left(E_F[\pmb{X}\pmb{X}^T] \right)^{-1}. \]
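This suggests a plug-in ("sandwich") estimate of the covariance. A sketch, again assuming X and Y exist; this is essentially the HC0 estimator that the sandwich package computes via vcovHC:

M = lm(Y ~ X - 1)
e = resid(M)                  # e_i = Y_i - x_i^T beta.hat
bread = solve(crossprod(X))   # (X^T X)^{-1}
meat = crossprod(X * e)       # sum_i e_i^2 x_i x_i^T (each row of X scaled by e_i)
sandwich_cov = bread %*% meat %*% bread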

Population least squares

  • In the usual model,

\[\text{Var}(X^T(Y-X\beta)) = \sigma^2 X^TX \approx n \cdot E_F[\pmb{X} \pmb{X}^T].\]

  • This is wrong in general!

  • We will use the OLS estimate, but correct its variance!

  • Can we get our hands on \(\text{Var}(X^T(Y-X\beta))\) or \(\text{Var}(\hat{\beta})\) without a model?

Nonparametric bootstrap in a nutshell

Basic algorithm for pairs

There are many variants of the bootstrap, most sharing roughly this structure:

B = 1000        # number of bootstrap replicates (illustrative)
n = nrow(X)
boot_sample = c()
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)   # resample rows (pairs) with replacement
    X_star = X[idx_star,]
    Y_star = Y[idx_star]
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star)))
}
cov_beta_boot = cov(boot_sample)              # bootstrap estimate of Var(beta.hat)

Residual bootstrap

  • If \(X\) is fixed, it doesn’t make sense to sample new \(X\) values for \(X^*\).

  • The residual bootstrap keeps \(X\) fixed, but resamples the residuals and adds them back to the fitted values. (This implicitly assumes the errors are exchangeable, so it handles non-Gaussianity but not heteroskedasticity.)

boot_sample = c()
M = lm(Y ~ X - 1)
beta.hat = coef(M)
X = model.matrix(M)
Y.hat = as.vector(X %*% beta.hat)   # fitted values (R uses %*%, not @)
r.hat = Y - Y.hat                   # residuals
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)   # resample residual indices
    X_star = X 
    Y_star = Y.hat + r.hat[idx_star]          # new responses: fits + resampled residuals
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star - 1)))
}
cov_beta_boot = cov(boot_sample)

Bootstrap for inference

  • The estimated covariance cov_beta_boot can be used to estimate \(\text{Var}(a^T\hat{\beta})\) for confidence intervals or general linear hypothesis tests.

  • Software typically does something slightly different, using percentiles of the bootstrap sample: bootstrap percentile intervals (both flavors are sketched below).
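For concreteness, a sketch of both flavors for a single coefficient, reusing boot_sample, cov_beta_boot, and beta.hat from the loops above (the index j and the 95% level are illustrative):

j = 1                                  # illustrative coefficient index
se_boot = sqrt(diag(cov_beta_boot))
normal_CI = beta.hat[j] + c(-1, 1) * qnorm(0.975) * se_boot[j]   # bootstrap-SE interval
percentile_CI = quantile(boot_sample[, j], c(0.025, 0.975))      # percentile interval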

Bootstrapping regression

Reference for more R examples

Using the boot package

  • The Boot function in the car package is a wrapper around the more general boot function from the boot package.
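A hedged sketch of its use (method "case" resamples pairs, method "residual" resamples residuals; the data frame and R = 1000 are illustrative, assuming vectors X and Y exist):

library(car)
df = data.frame(X = X, Y = Y)
M = lm(Y ~ X, data = df)
pairs_boot = Boot(M, method = "case", R = 1000)   # pairs bootstrap of coef(M)
confint(pairs_boot, type = "perc")                # bootstrap percentile intervals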

How is the coverage?

  • First we’ll use the standard regression model, but with errors that aren’t Gaussian.
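A hypothetical data-generating sketch (the seed, slope, and centered exponential errors are illustrative choices):

set.seed(191)
n = 100
X = rnorm(n)
Y = 2 * X + (rexp(n) - 1)   # linear mean, mean-zero but skewed errors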

Check one instance
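One way to check a single instance of the simulated data above: compare the classical interval for the slope with the bootstrap percentile interval.

df = data.frame(X = X, Y = Y)
M = lm(Y ~ X, data = df)
confint(M)                                        # classical normal-theory intervals
confint(Boot(M, method = "case"), type = "perc")  # bootstrap percentile intervals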

Check coverage
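A sketch of a coverage check: repeat the simulation, record whether each interval contains the true slope (2 here), and average. All constants are illustrative, and Boot can be scope-sensitive inside functions, so treat this as a sketch:

nsim = 500
covered = replicate(nsim, {
    df = data.frame(X = rnorm(n))
    df$Y = 2 * df$X + (rexp(n) - 1)
    CI = confint(Boot(lm(Y ~ X, data = df), method = "case", R = 500), type = "perc")["X", ]
    (CI[1] <= 2) & (2 <= CI[2])
})
mean(covered)   # should land near the nominal 95%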

Misspecified model

  • Now we generate data for which we might have used WLS, but we don’t have a model for the weights! (A sketch follows.)
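A hypothetical sketch of such data, with the error standard deviation growing with \(X\):

set.seed(191)
n = 100
X = runif(n, 1, 5)
Y = 2 * X + rnorm(n, sd = X)   # heteroskedastic: sd(epsilon_i) proportional to X_i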

Check one instance
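The single-instance check is the same shape as before: fit lm(Y ~ X) once on these data and compare confint of the fit with the pairs-bootstrap percentile interval from Boot.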

Check coverage
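The coverage loop is also the same shape. Under heteroskedasticity the classical intervals need not attain the nominal level, while the pairs bootstrap should stay close to it; a sketch for the classical side, reusing the illustrative constants above:

covered_ols = replicate(nsim, {
    X = runif(n, 1, 5)
    Y = 2 * X + rnorm(n, sd = X)
    CI = confint(lm(Y ~ X))["X", ]
    (CI[1] <= 2) & (2 <= CI[2])
})
mean(covered_ols)   # compare with the nominal 95%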