Bootstrapping a linear regression

STATS 191

2024-04-01

Bootstrapping linear regression

Figure: our multiple linear regression model

\[Y|X = X\beta + \epsilon\]

We’ve talked about checking assumptions.
What to do if the assumptions don’t hold?
We will use the bootstrap!

Random \(X\): pairs bootstrap

Suppose we think of the pairs \((X_i, Y_i)\) coming from some distribution \(F\) – this is a distribution for both the features and the outcome.
In our usual model, \(\beta\) is clearly defined. What is \(\beta\) without the model's assumptions?

Population least squares

For our distribution \(F\), we can define

\[ E_F[\pmb{X}\pmb{X}^T], \qquad E_F[\pmb{X} \cdot \pmb{Y}] \]

where \((\pmb{X}, \pmb{Y}) \sim F\) leading to

\[ \beta(F) = \left(E_F[\pmb{X}\pmb{X}^T]\right)^{-1} E_F[\pmb{X} \cdot \pmb{Y}]. \]
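
To make the plug-in idea concrete, here is a minimal sketch: replacing \(E_F\) by sample averages in \(\beta(F)\) gives the same numbers as ordinary least squares. The simulated data and the names `EXX`, `EXY`, `beta_plugin` are illustrative assumptions, not part of the lecture.

```r
set.seed(1)
n = 200
# Hypothetical data: intercept plus two features. The true relationship is
# deliberately nonlinear, so no linear model holds exactly.
X = cbind(1, rnorm(n), runif(n))
Y = sin(X[,2]) + X[,3]^2 + rnorm(n)

# Empirical versions of E_F[X X^T] and E_F[X * Y]
EXX = crossprod(X) / n        # (1/n) sum_i X_i X_i^T
EXY = crossprod(X, Y) / n     # (1/n) sum_i X_i Y_i

# Plug-in estimate: beta evaluated at the empirical distribution
beta_plugin = drop(solve(EXX, EXY))

# Ordinary least squares on the same data
beta_ols = unname(coef(lm(Y ~ X - 1)))
print(cbind(beta_plugin, beta_ols))
```

The two columns agree up to rounding, even though the linear model is wrong for these data: \(\beta(F)\) is defined by the moments of \(F\), not by a model.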

Asymptotics

In fact, our least squares estimator is \(\beta(\hat{F}_n)\) where \(\hat{F}_n\) is the empirical distribution of our sample of \(n\) observations from \(F\).
As we take a larger and larger sample,

\[ \beta(\hat{F}_n) \to \beta(F) \]

and

\[ n^{1/2}(\beta(\hat{F}_n) - \beta(F)) \to N(0, \Sigma(F)) \]

for some covariance matrix \(\Sigma=\Sigma(F)\) depending only on \(F\).

Recall the variance of the OLS estimator (with \(X\) fixed): \[ (X^TX)^{-1} \text{Var}(X^T(Y-X\beta)) (X^TX)^{-1}. \]
With \(X\) random and \(n\) large this is approximately \[ \frac{1}{n} \left(E_F[\pmb{X}\pmb{X}^T] \right)^{-1} \text{Var}_F(\pmb{X} \cdot (\pmb{Y} - \pmb{X}^T \beta(F))) \left(E_F[\pmb{X}\pmb{X}^T] \right)^{-1}. \]
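
One way to read this formula is as a recipe: replace each population quantity by its sample average. The sketch below does this on simulated data; the data-generating process and the names `A`, `V_score`, `cov_sandwich` are assumptions made for illustration. This plug-in estimate is often called a sandwich estimate.

```r
set.seed(2)
n = 500
x = runif(n, 0, 2)
# Hypothetical heteroscedastic data: the noise sd grows with x
Y = 1 + 2 * x + rnorm(n) * x
X = cbind(1, x)

M = lm(Y ~ X - 1)
r = resid(M)                       # Y_i - X_i^T beta-hat

A = crossprod(X) / n               # estimates E_F[X X^T]
S = X * r                          # row i is X_i times residual i
V_score = crossprod(S) / n         # estimates Var_F(X * (Y - X^T beta(F)))
A_inv = solve(A)
cov_sandwich = A_inv %*% V_score %*% A_inv / n

# Compare with the usual model-based covariance sigma-hat^2 (X^T X)^{-1}
se_sandwich = sqrt(diag(cov_sandwich))
se_model = unname(sqrt(diag(vcov(M))))
print(rbind(se_sandwich, se_model))
```

No centering is needed in `V_score` because the OLS residuals satisfy \(X^T(Y - X\hat{\beta}) = 0\), so the rows of `S` already average to zero.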

Population least squares

In the usual model,

\[\text{Var}(X^T(Y-X\beta)) = \sigma^2 X^TX \approx n \sigma^2 \cdot E_F[\pmb{X} \pmb{X}^T].\]

Without the usual model's assumptions (e.g. constant variance), this is wrong in general!
We will still use the OLS estimate, but correct its variance!
Can we get our hands on \(\text{Var}(X^T(Y-X\beta))\) or \(\text{Var}(\hat{\beta})\) without a model?

Nonparametric bootstrap in a nutshell

Basic algorithm for pairs

There are many variants of the bootstrap, most following roughly this structure:

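# Assumes X (n x p feature matrix), Y (length-n response), n and B (number of resamples) exist
boot_sample = c()
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)   # resample row indices with replacement
    X_star = X[idx_star,]                     # pairs (X_i, Y_i) are kept together
    Y_star = Y[idx_star]
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star)))   # refit OLS on the resample
}
cov_beta_boot = cov(boot_sample)              # bootstrap estimate of Var(beta-hat)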

Residual bootstrap

If \(X\) is fixed, it doesn’t make sense to sample new \(X\) values for \(X^*\).
Residual bootstrap keeps \(X\) fixed, but adds randomly sampled residuals to the fitted values:

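# Assumes X, Y, n and B exist, as in the pairs bootstrap
boot_sample = c()
M = lm(Y ~ X - 1)
beta.hat = coef(M)
X = model.matrix(M)
Y.hat = drop(X %*% beta.hat)    # fitted values (matrix product in R is %*%)
r.hat = Y - Y.hat               # residuals
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)
    X_star = X                  # design is held fixed
    Y_star = Y.hat + r.hat[idx_star]   # fitted values plus resampled residuals
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star - 1)))
}
cov_beta_boot = cov(boot_sample)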

Bootstrap for inference

The estimated covariance `cov_beta_boot` can be used to estimate \(\text{Var}(a^T\hat{\beta})\) for confidence intervals or general linear hypothesis tests.
Software typically does something slightly different, using percentiles of the bootstrap sample: bootstrap percentile intervals.
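
The two kinds of interval can be compared side by side. The following is a sketch under assumed simulated data (skewed errors, a single feature), reusing the pairs algorithm above; the contrast `a` and the names `ci_normal`, `ci_percentile` are illustrative.

```r
set.seed(3)
n = 100
B = 1000
X = cbind(1, rnorm(n))
Y = drop(X %*% c(1, 2)) + rexp(n) - 1   # hypothetical data, skewed errors

# Pairs bootstrap: one row of coefficients per resample
boot_sample = matrix(NA, B, ncol(X))
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)
    boot_sample[b,] = coef(lm(Y[idx_star] ~ X[idx_star,] - 1))
}

a = c(0, 1)                             # a^T beta picks out the slope
beta_hat = coef(lm(Y ~ X - 1))
est = sum(a * beta_hat)

# (1) Normal-approximation interval using the bootstrap covariance
se_boot = sqrt(drop(t(a) %*% cov(boot_sample) %*% a))
ci_normal = est + c(-1, 1) * qnorm(0.975) * se_boot

# (2) Percentile interval: quantiles of a^T beta* over the bootstrap draws
ci_percentile = quantile(drop(boot_sample %*% a), c(0.025, 0.975))

print(ci_normal)
print(ci_percentile)
```

The normal interval is symmetric about the estimate by construction; the percentile interval need not be.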

Bootstrapping regression

Reference for more R examples

Using the `boot` package

The `Boot` function in `car` is a wrapper around the more general `boot` function from the `boot` package.
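
As a rough sketch of the more general interface, the pairs bootstrap can be written directly with `boot::boot`, which expects a statistic taking the data and a vector of resampled indices. The data frame `D` and the helper `coef_stat` are hypothetical names introduced here.

```r
library(boot)

set.seed(4)
n = 100
D = data.frame(x = rnorm(n))
D$y = 1 + 2 * D$x + rt(n, df=5)     # hypothetical data, heavy-tailed errors

# boot() calls statistic(data, indices); resampling rows is the pairs bootstrap
coef_stat = function(data, idx) {
    coef(lm(y ~ x, data=data[idx,]))
}

boot_out = boot(D, coef_stat, R=500)

cov_beta_boot = cov(boot_out$t)     # boot_out$t is an R x p matrix of coefficients
ci_slope = boot.ci(boot_out, type="perc", index=2)   # percentile interval, slope
print(ci_slope$percent[4:5])        # lower and upper endpoints
```

With `car` installed, a call along the lines of `Boot(lm(y ~ x, data=D), R=500)` should produce a comparable bootstrap object without writing the statistic by hand.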

How is the coverage?

First we’ll use the standard regression model but with errors that aren’t Gaussian.

Check one instance

Check coverage
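
A coverage check can be sketched as a small simulation: generate many data sets, build a 95% percentile interval for the slope in each, and record how often the interval contains the true slope. The settings below (sample size, number of resamples, exponential errors) and the helpers `ols` and `covers_once` are assumptions chosen to keep the run short, not the lecture's own experiment.

```r
set.seed(5)
# Least squares via the normal equations (faster than lm inside a loop)
ols = function(X, Y) drop(solve(crossprod(X), crossprod(X, Y)))

# One simulated data set: does the 95% percentile interval cover the true slope?
covers_once = function(n=50, B=200, beta=c(1, 2)) {
    X = cbind(1, rnorm(n))
    Y = drop(X %*% beta) + rexp(n) - 1      # mean-zero, skewed errors
    slope_star = numeric(B)
    for (b in 1:B) {
        idx_star = sample(1:n, n, replace=TRUE)
        slope_star[b] = ols(X[idx_star,], Y[idx_star])[2]
    }
    ci = quantile(slope_star, c(0.025, 0.975))
    (ci[1] < beta[2]) && (beta[2] < ci[2])
}

nsim = 200
coverage = mean(replicate(nsim, covers_once()))
print(coverage)    # target is 0.95; Monte Carlo error is about 0.015 here
```

With only 200 simulated data sets the estimate is noisy, so small departures from 0.95 should not be over-interpreted.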

Misspecified model

Now we make data for which we might have used WLS but we don’t have a model for the weights!
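
A minimal sketch of such a setting: the error standard deviation depends on the feature, but the analyst fits plain OLS and does not know the weights. The data-generating process is an assumption for illustration; the comparison is between the model-based standard errors and those from the pairs bootstrap.

```r
set.seed(6)
n = 200
B = 500
x = runif(n, 0, 2)
X = cbind(1, x)
# Hypothetical heteroscedastic data: error sd proportional to x (treated as unknown)
Y = 1 + 2 * x + x * rnorm(n)

M = lm(Y ~ X - 1)

# Pairs bootstrap of the OLS coefficients
boot_sample = matrix(NA, B, ncol(X))
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)
    boot_sample[b,] = coef(lm(Y[idx_star] ~ X[idx_star,] - 1))
}

se_model = unname(sqrt(diag(vcov(M))))     # assumes constant variance
se_pairs = sqrt(diag(cov(boot_sample)))    # no model for the variance needed
print(rbind(se_model, se_pairs))
```

When the constant-variance assumption fails, the two rows can differ noticeably, which is the situation the pairs bootstrap is meant to handle.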
