Assumptions for simple linear regression

STATS 191

2024-04-01

Outline

  • Goodness of fit of regression: analysis of variance.

  • \(F\)-statistics.

  • Residuals.

  • Diagnostic plots.

The figure depicts the statistical model for regression:

  • First we start with \(X\), then compute the mean \(\beta_0 + \beta_1 X\), and then add the error \(\epsilon\), yielding

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Geometry of least squares

Full model

  • This is the model

\[ Y = \beta_0 \cdot 1 + \beta_1 \cdot X + \epsilon. \]

  • Its fitted values are

\[ \hat{Y} = \hat{Y}_F = \hat{\beta}_0 \cdot 1 + \hat{\beta}_1 \cdot X \]

Reduced model

  • This is the model

\[ Y = \beta_0 \cdot 1 + \epsilon \]

  • Its fitted values are

\[ \hat{Y} = \hat{Y}_R = \bar{Y} \cdot 1 \]

Regression sum of squares

  • The closer \(\hat{Y}\) is to the \({1}\) axis, the less “variation” there is along the \(X\) axis.

  • This closeness can be measured by the length of the vector \(\hat{Y}-\bar{Y} \cdot 1\).

  • Its squared length is

\[ SSR = \|\hat{Y} - \bar{Y} \cdot 1 \|^2 = \|\hat{Y}_F - \hat{Y}_R\|^2 \]

An important right triangle

  • Sides of the triangle: SSR, SSE

  • Hypotenuse: SST

Degrees of freedom in the right triangle

  • Sides of the triangle: \(SSR\) has 1 d.f., \(SSE\) has \(n-2\) d.f.

  • Hypotenuse: \(SST\) has \(n-1\) d.f.

Mean squares

  • Each sum of squares has an extra piece of information associated with it, called its degrees of freedom.

  • Roughly speaking, the degrees of freedom can be determined by dimension counting.

\[ \begin{aligned} MSE &= \frac{1}{n-2}\sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 \\ MSR &= \frac{1}{1} \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 \\ MST &= \frac{1}{n-1}\sum_{i=1}^n(Y_i - \overline{Y})^2 \\ \end{aligned} \]
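As a concrete illustration, here is a minimal sketch in R that computes these mean squares on a small synthetic dataset (the variable names and simulation settings are illustrative, not part of the notes):

set.seed(0)
n <- 50
X <- rnorm(n)
Y <- 2 + 3 * X + rnorm(n, sd = 1.5)      # synthetic data obeying the model
fit <- lm(Y ~ X)
Yhat <- fitted(fit)                      # fitted values of the full model
MSE <- sum((Y - Yhat)^2) / (n - 2)       # estimate of sigma^2
MSR <- sum((Yhat - mean(Y))^2) / 1
MST <- sum((Y - mean(Y))^2) / (n - 1)    # sample variance of Y
c(MSE = MSE, MSR = MSR, MST = MST)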

Computing degrees of freedom

  • The \(SSE\) has \(n-2\) degrees of freedom because it is the squared length of a vector that lies in \(n-2\) dimensions. To see this, note that it is perpendicular to the 2-dimensional plane formed by the \(X\) axis and the \(1\) axis.

  • The \(SST\) has \(n-1\) degrees of freedom because it is the squared length of a vector that lies in \(n-1\) dimensions. In this case, this vector is perpendicular to the \(1\) axis.

  • The \(SSR\) has 1 degree of freedom because it is the squared length of a vector that lies in the 2-dimensional plane but is perpendicular to the \(1\) axis.

A different visualization

These sums of squares can be visualized by other means as well. We will illustrate with a synthetic dataset.

SST: total sum of squares


This figure depicts the total sum of squares, \(SST\): the sum of the squared differences between the \(Y\) values and the sample mean of the \(Y\) values.

SSE: error sum of squares


This figure depicts the error sum of squares, \(SSE\): the sum of the squared differences between the \(Y\) values and the \(\hat{Y}\) values, i.e. the fitted values of the regression model.

SSR: regression sum of squares


This figure depicts the regression sum of squares, \(SSR\): the sum of the squared differences between the \(\hat{Y}\) values and the sample mean of the \(Y\) values.

Definition of \(R^2\)

As noted above, if the regression model fits very well, then \(SSR\) will be large relative to \(SST\). The \(R^2\) score is the ratio of these sums of squares: \(R^2 = SSR/SST = 1 - SSE/SST\).

Let’s verify this on the big_bang data.

Let’s verify our claim \(SST=SSE+SSR\):
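Here is a sketch of that check; the column names Velocity and Distance for the big_bang data frame are assumptions here, so substitute the actual response and predictor:

fit <- lm(Velocity ~ Distance, data = big_bang)  # assumed column names
Y <- model.response(model.frame(fit))            # observed response values
SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(Y))^2)
SST <- sum((Y - mean(Y))^2)
c(SST = SST, SSE_plus_SSR = SSE + SSR)           # the two should agree
c(R2 = SSR / SST, summary_R2 = summary(fit)$r.squared)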

\(R^2\) and correlation

Finally, \(R=\sqrt{R^2}\) is called the (absolute) correlation coefficient because it is equal to the absolute value of the sample correlation coefficient of \(X\) and \(Y\).
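A quick sketch of this identity on synthetic data (names illustrative):

set.seed(1)
X <- rnorm(40); Y <- 1 - 2 * X + rnorm(40)
fit <- lm(Y ~ X)
c(R_squared = summary(fit)$r.squared, cor_squared = cor(X, Y)^2)  # should match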

\(F\)-statistics

After a \(t\)-statistic, the next most commonly encountered statistic is a \(\chi^2\) statistic, or its closely related cousin, the \(F\) statistic.

  • A \(\chi^2_k\) random variable is the distribution of the squared length of a centered normal vector in \(k\) dimensions (proper definition needs slightly more detail).

  • Sums of squares are squared lengths!

\(F\) statistic for simple linear regression

  • Defined as

\[\begin{equation} F=\frac{SSR/1}{SSE/(n-2)} = \frac{(SST-SSE)/1}{SSE/(n-2)} = \frac{MSR}{MSE} \end{equation}\]

  • Can be thought of as a ratio of a difference in sums of squares normalized by our “best estimate” of the variance, \(MSE=\hat{\sigma}^2\).

\(F\) statistics and \(R^2\)

The \(R^2\) is also closely related to the \(F\) statistic reported as the goodness of fit in the summary of an lm fit.

Simple manipulations (dividing the numerator and the denominator by \(SST\)) yield

\[\begin{equation} F = \frac{(n-2) \cdot R^2}{1-R^2} \end{equation}\]
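A sketch of this identity, checked against the F statistic reported by summary (synthetic data; names illustrative):

set.seed(2)
n <- 30
X <- rnorm(n); Y <- 0.5 * X + rnorm(n)
fit <- lm(Y ~ X)
R2 <- summary(fit)$r.squared
c(from_R2 = (n - 2) * R2 / (1 - R2),
  from_summary = summary(fit)$fstatistic["value"])   # should agree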

\(F\) test under \(H_0\)

  • Under \(H_0:\beta_1=0\),

\[ F \sim F_{1, n-2} \]

because

\[ \begin{aligned} SSR &= \|\hat{Y} - \bar{Y} \cdot 1\|^2 \\ SSE &= \|Y - \hat{Y}\|^2 \end{aligned} \]

and from our “right triangle”, these vectors are orthogonal.

  • The null hypothesis \(H_0:\beta_1=0\) implies that \(SSR \sim \chi^2_1 \cdot \sigma^2\).

\(F\)-statistics and mean squares

  • An \(F\)-statistic is a ratio of mean squares: it has a numerator, \(N\), and a denominator, \(D\), that are independent.

  • Let

\[N \sim \frac{\chi^2_{\rm num} }{ df_{{\rm num}}}, \qquad D \sim \frac{\chi^2_{\rm den} }{ df_{{\rm den}}}\] and define

\[ F = \frac{N}{D}. \]

  • We say \(F\) has an \(F\) distribution with parameters \(df_{{\rm num}}, df_{{\rm den}}\) and write \(F \sim F_{df_{{\rm num}}, df_{{\rm den}}}\). A small simulation of this appears below.
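The following sketch draws independent chi-squared variables, forms the ratio, and compares its quantiles with qf (the degrees of freedom are chosen arbitrarily for illustration):

set.seed(3)
df_num <- 1; df_den <- 22
N <- rchisq(1e5, df_num) / df_num        # numerator: chi-squared over its d.f.
D <- rchisq(1e5, df_den) / df_den        # independent denominator
quantile(N / D, c(0.5, 0.9, 0.95))       # simulated quantiles of the ratio
qf(c(0.5, 0.9, 0.95), df_num, df_den)    # theoretical F quantiles, should be close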

Takeaway

  • \(F\) statistics are computed to test some \(H_0\).

  • When that \(H_0\) is true, the \(F\) statistic has this \(F\) distribution (with appropriate degrees of freedom).

Relation between \(F\) and \(t\) statistics.

  • If \(T \sim t_{\nu}\), then

    \[ T^2 \sim \frac{N(0,1)^2}{\chi^2_{\nu}/\nu} \sim \frac{\chi^2_1/1}{\chi^2_{\nu}/\nu}.\]

  • In other words, the square of a \(t\)-statistic is an \(F\)-statistic. Because it is always positive, an \(F\)-statistic has no direction associated with it.

  • In fact

    \[ F = \frac{MSR}{MSE} = \frac{\widehat{\beta}_1^2}{SE(\widehat{\beta}_1)^2}.\]

Verifying \(F\)-statistic calculation

Let’s check this in our example.

The \(t\)-statistic for the slope is the \(t\)-statistic for the parameter \(\beta_1\) under \(H_0:\beta_1=0\). Its value above is 6.024. If we square it, we should get approximately the same value as the \(F\)-statistic.
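A sketch of this check on synthetic data standing in for the example (names and simulation settings are illustrative):

set.seed(4)
X <- rnorm(24); Y <- 0.4 * X + rnorm(24)
fit <- lm(Y ~ X)
t_slope <- coef(summary(fit))["X", "t value"]        # slope t statistic
c(t_squared = t_slope^2,
  F_stat = summary(fit)$fstatistic["value"])         # essentially identical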

Interpretation of an \(F\)-statistic

  • In regression, the numerator is usually a difference in goodness of fit of two (nested) models.

  • The denominator is \(\hat{\sigma}^2\) – an estimate of \(\sigma^2\).

  • In our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one sample model).

  • If the \(F\) is large, it says that the bigger model explains a lot more variability in \(Y\) (relative to \(\sigma^2\)) than the smaller one.

Analysis of variance

  • The \(F\)-statistic has the form

\[ F=\frac{(SSE_R - SSE_F) / (df_R - df_F)}{SSE_F / df_F} \]
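In R, anova() applied to two nested lm fits reports exactly this statistic. A sketch on synthetic data (names illustrative):

set.seed(5)
X <- rnorm(40); Y <- 1 + 0.8 * X + rnorm(40)
reduced.lm <- lm(Y ~ 1)      # constant-mean (reduced) model
full.lm <- lm(Y ~ X)         # simple linear regression (full) model
anova(reduced.lm, full.lm)   # F = ((SSE_R - SSE_F)/(df_R - df_F)) / (SSE_F/df_F)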

Right triangle with full and reduced model: sum of squares

  • Sides of the triangle: \(SSE_R-SSE_F\), \(SSE_F\)

  • Hypotenuse: \(SSE_R\)

Right triangle with full and reduced model: degrees of freedom

  • Sides of the triangle: \(df_R-df_F\), \(df_F\)

  • Hypotenuse: \(df_R\)

The \(F\)-statistic for simple linear regression revisited

  • The null hypothesis is

\[ H_0: \text{reduced model (R) is correct}. \]

  • The usual \(\alpha\) rejection rule would be to reject \(H_0\) if \(F_{\text{obs}}\), the observed \(F\) statistic, is greater than \(F_{1,n-2,1-\alpha}\).

  • In our case, the observed \(F\) was 36.3, \(n-2=22\) and the appropriate 5% threshold is computed below to be 4.30. Therefore, we strongly reject \(H_0\).
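The threshold can be computed with qf; with 1 and 22 degrees of freedom the 95% quantile is roughly 4.30:

qf(0.95, df1 = 1, df2 = 22)   # approximately 4.30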

Case study B: breakdown time for insulating fluid

  • A designed experiment to estimate average breakdown time under different voltages.

Another model

  • There were only 7 distinct values of Voltage, so it can be treated as a category (i.e. a factor).

A different reduced & full model

Our “right triangle” again (only degrees of freedom this time):

  • Sides of the triangle: \(df_R-df_F=5\), \(df_F=69\)

  • Hypotenuse: \(df_R=74\)
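A sketch of this comparison; the data frame name fluid and the column names Time and Voltage are assumptions for illustration:

full.lm <- lm(Time ~ factor(Voltage), data = fluid)   # one mean per voltage level
reduced.lm <- lm(Time ~ Voltage, data = fluid)        # linear in Voltage
anova(reduced.lm, full.lm)    # in the notes' data: df_R - df_F = 5, df_F = 69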

Diagnostics for simple linear regression

What can go wrong?

  • Using a linear regression function can be wrong: maybe the regression function should be quadratic.

  • We assumed independent Gaussian errors with the same variance. This may be incorrect.

    1. The errors may not be normally distributed.

    2. The errors may not be independent.

    3. The errors may not have the same variance.

  • Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.

Inspecting residuals

The basic idea of most diagnostic measures is the following:

If the model is correct then residuals \(e_i = Y_i -\widehat{Y}_i, 1 \leq i \leq n\) should look like a sample of (not quite independent) \(N(0, \sigma^2)\) random variables.
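A minimal sketch of computing and plotting residuals (synthetic data; names illustrative):

set.seed(6)
X <- rnorm(50); Y <- 1 + X + rnorm(50)
fit <- lm(Y ~ X)
e <- resid(fit)                  # e_i = Y_i - Yhat_i
plot(X, e, ylab = "Residual")
abline(h = 0, lty = 2)           # residuals should scatter evenly around 0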

A poorly fitting model

Figure: \(Y\) vs. \(X\) and fitted regression line

Residuals for Anscombe’s data

Figure: residuals vs. \(X\)

Quadratic model

  • Let’s add a quadratic term to our model (a multiple linear regression model).
# fit a quadratic polynomial (degree-2, orthogonal basis) in X
quadratic.lm = lm(Y ~ poly(X, 2))

Figure: \(Y\) and fitted quadratic model vs. \(X\)

Inspecting residuals of quadratic model

The residuals of the quadratic model have no apparent pattern in them, suggesting this is a better fit than the simple linear regression model.

Figure: residuals of quadratic model vs. \(X\)

Assessing normality of errors: QQ-plot for linear model

Figure: quantiles of poorly fitting model’s residuals vs. expected Gaussian quantiles

QQ-plot for quadratic model

Figure: quantiles of quadratic model’s residuals vs. expected Gaussian quantiles

  • The qqnorm plot for the quadratic model does not look vastly different from the one for the linear model, even though the quadratic model fits much better \(\implies\) several diagnostic tools can be useful in assessing a model.
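A sketch of how such a QQ-plot is drawn, using quadratic.lm from above (any fitted lm object can be substituted):

qqnorm(resid(quadratic.lm))      # residual quantiles vs Gaussian quantiles
qqline(resid(quadratic.lm))      # points near this line suggest approximate normality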

Default plots in R

Assessing constant variance assumption

Removing an outlier

Heteroscedastic errors

When we plot the residuals against the fitted values for this model (even with the outlier removed), we see that the variance clearly depends on \(GSS\). They also do not seem symmetric around 0, so perhaps the Gaussian model is not appropriate.

Plots from lm

  • We can see some of these plots in R:
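A sketch with a synthetic fit (substitute your own fitted model):

set.seed(7)
X <- rnorm(60); Y <- 2 - X + rnorm(60)
fit <- lm(Y ~ X)
par(mfrow = c(2, 2))   # arrange the four default diagnostic plots in a grid
plot(fit)              # residuals vs fitted, QQ-plot, scale-location, residuals vs leverage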