2024-04-01
Goodness of fit of regression: analysis of variance.
\(F\)-statistics.
Residuals.
Diagnostic plots.
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
\[ Y = \beta_0 \cdot 1 + \beta_1 \cdot X + \epsilon. \]
\[ \hat{Y} = \hat{Y}_F = \hat{\beta}_0 \cdot 1 + \hat{\beta}_1 \cdot X \]
\[ Y = \beta_0 \cdot 1 + \epsilon \]
\[ \hat{Y} = \hat{Y}_R = \bar{Y} \cdot 1 \]
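In R, the full and reduced models can be fit side by side; a minimal sketch, assuming generic vectors X and Y standing in for whatever data are being modeled:

```r
# Full model: Y = beta_0 + beta_1 * X + error
full <- lm(Y ~ X)

# Reduced model: intercept only, so every fitted value is Ybar
reduced <- lm(Y ~ 1)
```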
The closer \(\hat{Y}\) is to the \({1}\) axis, the less “variation” there is along the \(X\) axis.
This closeness can be measured by the length of the vector \(\hat{Y}-\bar{Y} \cdot 1\).
Its squared length is
\[ SSR = \|\hat{Y} - \bar{Y} \cdot 1 \|^2 = \|\hat{Y}_F - \hat{Y}_R\|^2 \]
Sides of the triangle: SSR, SSE
Hypotenuse: SST
Sides of the triangle: SSR has 1 d.f., SSE has n-2 d.f.
Hypotenuse: SST has n-1 d.f.
Each sum of squares has an extra piece of information associated with it, called its degrees of freedom.
Roughly speaking, the degrees of freedom can be determined by dimension counting.
\[ \begin{aligned} MSE &= \frac{1}{n-2}\sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 \\ MSR &= \frac{1}{1} \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 \\ MST &= \frac{1}{n-1}\sum_{i=1}^n(Y_i - \overline{Y})^2 \\ \end{aligned} \]
The \(SSE\) has \(n-2\) degrees of freedom because it is the squared length of a vector that lies in \(n-2\) dimensions. To see this, note that it is perpendicular to the 2-dimensional plane formed by the \(X\) axis and the \(1\) axis.
The \(SST\) has \(n-1\) degrees of freedom because it is the squared length of a vector that lies in \(n-1\) dimensions. In this case, this vector is perpendicular to the \(1\) axis.
The \(SSR\) has 1 degree of freedom because it is the squared length of a vector that lies in the 2-dimensional plane but is perpendicular to the \(1\) axis.
These sums of squares can be visualized by other means as well. We will illustrate with a synthetic dataset.
This figure depicts the total sum of squares, \(SST\) – the sum of the squared differences between the \(Y\) values and the sample mean of the \(Y\) values.
This figure depicts the error sum of squares, \(SSE\) - the sum of the squared differences between the \(Y\) values and the \(\hat{Y}\) values, i.e. the fitted values of the regression model.
This figure depicts the regression sum of squares, \(SSR\) - the sum of the squared differences between the \(\hat{Y}\) values and the sample mean of the \(Y\) values.
As noted above, if the regression model fits very well, then \(SSR\) will be large relative to \(SST\). The \(R^2\) score is just the ratio of these sums of squares.
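Explicitly,
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \]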
Let’s verify this on the big_bang data.
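A sketch of this check; the column names y and x below are placeholders (they are not specified here) and should be replaced by the actual response and predictor in big_bang:

```r
fit <- lm(y ~ x, data = big_bang)   # 'y' and 'x' are placeholder column names

SST <- sum((big_bang$y - mean(big_bang$y))^2)   # total sum of squares
SSE <- sum(resid(fit)^2)                        # error sum of squares
SSR <- sum((fitted(fit) - mean(big_bang$y))^2)  # regression sum of squares

SSR / SST                 # R^2 from the sums of squares
summary(fit)$r.squared    # R^2 reported by lm
```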
Let’s verify our claim \(SST=SSE+SSR\):
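Continuing the sketch above (reusing SST, SSE and SSR):

```r
c(SST = SST, SSE_plus_SSR = SSE + SSR)   # the two numbers should match
all.equal(SST, SSE + SSR)                # TRUE, up to floating-point error
```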
Finally, \(R=\sqrt{R^2}\) is called the (absolute) correlation coefficient because it is equal to the absolute value of the sample correlation coefficient of \(X\) and \(Y\).
After a \(t\)-statistic, the next most commonly encountered statistic is a \(\chi^2\) statistic, or its closely related cousin, the \(F\) statistic.
A \(\chi^2_k\) random variable is distributed as the squared length of a standard normal vector in \(k\) dimensions.
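In symbols: if \(Z_1, \dots, Z_k\) are independent \(N(0,1)\) random variables, then
\[ \|Z\|^2 = \sum_{i=1}^k Z_i^2 \sim \chi^2_k. \]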
Sums of squares are squared lengths!
\[\begin{equation} F=\frac{SSR/1}{SSE/(n-2)} = \frac{(SST-SSE)/1}{SSE/(n-2)} = \frac{MSR}{MSE} \end{equation}\]
The \(R^2\) is also closely related to the \(F\) statistic reported as the goodness-of-fit statistic in the summary of an lm fit.
Simple manipulations (using \(SSR = R^2 \cdot SST\) and \(SSE = (1-R^2)\cdot SST\)) yield
\[\begin{equation} F = \frac{(n-2) \cdot R^2}{1-R^2} \end{equation}\]
\[ F \sim F_{1, n-2} \qquad \text{(under } H_0:\ \beta_1 = 0\text{)} \]
because
\[ \begin{aligned} SSR &= \|\hat{Y} - \bar{Y} \cdot 1\|^2 \\ SSE &= \|Y - \hat{Y}\|^2 \end{aligned} \]
and from our “right triangle”, these vectors are orthogonal.
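Both expressions for \(F\) can be checked numerically; a sketch reusing the fit object and sums of squares from the big_bang sketch above:

```r
n  <- nobs(fit)
R2 <- summary(fit)$r.squared

(n - 2) * R2 / (1 - R2)       # F computed from R^2
(SSR / 1) / (SSE / (n - 2))   # F computed as MSR / MSE
summary(fit)$fstatistic       # F reported by lm, with its degrees of freedom
```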
An \(F\)-statistic is a ratio of mean squares: it has a numerator, \(N\), and a denominator, \(D\), that are independent.
Let
\[N \sim \frac{\chi^2_{df_{\rm num}} }{ df_{\rm num}}, \qquad D \sim \frac{\chi^2_{df_{\rm den}} }{ df_{\rm den}}\] and define
\[ F = \frac{N}{D}. \]
\(F\) statistics are computed to test some \(H_0\).
When that \(H_0\) is true, the \(F\) statistic follows the \(F_{df_{\rm num},\, df_{\rm den}}\) distribution.
If \(T \sim t_{\nu}\), then
\[ T^2 \sim \frac{N(0,1)^2}{\chi^2_{\nu}/\nu} \sim \frac{\chi^2_1/1}{\chi^2_{\nu}/\nu}.\]
In other words, the square of a \(t_{\nu}\)-statistic is an \(F_{1,\nu}\)-statistic. Because it is always positive, an \(F\)-statistic has no direction associated with it.
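A concrete check in R, using \(\nu = 22\) (the value that appears in the example below): the squared two-sided \(t\) critical value matches the \(F_{1,\nu}\) critical value.

```r
qt(0.975, df = 22)^2          # squared two-sided 5% critical value of t_22
qf(0.95, df1 = 1, df2 = 22)   # 5% critical value of F_{1,22}; the two agree
```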
In fact
\[ F = \frac{MSR}{MSE} = \frac{\widehat{\beta}_1^2}{SE(\widehat{\beta}_1)^2}.\]
Let’s check this in our example.
The \(t\)-statistic for education reported above is the \(t\)-statistic for the parameter \(\beta_1\) under \(H_0:\beta_1=0\); its value is 6.024. If we square it, we should get approximately the \(F\)-statistic.
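A quick numerical check; the extraction of the slope row below assumes a fitted simple regression object named fit, as in the earlier sketches:

```r
6.024^2   # roughly 36.3, the reported F statistic

slope <- coef(summary(fit))[2, "Estimate"]     # estimated beta_1
se    <- coef(summary(fit))[2, "Std. Error"]   # its standard error
(slope / se)^2                                 # squared t-statistic
summary(fit)$fstatistic["value"]               # F statistic from lm
```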
In regression, the numerator is usually a difference in goodness of fit of two (nested) models.
The denominator is \(\hat{\sigma}^2\) – an estimate of \(\sigma^2\).
In our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one sample model).
If the \(F\) is large, it says that the bigger model explains a lot more variability in \(Y\) (relative to \(\sigma^2\)) than the smaller one.
\[ F=\frac{(SSE_R - SSE_F) / (df_R - df_F)}{SSE_F / df_F} \]
Sides of the triangle: \(SSE_R-SSE_F\), \(SSE_F\)
Hypotenuse: \(SSE_R\)
Sides of the triangle: \(df_R-df_F\), \(df_F\)
Hypotenuse: \(df_R\)
\[ H_0: \text{reduced model (R) is correct}. \]
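In R, this comparison of nested models is what the anova function computes when given two fitted lm objects; a sketch using the generic full and reduced fits from the first sketch above:

```r
anova(reduced, full)   # row 2: SSE_R - SSE_F, df_R - df_F, the F statistic, and its p-value
```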
The usual level-\(\alpha\) rejection rule is to reject \(H_0\) if \(F_{\text{obs}}\), the observed \(F\) statistic, is greater than \(F_{1,n-2,1-\alpha}\).
In our case, the observed \(F\) was 36.3, \(n-2=22\) and the appropriate 5% threshold is computed below to be 4.30. Therefore, we strongly reject \(H_0\).
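The threshold is the 0.95 quantile of the \(F_{1,22}\) distribution:

```r
qf(0.95, df1 = 1, df2 = 22)   # approximately 4.30
```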
Voltage can be treated as a category (i.e. a factor).
Sides of the triangle: \(df_R-df_F=5\), \(df_F=69\)
Hypotenuse: \(df_R=74\)
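A sketch of how this comparison might be set up in R; the data frame name (volt_data) and response name (time) are placeholders, and the choice of a linear-in-Voltage reduced model is an assumption made purely for illustration:

```r
# Reduced model: Voltage enters linearly (assumption for this sketch)
reduced_v <- lm(time ~ Voltage, data = volt_data)

# Full model: one mean per observed voltage level
full_v <- lm(time ~ factor(Voltage), data = volt_data)

anova(reduced_v, full_v)   # F test with df_R - df_F = 5 and df_F = 69 in this example
```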
Using a linear regression function can be wrong: maybe the regression function should be quadratic.
We assumed independent Gaussian errors with the same variance. This may be incorrect.
The errors may not be normally distributed.
The errors may not be independent.
The errors may not have the same variance.
Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.
The basic idea of most diagnostic measures is the following:
If the model is correct then residuals \(e_i = Y_i -\widehat{Y}_i, 1 \leq i \leq n\) should look like a sample of (not quite independent) \(N(0, \sigma^2)\) random variables.
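Two standard ways to look at the residuals, sketched for a generic fitted model object named fit:

```r
# Residuals vs fitted values: look for curvature or changing spread
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: look for departures from a straight line
qqnorm(resid(fit))
qqline(resid(fit))
```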
quadratic.lm = lm(Y ~ poly(X, 2))  # fit a quadratic (degree-2 polynomial) regression of Y on X
The residuals of the quadratic model have no apparent pattern in them, suggesting this is a better fit than the simple linear regression model.
The qqnorm plot does not seem vastly different \(\implies\) several diagnostic tools can be useful in assessing a model.
When we plot the residuals against the fitted values for this model (even with the outlier removed), we see that the variance clearly depends on \(GSS\). The residuals also do not seem symmetric around 0, so perhaps the Gaussian model is not appropriate.
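More generally, R's built-in plot method for lm objects produces a standard set of diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage); a sketch for a generic fitted object named fit:

```r
par(mfrow = c(2, 2))   # show the four default diagnostic plots together
plot(fit)
```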