2024-01-01
Case studies:
- Volume of left hippocampus in discordant twins (for schizophrenia)
- Difference in finches' beaks before and after drought

Descriptive statistics:
- Median
- Standard deviation

\(t\)-distributions:
- One-sample \(t\)-test
- Confidence intervals
- Two-sample \(t\)-test
Given a sample of numbers \(X=(X_1, \dots, X_n)\), the sample mean is \[ \overline{X} = \frac{1}{n} \sum_{i=1}^n X_i.\]
(Figure: Affected group of the schizophrenia study)
Given a sample of numbers \(X=(X_1, \dots, X_n)\), the sample standard deviation \(S_X\) is defined by \[ S^2_X = \frac{1}{n-1} \sum_{i=1}^n (X_i-\overline{X})^2.\]
Given a sample of numbers \(X=(X_1, \dots, X_n)\), the sample median is the middle of the sorted sample: if \(n\) is odd, it is the middle point; if \(n\) is even, it is the average of the middle two points.
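These three summaries are built into R. A minimal sketch with made-up numbers (not the study data):

```r
# A small made-up sample to illustrate the three summaries
x <- c(1.2, 0.8, 1.5, 1.1, 0.9)

mean(x)    # sample mean: sum(x) / length(x)
sd(x)      # sample standard deviation (divides by n - 1)
median(x)  # n = 5 is odd, so this is the middle value of the sorted sample
```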
Suppose we want to determine whether the volume of the left hippocampus differs between the Affected and Unaffected twins.
Let's compute the Differences between the two groups within each twin pair.
Formally, we could set up the above test as drawing from a box of differences in discordant twins' volume of left hippocampus.
We can think of the sample of differences as 15 draws at random from a large population (box) containing all possible such differences.
Under \(H_0\): the average of all possible such differences is 0.
\(H_0\): the average difference between Affected and Unaffected is 0.
The alternative hypothesis is \(H_a:\) the average difference is not zero.
We test \(H_0\) with the \(t\)-statistic \[ T = \frac{\overline{X}-0}{S_X/\sqrt{n}}. \]
The formula can be read in three parts:
- Estimating the mean: \(\overline{X}\).
- Comparing to 0: subtracting 0 in the numerator. Why 0? Because \(H_0\) says the average difference is 0.
- Converting the difference to standardized units: dividing by \(S_X/\sqrt{n}\).
The denominator above serves as an estimate of \(SD(\overline{X})\), the standard deviation of \(\overline{X}\).
The result of the two-sided test is the logical value \({\tt reject}\): if it is TRUE, then we reject \(H_0\) (that the mean is 0) at a level of 5%, while if it is FALSE we do not reject.
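A minimal R sketch of this test, assuming the 15 differences are stored in a vector named Differences (the variable name is an assumption):

```r
# Differences: the 15 within-pair differences (assumed to already exist)
n <- length(Differences)
Tstat <- (mean(Differences) - 0) / (sd(Differences) / sqrt(n))

# Two-sided test at level 5%: compare |T| with the t cutoff on n - 1 df
reject <- abs(Tstat) > qt(0.975, df = n - 1)
reject
```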
If \(H_0\) is true (\(P \in H_0\)) and some modelling assumptions hold, then
\[ P({\tt reject}) = 5\% \]
For us to believe the Type I error to be exactly 5%, we should be comfortable assuming that the distribution of Differences in the population follows a normal curve.
For us to believe the Type I error to be close to 5%, we should be comfortable assuming that the \(T\)-statistic behaves similarly to how it would if the data were drawn from a normal curve.
The \(t\)-distribution looks like a normal curve, but with heavier tails…
For a test of size \(\alpha\) we write this cutoff as \(t_{n-1,1-\alpha/2}\).
If the 5% cutoff is \(q\) for our test, then the 95% confidence interval is \[ [\bar{X} - q \cdot S_X / \sqrt{n}, \bar{X} + q \cdot S_X / \sqrt{n}] \] where we recall \(q=t_{n-1,0.975}\) with \(n=15\).
If we wanted a 90% confidence interval, we would use \(q=t_{14,0.95}\). Why? Because a 90% interval leaves 5% in each tail, so we need the 0.95 quantile.
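A sketch of the interval computation in R, under the same assumption that the vector Differences holds the data:

```r
n <- length(Differences)          # n = 15 here
q <- qt(0.975, df = n - 1)        # cutoff for a 95% interval
se <- sd(Differences) / sqrt(n)
c(mean(Differences) - q * se, mean(Differences) + q * se)

# For a 90% interval, replace q with qt(0.95, df = n - 1)
```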
A useful picture is to plot all these intervals so we can see the randomness in the intervals, while the true mean of the box is unchanged.
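One way to draw such a picture is to simulate many samples from a box with a known mean and plot the interval from each; a sketch:

```r
set.seed(1)
mu <- 0; n <- 15; nsim <- 50                     # true mean of the box is mu
plot(c(-1.5, 1.5), c(1, nsim), type = "n",
     xlab = "confidence interval", ylab = "simulation")
abline(v = mu, lty = 2)                          # the unchanged true mean
for (i in 1:nsim) {
  x <- rnorm(n, mean = mu)                       # draw a fresh sample
  q <- qt(0.975, df = n - 1)
  se <- sd(x) / sqrt(n)
  segments(mean(x) - q * se, i, mean(x) + q * se, i)  # one random interval
}
```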
(Figure: Depth stratified by Year, 1976 and 1978)
We model the two samples, one for each Year, as \[ \begin{aligned} A_i & \sim N(\mu_A, \sigma^2_A) \\ B_i & \sim N(\mu_B, \sigma^2_B) \end{aligned} \]
If variances are equal \((\sigma^2_A=\sigma^2_B)\), the pooled \(t\)-test is appropriate.
The test statistic is
\[ \begin{aligned} T &= \frac{\overline{A} - \overline{B} - 0}{S_P \sqrt{\frac{1}{89} + \frac{1}{89}}} \\ S^2_P &= \frac{88\cdot S^2_A + 88 \cdot S^2_B}{176} \end{aligned} \]
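As a sketch, the pooled statistic can be computed by hand in R, assuming vectors A and B (hypothetical names) hold the two years' depths:

```r
# A, B: the two samples of beak depths, 89 observations each (assumed names)
nA <- length(A); nB <- length(B)
Sp2   <- ((nA - 1) * var(A) + (nB - 1) * var(B)) / (nA + nB - 2)  # pooled variance
Tstat <- (mean(A) - mean(B) - 0) / sqrt(Sp2 * (1 / nA + 1 / nB))
abs(Tstat) > qt(0.975, df = nA + nB - 2)  # TRUE means reject at level 5%
```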
The parts of the \(t\)-statistic are similar to the one-sample case:
- Estimate of our parameter: \(\overline{A} - \overline{B}\), our estimate of \(\mu_A-\mu_B\).
- Compare to 0 (we are testing \(H_0:\mu_A-\mu_B=0\)).
- Convert to standard units (the formula is different, but the reason is the same).
For two-sided test at level \(\alpha=0.05\), reject if \(|T| > t_{176, 0.975}\).
Confidence interval: for example, a \(90\%\) confidence interval for \(\mu_A-\mu_B\) is
\[ \overline{A}-\overline{B} \pm S_P \cdot \sqrt{\frac{1}{89} + \frac{1}{89}} \cdot t_{176,0.95}.\]
R has a built-in function, t.test, to perform such \(t\)-tests. If we don't make the assumption of equal variance, R will give a slightly different result.
More on this in Chapter 4.
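For example, with the same assumed vectors A and B:

```r
t.test(A, B, var.equal = TRUE)   # pooled t-test (equal variances assumed)
t.test(A, B)                     # default: Welch test, no equal-variance assumption
```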
The rule for the \(SD\) of differences (for independent samples) is \[ SD(\overline{A}-\overline{B}) = \sqrt{SD(\overline{A})^2+SD(\overline{B})^2}.\]
By this rule, we might take our estimate to be \[ SE(\overline{A}-\overline{B}) = \widehat{SD(\overline{A}-\overline{B})} = \sqrt{\frac{S^2_A}{89} + \frac{S^2_B}{89}}. \]
The pooled estimate assumes \(E(S^2_A)=E(S^2_B)=\sigma^2\) and replaces the \(S^2\)'s above with \(S^2_P\), a better estimate of \(\sigma^2\) than either \(S^2_A\) or \(S^2_B\).
Therefore, the total degrees of freedom is \(88+88=176\).
We can put the two samples together: \[Y=(A_1,\dots, A_{89}, B_1, \dots, B_{89}).\]
Under the same assumptions as the pooled \(t\)-test: \[ \begin{aligned} Y_i &\sim N(\mu_i, \sigma^2)\\ \mu_i &= \begin{cases} \mu_A & 1 \leq i \leq 89 \\ \mu_B & 90 \leq i \leq 178. \end{cases} \end{aligned} \]
This is a regression model for the sample \(Y\). The (qualitative) variable Year is called a covariate or predictor. The Depth is the outcome or response.
We assume that the relationship between Depth and Year is simple: it depends only on which group a subject is in.
This relationship is modelled through the mean vector \(\mu=(\mu_1, \dots, \mu_{178})\).
In R, this regression model can be fit with lm:
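A sketch, assuming the data sit in a data frame with columns Depth and Year (the data-frame name finches is an assumption):

```r
# Year is treated as a factor (qualitative covariate), Depth as the response
fit <- lm(Depth ~ factor(Year), data = finches)
summary(fit)
```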