2024-01-01
Review++ (e.g. STATS60 and a little more)
Beyond \(t\)-tests
Multiple groups: ANOVA
Simple linear regression
Multiple linear regression
Model diagnostics
Model selection
Logistic regression
…
It is a course on applied statistics.
Hands-on: we use R, an open-source statistics software environment.
Course notes will be available as jupyter notebooks.
We will start out with a review of introductory statistics to see R in action.
The main topic is (linear) regression models: these are the bread and butter of applied statistics.
A regression model is a model of the relationships between some covariates (predictors) and an outcome.
Specifically, a regression is a model of the average outcome given (that is, having fixed) the covariates.
We will consider the heights of fathers and sons collected by Karl Pearson in the late 19th century. Perhaps the first regression model!
One of our goals is to understand the height of the son, \(S\), knowing the height of the father, \(F\).
A mathematical model might look like
\[ S = g(F) + \varepsilon\]
Above, \(g\) gives the average height of the sons of fathers of height \(F\), and \(\varepsilon\) is the error: sons whose fathers have the same height do not all have the same height themselves.
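In symbols, the two statements above can be written as follows. The notation \(E[\,\cdot \mid F]\), meaning the average over sons whose fathers have height \(F\), is an assumption introduced here for illustration; the notes have not defined it at this point.

\[ g(F) = E[S \mid F], \qquad E[\varepsilon \mid F] = E[S \mid F] - g(F) = 0. \]

So the error averages out to zero at every father's height, even though individual sons deviate from \(g(F)\).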
A statistical question: is there any relationship between covariates and outcomes – is \(g\) just a constant?
How do we find this line? With a model.
We might model the data as
\[ S = \beta_0+ \beta_1 \cdot F + \varepsilon. \]
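This model is linear in \((\beta_0, \beta_1)\), the coefficients of \(1, F\). With a single covariate \(F\) (the father’s height), it is a simple linear regression model.
We might also model the data as
\[ S = \beta_0 + \beta_1 F + \beta_2 F^2 + \varepsilon \]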
Also linear (in \((\beta_0, \beta_1, \beta_2)\), the coefficients of \(1, F, F^2\)).
Which model is better? We will need a tool to compare models… more to come later.
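As a rough sketch of how the two models above could be fit in R, the block below uses `lm()` on simulated heights. The data frame `heights_sim`, its columns `father` and `son`, and the numbers used to simulate them are all hypothetical stand-ins; they are not Pearson's data. The comparison by residual sum of squares is only one informal check, not necessarily the tool the course introduces later.

```r
# Simulated stand-in for father/son heights (inches). NOT Pearson's data:
# the means, slopes, and noise levels below are made up for illustration.
set.seed(1)
n <- 200
father <- rnorm(n, mean = 68, sd = 2.7)
son <- 34 + 0.5 * father + rnorm(n, sd = 2.4)
heights_sim <- data.frame(father = father, son = son)

# Simple linear regression: S = beta_0 + beta_1 * F + error
fit_linear <- lm(son ~ father, data = heights_sim)

# Quadratic model: S = beta_0 + beta_1 * F + beta_2 * F^2 + error.
# It is still linear in the coefficients, so lm() fits it the same way.
fit_quad <- lm(son ~ father + I(father^2), data = heights_sim)

print(coef(fit_linear))
print(coef(fit_quad))

# Residual sums of squares on the data used for fitting.
# The larger model contains the smaller one, so its RSS cannot be larger.
rss_linear <- sum(resid(fit_linear)^2)
rss_quad <- sum(resid(fit_quad)^2)
print(c(linear = rss_linear, quadratic = rss_quad))
```

Because the quadratic model always fits the observed data at least as well, a smaller residual sum of squares alone does not settle which model is better; that is why a dedicated comparison tool is needed.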
Our example here was rather simple: we had only one independent variable, \(F\).
brains
Brain: average brain weight (in grams)
Body: average body weight (in kilograms)
Gestation: gestation period (in days)
Litter: average litter size
Some of the main goals of this course:
Build a statistical model describing the effect of Gestation on Brain.
This model should recognize that other variables also affect Brain.
What sort of statistical conclusions can we make based on our model?
Is the model we choose adequate to describe this dataset?
Are there other, better models (simpler or more complicated)?
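The first two goals above might be sketched in R as a multiple linear regression. The block below is only an illustration: `brains_sim` is a hypothetical, simulated data frame that borrows the column names Brain, Body, Gestation and Litter, and every value and coefficient used to generate it is made up. The untransformed linear form is likewise an assumption, not a claim about the model the course ends up using.

```r
# Hypothetical stand-in for the brains data: all values are simulated.
set.seed(2)
n <- 60
brains_sim <- data.frame(
  Body      = exp(rnorm(n, mean = 2, sd = 1.5)),  # kilograms (made up)
  Gestation = runif(n, min = 20, max = 400),      # days (made up)
  Litter    = runif(n, min = 1, max = 8)          # litter size (made up)
)
# Simulated outcome in grams; the coefficients here are arbitrary.
brains_sim$Brain <- with(
  brains_sim,
  5 + 0.9 * Body + 0.4 * Gestation - 3 * Litter + rnorm(n, sd = 20)
)

# Model for Brain as a function of Gestation that also includes
# Body and Litter, so the Gestation coefficient is adjusted for them.
fit <- lm(Brain ~ Gestation + Body + Litter, data = brains_sim)

# Estimates, standard errors, t statistics, and p-values per coefficient
coef_table <- summary(fit)$coefficients
print(coef_table)

# 95% confidence interval for the Gestation coefficient
ci_gestation <- confint(fit, "Gestation", level = 0.95)
print(ci_gestation)
```

The coefficient table and the confidence interval are examples of the statistical conclusions a fitted model supports; whether the model is adequate, and whether a better one exists, are separate questions.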