Logistic regression for binary response

STATS 191

2024-04-01

Outline

  • Case studies:

    A. Survival in the Donner party

    B. Smoking and bird-keeping

  • Binary outcomes

  • Estimation and inference: likelihood

Case study A: Donner party

Binary outcomes

  • Most models so far have had response \(Y\) as continuous.

  • In the Donner data, the response Status is binary: Died/Survived

  • Other examples:

    1. medical: presence or absence of cancer
    2. financial: bankrupt or solvent
    3. industrial: passes a quality control test or not

Modelling probabilities

  • For \(0-1\) responses we need to model

\[\pi(x)=\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)\]

  • That is, \(Y\) is Bernoulli with a probability that depends on covariates \(\{X_1, \dots, X_p\}.\)

  • Note: \(\text{Var}(Y|X) = \pi(X) ( 1 - \pi(X)) = E(Y|X) \cdot ( 1- E(Y|X))\)

  • In other words, the binary nature of \(Y\) forces a relation between its mean and variance.

  • This makes logistic regression a Generalized Linear Model.

A possible model for Donner

  • Set Y=Status=='Survived'

  • Simple model: lm(Y ~ Age + Sex)

  • Problems / issues:

    1. We must have \(0 \leq E(Y|X) = \pi(X_1,X_2) \leq 1\). OLS will not force this (see the sketch after this list).

    2. Inference methods for OLS will not work because of the relation between the mean and variance of \(Y\). We need to change how we compute the variance!
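
A quick R sketch of issue 1 (the data frame name donner and its column names are assumptions about how the data are stored):

    donner$Y <- as.numeric(donner$Status == "Survived")  # 0-1 response
    ols <- lm(Y ~ Age + Sex, data = donner)
    range(fitted(ols))  # nothing constrains these to lie in [0, 1]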

Logistic regression

  • Set X_1=Age, X_2=Sex

  • Logistic model

\[\pi(X_1,X_2) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}\]

  • This automatically fixes \(0 \leq E(Y|X) = \pi(X_1,X_2) \leq 1\).

Logit transform

  • Definition: \(\text{logit}(\pi(X_1, X_2)) = \log\left(\frac{\pi(X_1, X_2)}{1 - \pi(X_1,X_2)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)

Logistic distribution

Logistic transform: logit
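
A minimal R sketch of the transform and its inverse (the helper names logit and logit.inv are local conventions, referred to below):

    logit <- function(p) log(p / (1 - p))           # maps (0, 1) to the real line
    logit.inv <- function(x) exp(x) / (1 + exp(x))  # maps the real line to (0, 1)
    curve(logit.inv, from = -5, to = 5)             # the logistic S-shaped curve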

Binary regression models

  • Models \(E(Y)\) as \(F(\beta_0 + \beta_1 X_1 + \beta_2 X_2)\) for some increasing function \(F\) (usually a distribution function).

  • The logistic model uses the function (we called logit.inv above)

\[F(x)=\frac{e^x}{1+e^x}.\]

  • Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares.

  • For logistic regression, coefficients have nice interpretation in terms of odds ratios (to be defined shortly).

  • What about inference?

Criterion used to fit model

Instead of sum of squares, logistic regression uses deviance:

  • \(DEV(\mu| Y) = -2 \log L(\mu| Y) + 2 \log L(Y| Y)\) where \(\mu\) is a location estimator for \(Y\).

  • If \(Y\) is Gaussian with independent \(N(\mu_i,\sigma^2)\) entries then \(DEV(\mu| Y) = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \mu_i)^2\)

  • If \(Y\) is a binary vector, with mean vector \(\pi\) then \(DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right)\)

  • Minimizing deviance \(\iff\) Maximum Likelihood
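
As a sanity check, the binary deviance can be computed by hand and compared with what R reports (a sketch; fit is any fitted binomial glm with 0-1 response y):

    pi.hat <- fitted(fit)  # fitted probabilities
    dev <- -2 * sum(y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
    all.equal(dev, deviance(fit))  # TRUE up to rounding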

Deviance for logistic regression

  • For any binary regression model, \(\pi=\pi(\beta)\).

  • Since \(\log L(Y|Y) = 0\) for binary data, the deviance is:

\[\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned}\]

  • For the logistic model, the RHS is:

\[ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\beta_0 + \sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned}\]

  • The logistic model is special in that \(\text{logit}(\pi(\beta))=X\beta\). If we used a different transformation, the first part would not be linear in \(X\beta\).

Odds Ratios

  • One reason logistic models are popular is that the parameters have simple interpretations in terms of odds

    \[ODDS(A) = \frac{P(A)}{1-P(A)}.\]

  • Logistic model:

\[OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+h, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{h \beta_j}\]

  • If \(X_j \in \{0, 1\}\) is dichotomous, then the odds for the group with \(X_j = 1\) are \(e^{\beta_j}\) times the odds for the group with \(X_j = 0\), other covariates being equal.
  • It is tempting to read an odds ratio as a probability ratio, but they are not quite the same

  • In Donner, the odds ratio for a 45 year old male compared to a 35 year old female is

\[e^{-0.7820-1.5973}=e^{-2.3793} \approx 0.0926\]

The estimated probabilities yield a ratio of \(0.1317/0.6209 \approx 0.21\), which is not the same thing.

  • When the event is rare (the rare disease assumption), odds ratios are approximately probability ratios, since then \(1 - P(A) \approx 1\); the odds ratio above is computed in R below.
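
A sketch, reusing donner and Y from above (the coefficient name SexMale depends on R's factor coding and is an assumption):

    fit1 <- glm(Y ~ Age + Sex, family = binomial, data = donner)
    b <- coef(fit1)
    exp(10 * b["Age"] + b["SexMale"])  # 45 year old male vs 35 year old female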

Inference

  • Model is fit to minimize deviance

  • Fit using IRLS (Iteratively Reweighted Least Squares) algorithm

  • General formula for \(\text{Var}(\hat{\beta})\) from IRLS.

  • Intervals formed with vcov() are called Wald intervals.

Covariance

  • The estimated covariance uses the “weights” computed from the fitted model.
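
A sketch of Wald intervals built directly from vcov() (fit1 is the Donner fit from above, but any glm works):

    b <- coef(fit1)
    se <- sqrt(diag(vcov(fit1)))  # standard errors from the IRLS covariance
    cbind(lower = b - qnorm(0.975) * se,
          upper = b + qnorm(0.975) * se)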

Confidence intervals

  • The intervals above are slightly different from what R will give you if you ask it for confidence intervals.

  • R uses so-called profile intervals.

  • For large samples the two methods should agree quite closely.
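
In R (a sketch, again with the Donner fit):

    confint(fit1)          # profile likelihood intervals, R's default
    confint.default(fit1)  # Wald intervals: estimate +/- qnorm(0.975) * SE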

Example from book

  • The book considers testing whether there is any interaction between Sex and Age in the Donner party data:
  • \(Z\)-test for testing \(H_0:\beta_{\texttt Age:SexMale}=0\):

\[ Z = \frac{0.1616-0}{0.09426} \approx 1.714 \]

  • When \(H_0\) is true, \(Z\) is approximately \(N(0,1)\).
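
A sketch of this computation from the fitted interaction model (the row name Age:SexMale is an assumption about R's factor coding):

    fit2 <- glm(Y ~ Age * Sex, family = binomial, data = donner)
    est <- summary(fit2)$coefficients["Age:SexMale", ]
    Z <- est["Estimate"] / est["Std. Error"]
    2 * pnorm(-abs(Z))  # two-sided p-value under H0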

Comparing models in logistic regression

  • For a model \({\cal M}\), \(DEV({\cal M})\) replaces \(SSE({\cal M})\).

  • In least squares regression (with \(\sigma^2\) known), we use

\[\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \overset{H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}\]

  • For large \(df_F\) this is closely related to the \(F\) test: the statistic above is approximately \((df_R-df_F) \cdot F_{df_R-df_F, df_F}\).

  • For logistic regression this difference in \(SSE\) is replaced with

\[DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty, H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}\]

  • The resulting likelihood ratio tests do not agree numerically with the Wald tests from IRLS; both are often used.

Comparing ~ Age + Sex to ~1

The corresponding \(p\)-value comes from the \(\chi^2_2\) approximation.
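
A sketch of the comparison, reusing donner and Y:

    fit0 <- glm(Y ~ 1, family = binomial, data = donner)
    fit1 <- glm(Y ~ Age + Sex, family = binomial, data = donner)
    anova(fit0, fit1, test = "Chisq")  # deviance drop on 2 df, with p-value
    pchisq(deviance(fit0) - deviance(fit1), 2, lower.tail = FALSE)  # by hand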

Comparing ~ Age + Sex to ~ Age

Again the corresponding \(p\)-value comes from the \(\chi^2\) approximation, now with one degree of freedom.
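
The same recipe, now dropping only Sex (a sketch, reusing fit1 from above):

    fitA <- glm(Y ~ Age, family = binomial, data = donner)
    anova(fitA, fit1, test = "Chisq")
    pchisq(deviance(fitA) - deviance(fit1), 1, lower.tail = FALSE)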

Let’s compare this with the Wald test:
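
The Wald test refers the \(z\) statistic for the Sex coefficient to \(N(0,1)\) (row name as above, an assumption about the coding):

    summary(fit1)$coefficients["SexMale", ]  # Estimate, Std. Error, z value, Pr(>|z|)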

Diagnostics

  • As in least squares regression, residuals are used for diagnostics; the ones usually used here are the deviance residuals

    \[r_i = \text{sign}(Y_i-\widehat{\pi}_i) \sqrt{DEV(\widehat{\pi}_i|Y_i)}.\]

  • For the Gaussian model, these agree with the usual least squares residuals (up to scaling by \(\sigma\)).
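
A sketch of the standard diagnostic plot (fit1 as above, or any binomial glm):

    r <- residuals(fit1, type = "deviance")  # the default residual type for glm
    plot(fitted(fit1), r,
         xlab = "fitted probabilities", ylab = "deviance residuals")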

Model selection

  • Models all have AIC scores \(\implies\) stepwise selection can be used easily, as sketched below!

  • Package bestglm has something like leaps functionality (not covered here)
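
A sketch of AIC-based stepwise selection, exactly as for lm:

    step(fit1, direction = "both")  # adds/drops terms to lower AIC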

Case Study B: birdkeeping and smoking

  • Outcome: LC, lung cancer status

  • Features:

    1. FM: sex of patient

    2. AG: age (years)

    3. SS: socio-economic status

    4. YR: years of smoking prior to examination / diagnosis

    5. CD: cigarettes per day

    6. BK: whether or not subjects keep birds at home

Reduced model: ~ . - BK

Further reduction: ~ FM + SS + AG + YR
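
A sketch of these fits (the data frame name birds is an assumption; LC is assumed coded 0/1 or as a two-level factor):

    full    <- glm(LC ~ ., family = binomial, data = birds)
    no.bk   <- update(full, . ~ . - BK)  # the reduced model above
    smaller <- glm(LC ~ FM + SS + AG + YR, family = binomial, data = birds)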

Effect of smoking

  • As in the salary discrimination study, the book first selects a model without BK

  • We’ll start with all possible two-way interactions

  • Let’s see the effect of BK when added to our selected model (sketched below)
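
A sketch of that last step, reusing smaller from above:

    with.bk <- update(smaller, . ~ . + BK)
    anova(smaller, with.bk, test = "Chisq")  # deviance test for adding BK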

Probit model

  • We used \(F(x)=e^x/(1+e^x)\). There are other choices…

  • Probit regression model:

\[\Phi^{-1}(E(Y|X))= \beta_0 + \sum_{j=1}^{p} \beta_j X_{j}\]

  • Above, \(\Phi\) is CDF of \(N(0,1)\), i.e. \(\Phi(t) = {\tt pnorm(t)}\), \(\Phi^{-1}(q) = {\tt qnorm}(q)\).
  • Regression function:

\[ \begin{aligned} E(Y|X) &= E(Y|X_1,\dots,X_p) \\ &= P(Y=1|X_1, \dots, X_p) \\ & = {\tt pnorm}\left(\beta_0 + \sum_{j=1}^p \beta_j X_j \right)\\ \end{aligned} \]

  • In logit, probit and cloglog \({\text{Var}}(Y|X)=\pi(X)(1-\pi(X))\) but the model for the mean is different.

  • Coefficients no longer have an odds ratio interpretation.
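
In R, changing the link is a one-argument change to glm (a sketch with the Donner variables):

    probit.fit <- glm(Y ~ Age + Sex, family = binomial(link = "probit"),
                      data = donner)
    head(fitted(probit.fit))  # pnorm applied to the linear predictor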

Generalized linear models: glm

Given a dataset \((Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n\) we consider a model for the distribution of \(Y|X_1, \dots, X_p\).

  • If \(\eta_i=g(E(Y_i|X_i)) = g(\mu_i) = \beta_0 + \sum_{j=1}^p \beta_j X_{ij}\), then \(g\) is called the link function for the model.

  • If \({\text{Var}}(Y_i) = \phi \cdot V({\mathbb{E}}(Y_i)) = \phi \cdot V(\mu_i)\) for \(\phi > 0\) and some function \(V\), then \(V\) is called the variance function for the model.

  • Canonical reference: McCullagh and Nelder, Generalized linear models.

Binary regression as GLM

  • For a logistic model, \(g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a probit model, \(g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a cloglog model, \(g(\mu)=\log(-\log(1-\mu)), \qquad V(\mu)=\mu(1-\mu).\)

  • All of these have dispersion \(\phi=1\).
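
R's family objects carry \(g\) and \(V\) directly, so these can be checked interactively (a quick sketch):

    fam <- binomial(link = "cloglog")
    fam$linkfun(0.5)   # g(0.5) = log(-log(0.5)) = -0.3665...
    fam$variance(0.5)  # V(0.5) = 0.5 * (1 - 0.5) = 0.25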