Logistic regression for binary response

STATS 191

2024-04-01

Outline

  • Case studies:

    A. Survival in the Donner party

    B. Smoking and bird-keeping

  • Binary outcomes

  • Estimation and inference: likelihood

Case study A: Donner party

Binary outcomes

  • Most models so far have had response \(Y\) as continuous.

  • In the Donner data, the response Status is binary: Died/Survived

  • Other examples:

    1. medical: presence or absence of cancer
    2. financial: bankrupt or solvent
    3. industrial: passes a quality control test or not

Modelling probabilities

  • For \(0-1\) responses we need to model

\[\pi(x)=\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)\]

  • That is, \(Y\) is Bernoulli with a probability that depends on covariates \(\{X_1, \dots, X_p\}.\)

  • Note: \(\text{Var}(Y|X) = \pi(X) ( 1 - \pi(X)) = E(Y|X) \cdot ( 1- E(Y|X))\)

  • In other words, the binary nature of \(Y\) forces a relation between its mean and variance.

  • This makes logistic regression a Generalized Linear Model.

A possible model for Donner

  • Set Y=Status=='Survived'

  • Simple model: lm(Y ~ Age + Sex)

  • Problems / issues:

    1. We must have \(0 \leq E(Y|X) = \pi(X_1,X_2) \leq 1\). OLS will not force this (see the sketch after this list).

    2. Inference methods for OLS will not work because of the relation between the mean and variance of \(Y\). We need to change how we compute the variance!
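
A quick R sketch of issue 1 (the data frame name donner and its column names are assumptions about how the data are stored):

    donner$Y <- as.numeric(donner$Status == "Survived")  # 0-1 response
    ols <- lm(Y ~ Age + Sex, data = donner)
    range(fitted(ols))  # nothing constrains these to lie in [0, 1]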

Logistic regression

  • Set X_1=Age, X_2=Sex

  • Logistic model

\[\pi(X_1,X_2) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}\]

  • This automatically fixes \(0 \leq E(Y|X) = \pi(X_1,X_2) \leq 1\).

Logit transform

  • Definition: \(\text{logit}(\pi(X_1, X_2)) = \log\left(\frac{\pi(X_1, X_2)}{1 - \pi(X_1,X_2)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)

Logistic distribution

Logistic transform: logit
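
A minimal R sketch of the transform and its inverse (the helper names logit and logit.inv are local conventions, referred to below):

    logit <- function(p) log(p / (1 - p))           # maps (0, 1) to the real line
    logit.inv <- function(x) exp(x) / (1 + exp(x))  # maps the real line to (0, 1)
    curve(logit.inv, from = -5, to = 5)             # the logistic S-shaped curve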

Binary regression models

  • Models \(E(Y)\) as \(F(\beta_0 + \beta_1 X_1 + \beta_2 X_2)\) for some increasing function \(F\) (usually a distribution function).

  • The logistic model uses the function (we called logit.inv above)

\[F(x)=\frac{e^x}{1+e^x}.\]

  • Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares.

  • For logistic regression, coefficients have nice interpretation in terms of odds ratios (to be defined shortly).

  • What about inference?

Criterion used to fit model

Instead of sum of squares, logistic regression uses deviance:

  • \(DEV(\mu| Y) = -2 \log L(\mu| Y) + 2 \log L(Y| Y)\) where \(\mu\) is a location estimator for \(Y\).

  • If \(Y\) is Gaussian with independent \(N(\mu_i,\sigma^2)\) entries then \(DEV(\mu| Y) = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \mu_i)^2\)

  • If \(Y\) is a binary vector, with mean vector \(\pi\) then \(DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right)\)

  • Minimizing deviance \(\iff\) Maximum Likelihood
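
As a sanity check, the binary deviance can be computed by hand and compared with what R reports (a sketch; fit is any fitted binomial glm with 0-1 response y):

    pi.hat <- fitted(fit)  # fitted probabilities
    dev <- -2 * sum(y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
    all.equal(dev, deviance(fit))  # TRUE up to rounding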

Deviance for logistic regression

  • For any binary regression model, \(\pi=\pi(\beta)\).

  • Since \(\log L(Y|Y) = 0\) for binary data, the deviance is:

\[\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned}\]

  • For the logistic model, the RHS is:

\[ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\beta_0 + \sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned}\]

  • The logistic model is special in that \(\text{logit}(\pi(\beta))=X\beta\). If we used a different transformation, the first part would not be linear in \(X\beta\).

Odds Ratios

  • One reason logistic models are popular is that the parameters have simple interpretations in terms of odds

    \[ODDS(A) = \frac{P(A)}{1-P(A)}.\]

  • Logistic model:

\[OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+h, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{h \beta_j}\]

  • If \(X_j \in \{0, 1\}\) is dichotomous, then the odds for the group with \(X_j = 1\) are \(e^{\beta_j}\) times the odds for the group with \(X_j = 0\), other covariates being equal.
  • It is tempting to read an odds ratio as a probability ratio, but they are not quite the same

  • In Donner, the odds ratio for a 45 year old male compared to a 35 year old female is

\[e^{-0.7820-1.5973}=e^{-2.3793} \approx 0.0926\]

The estimated probabilities yield a ratio of \(0.1317/0.6209 \approx 0.21\), which is not the same thing.

  • When the event is rare (the rare disease assumption), odds ratios are approximately probability ratios, since then \(1 - P(A) \approx 1\); the odds ratio above is computed in R below.
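
A sketch, reusing donner and Y from above (the coefficient name SexMale depends on R's factor coding and is an assumption):

    fit1 <- glm(Y ~ Age + Sex, family = binomial, data = donner)
    b <- coef(fit1)
    exp(10 * b["Age"] + b["SexMale"])  # 45 year old male vs 35 year old female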

Inference

  • Model is fit to minimize deviance

  • Fit using IRLS (Iteratively Reweighted Least Squares) algorithm

  • General formula for \(\text{Var}(\hat{\beta})\) from IRLS.

  • Intervals formed with vcov() are called Wald intervals.

Covariance

  • The estimated covariance uses the “weights” computed from the fitted model.
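
A sketch of Wald intervals built directly from vcov() (fit1 is the Donner fit from above, but any glm works):

    b <- coef(fit1)
    se <- sqrt(diag(vcov(fit1)))  # standard errors from the IRLS covariance
    cbind(lower = b - qnorm(0.975) * se,
          upper = b + qnorm(0.975) * se)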

Confidence intervals

  • The intervals above are slightly different from what R will give you if you ask it for confidence intervals.

  • R uses so-called profile intervals.

  • For large samples the two methods should agree quite closely.
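
In R (a sketch, again with the Donner fit):

    confint(fit1)          # profile likelihood intervals, R's default
    confint.default(fit1)  # Wald intervals: estimate +/- qnorm(0.975) * SE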

Example from book

  • The book considers testing whether there is any interaction between Sex and Age in the Donner party data:
  • \(Z\)-test for testing \(H_0:\beta_{\texttt Age:SexMale}=0\):

\[ Z = \frac{0.1616-0}{0.09426} \approx 1.714 \]

  • When \(H_0\) is true, \(Z\) is approximately \(N(0,1)\).
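
A sketch of this computation from the fitted interaction model (the row name Age:SexMale is an assumption about R's factor coding):

    fit2 <- glm(Y ~ Age * Sex, family = binomial, data = donner)
    est <- summary(fit2)$coefficients["Age:SexMale", ]
    Z <- est["Estimate"] / est["Std. Error"]
    2 * pnorm(-abs(Z))  # two-sided p-value under H0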

Comparing models in logistic regression

  • For a model \({\cal M}\), \(DEV({\cal M})\) replaces \(SSE({\cal M})\).

  • In least squares regression (with \(\sigma^2\) known), we use

\[\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \overset{H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}\]

  • For large \(df_F\) this is closely related to the \(F\) test: the statistic above is approximately \((df_R-df_F) \cdot F_{df_R-df_F, df_F}\).

  • For logistic regression this difference in \(SSE\) is replaced with

\[DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty, H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}\]

  • The resulting likelihood ratio tests do not agree numerically with the Wald tests from IRLS; both are often used.

Comparing ~ Age + Sex to ~1

The corresponding \(p\)-value comes from the \(\chi^2_2\) approximation.
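
A sketch of the comparison, reusing donner and Y:

    fit0 <- glm(Y ~ 1, family = binomial, data = donner)
    fit1 <- glm(Y ~ Age + Sex, family = binomial, data = donner)
    anova(fit0, fit1, test = "Chisq")  # deviance drop on 2 df, with p-value
    pchisq(deviance(fit0) - deviance(fit1), 2, lower.tail = FALSE)  # by hand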

Comparing ~ Age + Sex to ~ Age

Again the corresponding \(p\)-value comes from the \(\chi^2\) approximation, now with one degree of freedom.
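
The same recipe, now dropping only Sex (a sketch, reusing fit1 from above):

    fitA <- glm(Y ~ Age, family = binomial, data = donner)
    anova(fitA, fit1, test = "Chisq")
    pchisq(deviance(fitA) - deviance(fit1), 1, lower.tail = FALSE)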

Let’s compare this with the Wald test:
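
The Wald test refers the \(z\) statistic for the Sex coefficient to \(N(0,1)\) (row name as above, an assumption about the coding):

    summary(fit1)$coefficients["SexMale", ]  # Estimate, Std. Error, z value, Pr(>|z|)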

Diagnostics

  • As in least squares regression, residuals are used for diagnostics; the ones usually used here are the deviance residuals

    \[r_i = \text{sign}(Y_i-\widehat{\pi}_i) \sqrt{DEV(\widehat{\pi}_i|Y_i)}.\]

  • For the Gaussian model, these agree with the usual least squares residuals (up to scaling by \(\sigma\)).
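
A sketch of the standard diagnostic plot (fit1 as above, or any binomial glm):

    r <- residuals(fit1, type = "deviance")  # the default residual type for glm
    plot(fitted(fit1), r,
         xlab = "fitted probabilities", ylab = "deviance residuals")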

Model selection

  • Models all have AIC scores \(\implies\) stepwise selection can be used easily, as sketched below!

  • Package bestglm has something like leaps functionality (not covered here)
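
A sketch of AIC-based stepwise selection, exactly as for lm:

    step(fit1, direction = "both")  # adds/drops terms to lower AIC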

Case Study B: birdkeeping and smoking

  • Outcome: LC, lung cancer status

  • Features:

    1. FM: sex of patient

    2. AG: age (years)

    3. SS: socio-economic status

    4. YR: years of smoking prior to examination / diagnosis

    5. CD: cigarettes per day

    6. BK: whether or not subjects keep birds at home

Reduced model: ~ . - BK

Further reduction: ~ FM + SS + AG + YR
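
A sketch of these fits (the data frame name birds is an assumption; LC is assumed coded 0/1 or as a two-level factor):

    full    <- glm(LC ~ ., family = binomial, data = birds)
    no.bk   <- update(full, . ~ . - BK)  # the reduced model above
    smaller <- glm(LC ~ FM + SS + AG + YR, family = binomial, data = birds)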

Effect of smoking

  • As in the salary discrimination study, the book first selects a model without BK

  • We’ll start with all possible two-way interactions

  • Let’s see the effect of BK when added to our selected model (sketched below)
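
A sketch of that last step, reusing smaller from above:

    with.bk <- update(smaller, . ~ . + BK)
    anova(smaller, with.bk, test = "Chisq")  # deviance test for adding BK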

Probit model

  • We used \(F(x)=e^x/(1+e^x)\). There are other choices…

  • Probit regression model:

\[\Phi^{-1}(E(Y|X))= \beta_0 + \sum_{j=1}^{p} \beta_j X_{j}\]

  • Above, \(\Phi\) is CDF of \(N(0,1)\), i.e. \(\Phi(t) = {\tt pnorm(t)}\), \(\Phi^{-1}(q) = {\tt qnorm}(q)\).
  • Regression function:

\[ \begin{aligned} E(Y|X) &= E(Y|X_1,\dots,X_p) \\ &= P(Y=1|X_1, \dots, X_p) \\ & = {\tt pnorm}\left(\beta_0 + \sum_{j=1}^p \beta_j X_j \right)\\ \end{aligned} \]

  • In logit, probit and cloglog \({\text{Var}}(Y|X)=\pi(X)(1-\pi(X))\) but the model for the mean is different.

  • Coefficients no longer have an odds ratio interpretation.
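
In R, changing the link is a one-argument change to glm (a sketch with the Donner variables):

    probit.fit <- glm(Y ~ Age + Sex, family = binomial(link = "probit"),
                      data = donner)
    head(fitted(probit.fit))  # pnorm applied to the linear predictor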

Generalized linear models: glm

Given a dataset \((Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n\) we consider a model for the distribution of \(Y|X_1, \dots, X_p\).

  • If \(\eta_i=g(E(Y_i|X_i)) = g(\mu_i) = \beta_0 + \sum_{j=1}^p \beta_j X_{ij}\), then \(g\) is called the link function for the model.

  • If \({\text{Var}}(Y_i) = \phi \cdot V({\mathbb{E}}(Y_i)) = \phi \cdot V(\mu_i)\) for \(\phi > 0\) and some function \(V\), then \(V\) is called the variance function for the model.

  • Canonical reference: McCullagh and Nelder, Generalized linear models.

Binary regression as GLM

  • For a logistic model, \(g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a probit model, \(g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a cloglog model, \(g(\mu)=\log(-\log(1-\mu)), \qquad V(\mu)=\mu(1-\mu).\)

  • All of these have dispersion \(\phi=1\).
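
R's family objects carry \(g\) and \(V\) directly, so these can be checked interactively (a quick sketch):

    fam <- binomial(link = "cloglog")
    fam$linkfun(0.5)   # g(0.5) = log(-log(0.5)) = -0.3665...
    fam$variance(0.5)  # V(0.5) = 0.5 * (1 - 0.5) = 0.25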