Nonlinear Methods

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Basis expansions

Problem: How do we model a non-linear relationship?

Strategy:

\[Y = \beta_0 + \beta_1 f_1(X) + \beta_2 f_2(X) + \dots + \beta_d f_d(X) + \epsilon.\]

- Fit this model through least-squares regression: the \(f_j\)’s are nonlinear, but the model is linear in the coefficients!
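To make the strategy concrete, here is a minimal sketch (not from the slides) that fits a basis expansion by ordinary least squares with NumPy, using a polynomial basis \(f_j(X) = X^j\) as one possible choice of nonlinear functions; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a nonlinear mean function.
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)

# Basis expansion: here f_j(X) = X^j for j = 1, ..., d (plus an intercept column).
d = 5
F = np.column_stack([x**j for j in range(d + 1)])   # n x (d + 1) design matrix

# Least-squares fit: linear in beta even though the f_j are nonlinear in X.
beta, *_ = np.linalg.lstsq(F, y, rcond=None)

# Predictions on a grid.
grid = np.linspace(-2, 2, 100)
F_grid = np.column_stack([grid**j for j in range(d + 1)])
y_hat = F_grid @ beta
```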

Piecewise constant functions

Piecewise polynomial functions

Splines

Cubic splines

\[f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 h(X,\xi_1) + \dots + \beta_{K+3} h(X,\xi_K)\]

\[ h(x,\xi) = \begin{cases} (x-\xi)^3 \quad \text{if }x>\xi \\ 0 \quad \text{otherwise} \end{cases} \]
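As an illustration (not from the slides), the following sketch builds this truncated-power basis for a cubic spline with \(K = 3\) knots placed at quantiles of \(X\) and fits it by least squares; the data and knot locations are only for demonstration.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic spline basis: 1, x, x^2, x^3, and h(x, xi) = (x - xi)^3 for x > xi, else 0."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > xi, (x - xi)**3, 0.0) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=300))
y = np.cos(x) + 0.2 * rng.standard_normal(x.size)

knots = np.quantile(x, [0.25, 0.5, 0.75])    # K = 3 knots at quantiles of X
B = truncated_power_basis(x, knots)          # n x (K + 4) design matrix
beta, *_ = np.linalg.lstsq(B, y, rcond=None) # fit by least squares
fitted = B @ beta
```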

Natural cubic splines

Choosing the number and locations of knots

Natural cubic splines vs. polynomial regression

Smoothing splines

Find the function \(f\) which minimizes

\[\color{Blue}{\sum_{i=1}^n (y_i - f(x_i))^2} + \color{Red}{ \lambda \int f''(x)^2 dx}\]

Facts

Advanced: deriving a smoothing spline

Among all functions with prescribed values at \(x_1,\dots,x_n\), the roughness penalty

\[\int f''(x)^2 dx\]

is minimized by a natural cubic spline with knots at \(x_1,\dots,x_n\).

Hence it suffices to search over splines with knots at \(x_1,\dots,x_n\), which we can write in a basis:

\[f(x) = \beta_0 + \beta_1 f_1(x) + \dots + \beta_{n+3} f_{n+3}(x)\]

In this basis, with \(\mathbf N\) the matrix of basis functions evaluated at the inputs, the objective becomes

\[ (y - \mathbf N\beta)^T(y - \mathbf N\beta) + \lambda \beta^T \Omega_{\mathbf N}\beta, \]

where \(\Omega_{\mathbf N}(i,j) = \int N_i''(t) N_j''(t)\, dt\).

The coefficients minimizing this criterion are \(\hat \beta = (\mathbf N^T \mathbf N + \lambda \Omega_{\mathbf N})^{-1} \mathbf N^T y\), so the vector of fitted values is

\[\hat y = \underbrace{\mathbf N (\mathbf N^T \mathbf N + \lambda \Omega_{\mathbf N})^{-1} \mathbf N^T}_{\mathbf S_\lambda} y\]

Degrees of freedom

The effective degrees of freedom of the smoothing spline is the trace of the smoother matrix:

\[\text{Trace}(\mathbf S_\lambda)= \mathbf S_\lambda(1,1) + \mathbf S_\lambda(2,2) + \cdots + \mathbf S_\lambda(n,n). \]
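A small sketch of these formulas (my own, not from the slides): given a basis matrix \(\mathbf N\) and a penalty matrix \(\Omega_{\mathbf N}\), it computes \(\hat\beta\), the smoother matrix \(\mathbf S_\lambda\), and the effective degrees of freedom as its trace. The inputs here are random placeholders rather than an actual natural-spline basis and its \(\int N_i'' N_j''\) penalty.

```python
import numpy as np

def smoothing_spline_fit(N, Omega, y, lam):
    """Solve (N^T N + lam * Omega) beta = N^T y; return beta, S_lambda, and df."""
    A = N.T @ N + lam * Omega
    beta_hat = np.linalg.solve(A, N.T @ y)
    S = N @ np.linalg.solve(A, N.T)   # smoother matrix: y_hat = S @ y
    df = np.trace(S)                  # effective degrees of freedom
    return beta_hat, S, df

# Illustrative inputs (stand-ins for the natural-spline basis and its penalty matrix).
rng = np.random.default_rng(2)
n, p = 100, 10
N = rng.standard_normal((n, p))
M = rng.standard_normal((p, p))
Omega = M @ M.T                       # symmetric positive semi-definite placeholder
y = rng.standard_normal(n)

beta_hat, S, df = smoothing_spline_fit(N, Omega, y, lam=1.0)
y_hat = S @ y
```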

Natural cubic splines vs. Smoothing splines

| Natural cubic splines | Smoothing splines |
|---|---|
| Fix the number of knots \(K < n\) and their locations at quantiles of \(X\). | Put \(n\) knots at \(x_1,\dots,x_n\). |
| Find the natural cubic spline \(\hat f\) with these knots which minimizes the RSS \(\sum_{i=1}^n (y_i - f(x_i))^2\). | Find the fitted values \(\hat f(x_1),\dots,\hat f(x_n)\) through an algorithm similar to Ridge regression. |
| Choose \(K\) by cross-validation. | Choose the smoothing parameter \(\lambda\) by cross-validation. |

Choosing the regularization parameter \(\lambda\)

Local linear regression

Algorithm

To predict the regression function \(f\) at an input \(x\), weight each training point by \(K_i(x)\) and solve the weighted least-squares problem

\[\hat{\beta}(x) = \text{argmin}_{(\beta_0, \beta_1)} \sum_{i=1}^n K_i(x) (y_i - \beta_0 -\beta_1 x_i)^2.\]

The prediction is then \(\hat f(x) = \hat \beta_0(x) + \hat \beta_1(x)\, x\).

Generalized nearest neighbors

\[\hat{\beta}_0(x) = \text{argmin}_{\beta_0} \sum_{i=1}^n K_i(x)(y_i - \beta_0)^2.\]

The solution is the kernel-weighted average \(\hat \beta_0(x) = \sum_{i=1}^n K_i(x)\, y_i \big/ \sum_{i=1}^n K_i(x)\).

Gaussian (radial basis function) kernel

\[K_i(x) = \exp(-\|x-x_i\|^2/2\lambda)\]
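For concreteness, here is a sketch (not from the slides) of both local fits at a single query point: the Gaussian kernel weights, the kernel-weighted average from the generalized-nearest-neighbors criterion, and the local linear fit by weighted least squares. The data and bandwidth are illustrative.

```python
import numpy as np

def gaussian_weights(x, xs, lam):
    """K_i(x) = exp(-||x - x_i||^2 / (2 * lam))."""
    return np.exp(-(xs - x) ** 2 / (2.0 * lam))

def local_constant(x, xs, ys, lam):
    """Kernel-weighted average of the responses (generalized nearest neighbors)."""
    w = gaussian_weights(x, xs, lam)
    return np.sum(w * ys) / np.sum(w)

def local_linear(x, xs, ys, lam):
    """Weighted least squares of y on (1, x_i); predict at x."""
    w = gaussian_weights(x, xs, lam)
    X = np.column_stack([np.ones_like(xs), xs])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ ys)
    return beta[0] + beta[1] * x

rng = np.random.default_rng(3)
xs = np.sort(rng.uniform(0, 1, size=200))
ys = np.sin(4 * xs) + 0.2 * rng.standard_normal(xs.size)

print(local_constant(0.5, xs, ys, lam=0.01))
print(local_linear(0.5, xs, ys, lam=0.01))
```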

Local linear regression

Generalized Additive Models (GAMs)

\[\;\;\mathtt{wage} = \beta_0 + \beta_1\times \mathtt{year} + \beta_2\times \mathtt{age} + \beta_3 \times\mathtt{education} +\epsilon\]

\[\longrightarrow \;\;\mathtt{wage} = \beta_0 + f_1(\mathtt{year}) + f_2(\mathtt{age}) + f_3(\mathtt{education}) +\epsilon\]

Fitting a GAM

\[\mathtt{wage} = \beta_0 + f_1(\mathtt{year}) + f_2(\mathtt{age}) + f_3(\mathtt{education}) +\epsilon\]

Backfitting

  1. Keep \(\beta_0,f_2,\dots,f_p\) fixed, and fit \(f_1\) using the partial residuals as response: \[y_i - \beta_0 - f_2( x_{i2}) -\dots - f_p( x_{ip}),\]
  2. Keep \(\beta_0,f_1,f_3,\dots,f_p\) fixed, and fit \(f_2\) using the partial residuals as response: \[y_i - \beta_0 - f_1( x_{i1}) - f_3( x_{i3}) -\dots - f_p( x_{ip}),\]
  3. …
  4. Iterate
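A minimal backfitting sketch (my own, not from the slides): each \(f_j\) is refit to its partial residuals with a simple univariate kernel smoother, which stands in for whatever smoother one prefers (e.g. a smoothing spline); the functions are centered so the intercept stays identifiable.

```python
import numpy as np

def kernel_smooth(x_train, r, x_eval, lam=0.1):
    """Local-constant smoother: kernel-weighted averages of the partial residuals r."""
    W = np.exp(-(x_eval[:, None] - x_train[None, :]) ** 2 / (2.0 * lam))
    return (W @ r) / W.sum(axis=1)

def backfit_gam(X, y, n_iters=20, lam=0.1):
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))                 # f[:, j] holds f_j evaluated at the data
    for _ in range(n_iters):
        for j in range(p):
            partial = y - beta0 - f.sum(axis=1) + f[:, j]        # remove all terms except f_j
            f[:, j] = kernel_smooth(X[:, j], partial, X[:, j], lam)
            f[:, j] -= f[:, j].mean()    # center each f_j so beta0 stays identifiable
    return beta0, f

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(-1, 1, size=(n, 3))
y = 2 + np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + 0.2 * rng.standard_normal(n)
beta0, f = backfit_gam(X, y)
```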

Also works for linear regression…

  1. Initialize \(\hat{\beta}^{(0)} = 0\), say.

  2. Given \(\hat{\beta}^{(T-1)}\), choose a coordinate \(0 \leq k(T) \leq p\) and find \[ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha} \sum_{i=1}^n\left(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j:j \neq k(T)} X_{ij} \hat{\beta}^{(T-1)}_j - \alpha X_{ik(T)}\right)^2 \\ &= \frac{\sum_{i=1}^n X_{ik(T)}\left(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j: j \neq k(T)} X_{ij} \hat{\beta}^{(T-1)}_j\right)} {\sum_{i=1}^n X_{ik(T)}^2} \end{aligned} \]

  3. Set \(\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}\) except \(k(T)\) entry which we set to \(\hat{\alpha}(T)\).

  4. Iterate
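A sketch of this coordinate-descent scheme (my own, not from the slides), cycling through the coordinates in order and updating the intercept once per sweep:

```python
import numpy as np

def coordinate_descent_ols(X, y, n_sweeps=50):
    """Cyclic coordinate descent for least squares, using the update above."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_sweeps):
        beta0 = np.mean(y - X @ beta)                           # intercept coordinate
        for k in range(p):                                      # k(T): cycle through coords
            resid = y - beta0 - X @ beta + X[:, k] * beta[k]    # partial residual
            beta[k] = X[:, k] @ resid / (X[:, k] @ X[:, k])     # alpha_hat(T)
    return beta0, beta

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.standard_normal(200)
beta0, beta = coordinate_descent_ols(X, y)   # converges to the OLS solution
```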

Backfitting: coordinate descent and LASSO

  1. Initialize \(\hat{\beta}^{(0)} = 0\), say.
  2. Given \(\hat{\beta}^{(T-1)}\), choose a coordinate \(0 \leq k(T) \leq p\) and find \[ \begin{aligned} \hat{\alpha}_{\lambda}(T) &= \text{argmin}_{\alpha} \sum_{i=1}^n\left(r^{(T-1)}_{ik(T)} - \alpha X_{ik(T)}\right)^2 \\ & \qquad + \lambda \sum_{j: j \neq k(T)} |\hat{\beta}^{(T-1)}_j| + \lambda |\alpha| \end{aligned} \] with \(r^{(T-1)}_j\) the \(j\)-th partial residual at iteration \(T\) \[ r^{(T-1)}_j = Y - \hat{\beta}^{(T-1)}_0 - \sum_{l:l \neq j} X_l \hat{\beta}^{(T-1)}_l. \] Solution is a simple soft-thresholded version of previous \(\hat{\alpha}(T)\) – Very fast! Used in glmnet
  3. Set \(\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}\) except \(k(T)\) entry which we set to \(\hat{\alpha}_{\lambda}(T)\).
  4. Iterate…
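A sketch of the soft-thresholded update (my own; a simplified stand-in, not the actual glmnet implementation), for the objective \(\sum_i (y_i - \beta_0 - x_i^T\beta)^2 + \lambda\|\beta\|_1\):

```python
import numpy as np

def soft_threshold(z, t):
    """sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coordinate_descent_lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for sum_i (y_i - beta0 - x_i'beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_sweeps):
        beta0 = np.mean(y - X @ beta)                            # unpenalized intercept
        for k in range(p):
            r_k = y - beta0 - X @ beta + X[:, k] * beta[k]       # partial residual
            z = X[:, k] @ r_k
            # Minimizer of sum_i (r_ik - alpha * X_ik)^2 + lam * |alpha|:
            beta[k] = soft_threshold(z, lam / 2.0) / (X[:, k] @ X[:, k])
    return beta0, beta

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(200)
beta0, beta = coordinate_descent_lasso(X, y, lam=20.0)
```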

Backfitting with basis functions

  1. Initialize \(\hat{\beta}^{(0)} = 0\), say.
  2. Given \(\hat{\beta}^{(T-1)}\), choose a coordinate \(0 \leq k(T) \leq p\) and find \[ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha \in \mathbb{R}^{n_{k(T)}}} \\ & \qquad \sum_{i=1}^n\biggl(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j:j \neq k(T)} \sum_{l=1}^{n_j} f_{lj}(X_{ij}) \hat{\beta}^{(T-1)}_{lj} \\ & \qquad \quad - \sum_{l=1}^{n_{k(T)}} \alpha_{l} f_{lk(T)}(X_{ik(T)})\biggr)^2 \end{aligned} \]
  3. Set \(\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}\) except \(k(T)\) entries which we set to \(\hat{\alpha}(T)\). (This is blockwise coordinate descent!)
  4. Iterate…
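A sketch of the blockwise update (my own, not from the slides): each predictor gets a small basis, here a cubic polynomial basis as a placeholder, and one block of coefficients at a time is refit by least squares against the partial residuals.

```python
import numpy as np

def poly_basis(x, degree=3):
    """Illustrative per-predictor basis f_{1j}, ..., f_{n_j j}: powers of x (no intercept)."""
    return np.column_stack([x**d for d in range(1, degree + 1)])

def blockwise_backfit(X, y, degree=3, n_sweeps=20):
    n, p = X.shape
    bases = [poly_basis(X[:, j], degree) for j in range(p)]   # one basis matrix per predictor
    coefs = [np.zeros(degree) for _ in range(p)]
    beta0 = y.mean()
    for _ in range(n_sweeps):
        beta0 = np.mean(y - sum(bases[j] @ coefs[j] for j in range(p)))  # intercept update
        for k in range(p):
            fitted_others = sum(bases[j] @ coefs[j] for j in range(p) if j != k)
            partial = y - beta0 - fitted_others                # partial residual for block k
            coefs[k], *_ = np.linalg.lstsq(bases[k], partial, rcond=None)  # refit block k
    return beta0, coefs

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(300, 3))
y = 1 + np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.standard_normal(300)
beta0, coefs = blockwise_backfit(X, y)
```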

Properties

Example: Regression for Wage


Classification

We can model the log-odds in a classification problem using a GAM:

\[\log \frac{P(Y=1\mid X)}{P(Y=0\mid X)} = \beta_0 + f_1(X_1) + \dots + f_p(X_p).\]

Again fit by backfitting …

Backfitting with logistic loss

  1. Initialize \(\hat{\beta}^{(0)} = 0\), say.
  2. Given \(\hat{\beta}^{(T-1)}\), choose a coordinate \(0 \leq k(T) \leq p\) and, with \(\ell\) the logistic loss, find \[ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha \in \mathbb{R}^{n_{k(T)}}} \\ & \qquad \sum_{i=1}^n \ell\biggl(Y_i, \hat{\beta}^{(T-1)}_0 + \sum_{j:j \neq k(T)} \sum_{l=1}^{n_j} f_{lj}(X_{ij}) \hat{\beta}^{(T-1)}_{lj} \\ & \qquad \quad + \sum_{l=1}^{n_{k(T)}} \alpha_{l} f_{lk(T)}(X_{ik(T)}) \biggr) \end{aligned} \]
  3. Set \(\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}\) except \(k(T)\) entries which we set to \(\hat{\alpha}(T)\).
  4. Iterate. This works for any loss that depends on the data through a linear predictor.
  5. For GAMs, the linear predictor is \[ \beta_0 + f_1(X_1) + \dots + f_p(X_p) \]
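A sketch of this logistic-loss backfitting (my own, not from the slides): each block of basis coefficients, together with the intercept for simplicity, is refit by numerically minimizing the logistic loss with the other blocks held fixed; the polynomial basis and the generic optimizer are placeholders for whatever block solver one would actually use.

```python
import numpy as np
from scipy.optimize import minimize

def poly_basis(x, degree=3):
    """Illustrative per-predictor basis: powers of x (no intercept column)."""
    return np.column_stack([x**d for d in range(1, degree + 1)])

def logistic_loss(y, eta):
    """Negative log-likelihood for y in {0, 1} with linear predictor eta."""
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def backfit_logistic_gam(X, y, degree=3, n_sweeps=10):
    n, p = X.shape
    bases = [poly_basis(X[:, j], degree) for j in range(p)]
    coefs = [np.zeros(degree) for _ in range(p)]
    beta0 = 0.0
    for _ in range(n_sweeps):
        for k in range(p):
            eta_others = sum(bases[j] @ coefs[j] for j in range(p) if j != k)

            def block_obj(theta):
                # theta[0] is the intercept, theta[1:] the coefficients of block k.
                return logistic_loss(y, theta[0] + eta_others + bases[k] @ theta[1:])

            res = minimize(block_obj, np.concatenate([[beta0], coefs[k]]))
            beta0, coefs[k] = res.x[0], res.x[1:]
    return beta0, coefs

rng = np.random.default_rng(8)
X = rng.uniform(-1, 1, size=(400, 2))
eta = -1 + 2 * np.sin(3 * X[:, 0]) + 2 * X[:, 1] ** 2
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-eta))).astype(float)
beta0, coefs = backfit_logistic_gam(X, y)
```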

Example: Classification for Wage>250
