September 16, 2015

Iterative Methods

Iterative methods

We come back to the problem of solving a linear system \(Ax = b\).

Iterative methods seek the solution through a series of approximations. Starting from an initial guess, \(x^{(0)}\), the algorithm refines the estimate through a sequence of repeated steps

\[ x^{(0)} \rightarrow x^{(1)} \rightarrow x^{(2)} \rightarrow x^{(3)} \rightarrow \dots \]

where \(x^{(k)}\) approaches the true solution \(x\) as \(k\) grows.

We stop the process when the approximation is good enough, i.e. when the residual becomes acceptably small.


  • Iterative methods are designed to minimize the work required to find a reasonable approximation


  • The work means both:
    • the number of iteration steps
    • the work done per iteration


  • In general, these methods require a number of matrix-vector products and the solution of simple systems.

Iterative methods can be divided into:

  • (stationary) matrix-splitting methods, e.g.
    • Jacobi
    • Gauss-Seidel
    • Successive Over Relaxation (SOR)


  • (nonstationary) search methods, e.g.
    • Steepest descent
    • Conjugate gradient (CG)
    • Generalized Minimal Residual (GMRES)

Today we will discuss the stationary methods. The nonstationary methods are covered in detail in CME 302.

Matrix-splitting methods

Stationary methods are based on a clever idea: split the matrix \(A\) into two components, \(A = M - N\), where \(M\) is nonsingular.

\[ Ax = b \rightarrow (M - N)x = b \rightarrow Mx = Nx + b \]

We turn the last equation into an iteration:

\[ M x^{(k+1)} = N x^{(k)} + b, \; k = 0, 1, \dots \]

The algorithm then performs the following operations at each iteration step (a code sketch follows):

  • computing: \(c = N x^{(k)} + b\)
  • solving: \(Mx^{(k+1)} = c\)
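As an illustration, here is a minimal Python/NumPy sketch of this loop for a generic splitting; the matrices \(M\) and \(N\) and the iteration count are placeholders to be supplied by the particular method.

```python
import numpy as np

def stationary_iteration(M, N, b, x0, num_iters=100):
    """Generic splitting iteration: M x^(k+1) = N x^(k) + b."""
    x = x0.copy()
    for _ in range(num_iters):
        c = N @ x + b               # compute c = N x^(k) + b
        x = np.linalg.solve(M, c)   # solve M x^(k+1) = c
    return x
```

In practice one exploits the structure of \(M\) (diagonal or triangular) instead of calling a general solver; the specific methods below do exactly that.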

Before we discuss different stationary methods, we need to consider the following questions:

  • If the iteration method converges, does it converge to the correct solution?

  • For what choices of \(M\) and \(N\) does the iteration converge?

  • How fast does it converge?

  • How to test for convergence?

  • How to pick \(M\) optimally so the work involved is minimized?

Convergence

We show that when the stationary method converges, it converges to a true solution.

Let \(x^{(k+1)} \rightarrow x^*\) and \(x^{(k)} \rightarrow x^*\) as \(k \rightarrow \infty\). Then,

\[ \begin{aligned} Mx^{(k+1)} &= N x^{(k)} + b \\ Mx^* &= Nx^* + b \\ (M - N)x^* &= b\\ Ax^* &= b \end{aligned} \]

Thus, the iteration converges to the true solution.

How to choose \(M\) and \(N\) so that the algorithm converges?

Let us define the error at iteration \(k\): \(e^{(k)} = x^{(k)} - x\). We can then get

\[ \begin{aligned} Mx^{(k+1)} &= N x^{(k)} + b \\ Mx^{(k+1)} - Mx + Mx &= N x^{(k)} - Nx + Nx + b \\ Me^{(k+1)} &= Ne^{(k)} - Mx + Nx + b \\ Me^{(k+1)} &= Ne^{(k)} - Ax + b \\ Me^{(k+1)} &= Ne^{(k)} \end{aligned} \]

We thus have

\[ e^{(k+1)} = M^{-1}Ne^{(k)} = (M^{-1}N)^{k+1} e^{(0)} \]

and define \(G = M^{-1}N\) as the amplification matrix.

We can use the norm of \(G = M^{-1}N\) to bound the error:

\[ \|e^{(k)}\| = \|G^k e^{(0)}\| \le \|G^k\|\|e^{(0)}\| \]


We can then choose \(M\) and \(N\) so that \(\|G\| < 1\), which guarantees convergence as \(k \rightarrow \infty\).

It turns out that the weaker condition

\[ \rho(G) = \max_i |\lambda_i(G)| < 1 \]

is a necessary and sufficient condition for convergence.

[Proof] A proof can be found in Section 10.10 of Prof. Gerritsen's notes: http://margot.stanford.edu/?page_id=15

The rate of convergence is typically defined as the increase in the number of correct decimal places in the solution per iteration step.

\[ \begin{aligned} \|G\|\|e^{(k)}\| &\ge \|e^{(k+1)}\| \\ \frac{\|e^{(k)}\|}{\|e^{(k+1)}\|} &\ge \frac{1}{\|G\|}\\ \log_{10} \|e^{(k)}\| - \log_{10}\|e^{(k+1)}\| &\ge - \log_{10}\|G\| \end{aligned} \]

Thus, the increase in the number of correct decimal places when stepping from iteration \(k\) to \(k+1\) is at least \(-\log_{10}\|G\|\). For example, if \(\|G\| = 0.1\) we gain at least one correct decimal digit per iteration.

Note that we cannot use \(\|e^{(k)}\| < \epsilon\) or \(\frac{\|e^{(k)}\|}{\|x^{(k)}\|} < \epsilon\) as a stopping criterion, since we have no knowledge of the true \(x\) (and therefore of \(e^{(k)}\)).

In practice we use the residual \(r^{(k)} = b - Ax^{(k)}\) to decide whether or not we should stop the iteration process.

With a chosen tolerance \(\epsilon > 0\) we often use a relative stopping criterion and halt the algorithm once, for a large enough \(k\):

\[ \|r^{(k)}\| \le \epsilon (\|A\|\|x^{(k)}\| + \|b\|) \]

or

\[ \|r^{(k)}\| \le \epsilon \|b\| \]

if \(\|A\|\) is not available.
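As a small sketch in NumPy (assuming the 2-norm and using the Frobenius norm as a convenient stand-in for \(\|A\|\)), the test could look like:

```python
import numpy as np

def converged(A, b, x, eps):
    """Relative residual test: ||r|| <= eps * (||A|| ||x|| + ||b||)."""
    r = b - A @ x
    norm_A = np.linalg.norm(A, 'fro')   # stand-in for ||A||; drop this term if ||A|| is unavailable
    return np.linalg.norm(r) <= eps * (norm_A * np.linalg.norm(x) + np.linalg.norm(b))
```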

Having discussed the convergence criteria, we now know that a good choice of \(M\) should be such that:

  • \(M\) should be nonsingular (otherwise we can't solve for the update)

  • \(\|M^{-1}N\| < 1\) or, more leniently, \(\rho(M^{-1}N) < 1\)

  • \(M\) is simple enough that solving \(Mx^{(k+1)} = c\) is easy

  • The resulting iteration converges fast.


Now we are ready to go over examples of the matrix-splitting (stationary) methods…

Jacobi Method

We observe that every square matrix is a sum of three matrices, \(A = D + L + U\),

where \(D\) is the diagonal part, and \(L\) and \(U\) are the strictly lower- and upper-triangular parts of \(A\), respectively (not containing the diagonal).

The Jacobi method sets \(M = D\) and \(N = - (L + U)\) which means \(M\) is very easy to invert.

The updating step is then simply:

\[ \begin{aligned} Mx^{(k+1)} &= Nx^{(k)} + b\\ Dx^{(k+1)} &= -(L+U)x^{(k)} + b \\ x^{(k+1)} &= D^{-1}(b -(L+U)x^{(k)}) \end{aligned} \]

Note that \(D^{-1} = \text{diag}(\frac{1}{a_{11}}, \dots, \frac{1}{a_{nn}})\) and one can write the update

\[x^{(k+1)} = D^{-1}(b - (L+U)x^{(k)})\]

in an entry-wise form:

\[ x^{(k+1)}_i = \frac{1}{a_{ii}} \left ( b_i - \sum_{j \neq i} a_{ij} x_{j}^{(k)}\right ) \]
which means that the Jacobi method is highly parallelizable.
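A minimal NumPy sketch of the Jacobi update (the stopping test and iteration cap are illustrative choices, not part of the method itself):

```python
import numpy as np

def jacobi(A, b, x0, num_iters=100, eps=1e-8):
    """Jacobi iteration: x^(k+1) = D^{-1} (b - (L+U) x^(k))."""
    d = np.diag(A)              # diagonal entries a_ii
    R = A - np.diag(d)          # off-diagonal part L + U
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        x = (b - R @ x) / d     # every component updated independently (parallelizable)
        if np.linalg.norm(b - A @ x) <= eps * np.linalg.norm(b):
            break
    return x
```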

Now we need to see when the Jacobi method converges: \[ G = M^{-1}N = - D^{-1}(L+U) = -D^{-1}(A - D) = I - D^{-1}A \]

This matrix has zeros on the diagonal and its off-diagonal entries are \(-a_{ij}/ a_{ii}\). Now let's consider the maximum norm of \(G\):

\[ \|G\|_{\infty} = \max_{1 \le i \le n} \sum_{j=1}^n |g_{ij}| = \max_{1 \le i \le n} \sum_{j \ne i} \frac{|a_{ij}|}{|a_{ii}|} \]

This means that for a diagonally dominant matrix, i.e. for \(A\) such that \(|a_{ii}| > \sum_{j \ne i} |a_{ij}|\) for \(i = 1, \dots, n\),

we have \(\sum_{j \ne i} \frac{|a_{ij}|}{|a_{ii}|} < 1\) for every \(i\), so \(\|G\|_{\infty} < 1\). Convergence of the Jacobi method is therefore guaranteed for all diagonally dominant matrices.

Gauss-Seidel

In the Gauss-Seidel method we choose \(M = D + L\) and \(N = -U\).

Since \(M\) is no longer diagonal but lower-triangular, we cannot separate the updates to each component, and so we lose the parallelism.

However, the update step \[(D+L)x^{(k+1)} = -Ux^{(k)}+b,\] is still relatively easy and can be done using forward substitution.
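A component-wise NumPy sketch of this sweep, with the forward substitution written out explicitly (an illustration under the same stopping test as before):

```python
import numpy as np

def gauss_seidel(A, b, x0, num_iters=100, eps=1e-8):
    """Gauss-Seidel: (D + L) x^(k+1) = b - U x^(k), solved by a forward sweep."""
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        for i in range(n):
            # components 0..i-1 already hold their new values
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(b - A @ x) <= eps * np.linalg.norm(b):
            break
    return x
```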

The hope is that the extra expense of solving for the update is offset by a higher convergence rate (fewer required iterations), so that the total cost ends up lower than that of the Jacobi method.

Gauss-Seidel can be shown to converge for diagonally dominant and symmetric positive definite matrices.

Successive Over Relaxation

The SOR method is slightly more creative in splitting up the matrix \(A\).

It picks

\[M = \alpha D + L, \text{ and } N = -(1-\alpha )D - U, \quad \alpha \in \mathbb{R}\]

The idea is to tune the free parameter \(\alpha\) for an improved convergence rate.


Since the spectral radius \(\rho(G)\) is a good indication of the convergence rate, we choose \(\alpha\) so that the largest-in-magnitude eigenvalue, \(\lambda_1\), of \(G = M^{-1}N\) is minimized.
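One simple (if brute-force) way to do this numerically is to scan a grid of \(\alpha\) values and keep the one with the smallest spectral radius; a sketch under the splitting above, with an arbitrarily chosen grid:

```python
import numpy as np

def best_sor_alpha(A, alphas=np.linspace(0.55, 2.0, 146)):
    """Return (alpha, rho) minimizing the spectral radius of G = M^{-1} N."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)
    U = np.triu(A, k=1)
    best = (None, np.inf)
    for a in alphas:
        M = a * D + L
        N = -(1.0 - a) * D - U          # note M - N = D + L + U = A
        G = np.linalg.solve(M, N)       # amplification matrix M^{-1} N
        rho = max(abs(np.linalg.eigvals(G)))
        if rho < best[1]:
            best = (a, rho)
    return best
```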

Exercises

  1. Give an example of a \(2 \times 2\) matrix for which the Jacobi method does not converge.

  2. Show that the Jacobi and Gauss-Seidel iterations converge when solving \(Ax = b\) with

\(A = \begin{pmatrix} 2 & 1\\ 1 & 3 \end{pmatrix}\)

Which one is faster?

Answers:

Give an example of a \(2 \times 2\) matrix for which the Jacobi method does not converge.

Pick a matrix that is not diagonally dominant, e.g. \(A = \begin{pmatrix} -1 & 2\\ 2 & -1 \end{pmatrix}\). Then \[ G = -D^{-1}(L+U) =\begin{pmatrix} -1 & 0\\ 0 & -1 \end{pmatrix}^{-1}\begin{pmatrix} 0 & -2\\ -2 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 2\\ 2 & 0 \end{pmatrix} \] The spectral radius of the amplification matrix is \(\rho(G) = |\lambda_1| = 2 > 1\), so the Jacobi method does not converge.

Answers:

Show that Jacobi and Gauss-Seidel iterations converge for \(A = \begin{pmatrix} 2 & 1\\ 1 & 3 \end{pmatrix}\)


\[ G_{Jacobi} = -D^{-1}(L+U) = \begin{pmatrix} 2 & 0\\ 0 & 3 \end{pmatrix}^{-1}\begin{pmatrix} 0 & -1\\ -1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -1/2\\ -1/3 & 0 \end{pmatrix} \]

\[ \rho (G_{Jacobi}) \le \| G_{Jacobi} \|_F \approx 0.60 < 1 \]

\[ G_{GS} = -(D+L)^{-1}U = \begin{pmatrix} 2 & 0\\ 1 & 3 \end{pmatrix}^{-1}\begin{pmatrix} 0 & -1\\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -1/2\\ 0 & 1/6 \end{pmatrix} \]

\[\rho (G_{GS}) \le \|G_{GS}\|_F \approx 0.53 < 1 \]

Both methods converge, but Gauss-Seidel does so in fewer iterations.
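The Frobenius norms above are only upper bounds; the actual spectral radii can be checked numerically, for example with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
D = np.diag(np.diag(A))
L = np.tril(A, k=-1)
U = np.triu(A, k=1)

G_jacobi = -np.linalg.solve(D, L + U)   # Jacobi amplification matrix
G_gs = -np.linalg.solve(D + L, U)       # Gauss-Seidel amplification matrix

print(max(abs(np.linalg.eigvals(G_jacobi))))  # 1/sqrt(6) ~ 0.41
print(max(abs(np.linalg.eigvals(G_gs))))      # 1/6 ~ 0.17
```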

Least Squares Methods

Over-determined systems

So far we have focused on linear systems, \(Ax = b\), that have a solution.

But what if \(b \notin \mathcal{R}(A)\) i.e. there is no exact solution?

  • this might happen if the matrix \(A \in \mathbb{C}^{m \times n}\), \(m > n\), is tall and skinny (i.e. the system is over-determined).

Note that since in that case \(\text{rank}(A) \le n < m\), many vectors \(b \in \mathbb{C}^{m}\) might not be in \(\mathcal{R}(A)\).

Because there is no exact solution, we want to look for 'the closest' one. How do we define closeness?

A natural way is to use a norm of the residual, \(r = b - Ax\), i.e. to measure \(\|r\| = \|b - Ax\|\).

Least Squares

What norms can we use?

  • we need a norm that is easily differentiable (since we need to deal with minimization)

The least squares problem uses the 2-norm to define the closest solution:

\[ x_{LS} = \text{arg min}_{x} \|b - Ax\|^2_2 = \text{arg min}_{x} \|r\|^2_2 \]

Note that, according to the fundamental theorem of linear algebra, every vector \(b \in \mathbb{C}^{m}\) can be decomposed into two parts

\[ b = b_1 + b_2 \]

where \(b_1 \in \mathcal{R}(A)\) and \(b_2 \in \mathcal{N}(A^*)\)

Since \(b_2 \perp \mathcal{R}(A)\) we can obtain the following formula for the residual norm:

\[ \|r\|_2^2 = \|b_1 + b_2 - Ax\|^2_2 = \|b_1 - Ax\|_2^2 + \|b_2\|_2^2 \]

This is minimized when \(Ax^* = b_1\) and \(r = b_2 \in \mathcal{N}(A^*)\).

So, the residual of the solution to the least squares problem will always be perpendicular to the range of \(A\):

\[ r = (b - Ax^*) \perp \mathcal{R}(A)\]

That is, \(A^*r = A^*b - A^*Ax^* = 0\).



Normal Equations

We want to minimize \(\|r\|_2^2\) which is equal to:

\[ \begin{aligned} \|b - Ax\|^2_2 &= (b - Ax)^*(b - Ax) \\ &= b^*b - 2x^*A^*b + x^*A^*Ax \end{aligned} \]

Taking the derivative and setting it to zero to find the minimizer we get \(-2A^*b + 2A^*Ax = 0\), which simplifies to

\[ A^*Ax = A^*b, \]

which is called the normal equation.

If \(A\) has full column rank, \(\text{rank}(A) = n\), then \(A^*A\) is invertible, and we can obtain a unique solution to the normal equation:

\[ x^* = (A^*A)^{-1}A^*b \]

Orthogonal projection onto \(\mathcal{R}(A)\)

If \(x^*\) is the least squares solution to \(Ax = b\), then \(Ax^*\) is the orthogonal projection of \(b\) onto \(\mathcal{R}(A)\)

\[ \|b - Ax^*\|_2 \le \|b - Ax\|_2 \quad \forall x \in \mathbb{C}^n, \qquad Ax^* = P_Ab, \] since the projection is the closest vector in the subspace.

Combining the formula for the least squares \(x^* = (A^*A)^{-1}A^*b\) solution and \(Ax^* = P_Ab\) we get the matrix for the orthogonal projection onto \(\mathcal{R}(A)\):

\[ P_A = A(A^*A)^{-1}A^* \]

Exercise

Show \(P = A(A^*A)^{-1}A^*\) is indeed an orthogonal projection. Onto what does this matrix project?

\[ \begin{aligned} P^2 &= (A(A^*A)^{-1}A^*)(A(A^*A)^{-1}A^*) \\ &= A(A^*A)^{-1}((A^*A)(A^*A)^{-1})A^* = A(A^*A)^{-1}A^* \\ &= P \end{aligned} \]

\[ P^* = \left(A(A^*A)^{-1}A^*\right)^* = A\left((A^*A)^*\right)^{-1}A^* = A(A^*A)^{-1}A^* = P \]

(As stated in the previous slide) \(P\) projects onto the range of \(A\).
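A quick numerical sanity check of both properties, using a random tall matrix as a stand-in (the dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))         # tall matrix, full column rank with probability 1
P = A @ np.linalg.solve(A.T @ A, A.T)   # P = A (A^* A)^{-1} A^* (real case)

print(np.allclose(P @ P, P))   # idempotent: P^2 = P
print(np.allclose(P, P.T))     # self-adjoint: P = P^*
print(np.allclose(P @ A, A))   # P leaves the range of A fixed
```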

Solving Least Squares Problem

We assume that \(A\) has full column rank; otherwise the problem is ill-posed.

We have a formula for the solution \(x^* = (A^*A)^{-1}A^*b\), but in practice we do not compute \((A^*A)^{-1}\) directly.

Instead, least squares methods use one of the following:

  • Cholesky factorization of \(A^*A\)
  • QR factorization of \(A\)
  • SVD of \(A\)

Least Squares with Cholesky

Use the Cholesky factorization of \(A^*A\) together with the normal equation: \(LL^* x = A^*Ax = A^*b = y\).

Algorithm:

  • Compute the matrix \(A^*A\) and the vector \(y = A^*b\)

  • Perform Cholesky factorization \(A^*A = LL^*\) (possible since \(A^*A\) is HPD)

  • Solve the lower-triangular system \(Lw = y\) for \(w\) using forward substitution

  • Solve the upper-triangular system \(L^*x = w\) for \(x\) using back substitution
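A minimal sketch of this algorithm in Python, assuming \(A\) has full column rank (SciPy's `cholesky` and `solve_triangular` stand in for the factorization and the two triangular solves):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def least_squares_cholesky(A, b):
    """Solve min_x ||b - A x||_2 via the normal equations and Cholesky."""
    AtA = A.conj().T @ A                  # A* A (HPD when A has full column rank)
    y = A.conj().T @ b                    # y = A* b
    L = cholesky(AtA, lower=True)         # A* A = L L*
    w = solve_triangular(L, y, lower=True)               # forward substitution: L w = y
    return solve_triangular(L.conj().T, w, lower=False)  # back substitution: L* x = w
```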

Least Squares with QR

We use the QR factorization \(A = QR\):

\[ \begin{aligned} A^*Ax &= A^*b \\ R^*Q^*QRx &= R^*Q^*b \\ R^*Rx &= R^*Q^*b \\ Rx &= Q^*b \end{aligned} \]

Algorithm:

  • Compute the QR factorization of \(A\)

  • Form \(y = Q^*b\)

  • Solve \(Rx = y\) by back-substitution
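A minimal sketch, again assuming full column rank, using the reduced QR factorization:

```python
import numpy as np
from scipy.linalg import solve_triangular

def least_squares_qr(A, b):
    """Solve min_x ||b - A x||_2 via A = QR."""
    Q, R = np.linalg.qr(A, mode='reduced')  # Q: m x n, orthonormal columns; R: n x n upper triangular
    y = Q.conj().T @ b                      # y = Q* b
    return solve_triangular(R, y, lower=False)  # back substitution: R x = y
```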

Least Squares with SVD

We use the SVD of the matrix \(A = U\Sigma V^*\) and observe that:

\[ \begin{aligned} P_A &= A(A^*A)^{-1}A^* \\ &= U\Sigma V^*(V\Sigma ^2 V^*)^{-1} V\Sigma U^*\\ &= U \Sigma V^* V \Sigma^{-2} V^* V \Sigma U^*\\ &= UU^* \end{aligned} \]

We use the fact that \(Ax^*\) is the orthogonal projection of \(b\) onto the range of \(A\):

\[Ax^* = P_Ab \rightarrow U \Sigma V^*x^* = U U^* b\]

which gives us an equation:

\[ \Sigma V^* x = U^*b \]

Keeping the equation \(\Sigma V^* x = U^*b\) in mind, we proceed with the following algorithm.


Algorithm:

  • Compute the SVD of \(A,\) \(A = U \Sigma V^*\)

  • Form the vector \(y = U^*b\)

  • Solve \(\Sigma w = y\) for \(w\)

  • Set \(x = Vw\)
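A minimal sketch using the thin SVD (assuming all singular values are nonzero, i.e. full column rank):

```python
import numpy as np

def least_squares_svd(A, b):
    """Solve min_x ||b - A x||_2 via the thin SVD A = U Sigma V*."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)  # U: m x n, s: singular values, Vh = V*
    y = U.conj().T @ b      # y = U* b
    w = y / s               # solve the diagonal system Sigma w = y
    return Vh.conj().T @ w  # x = V w
```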

Thanks to

I would like to thank the following people whose notes I heavily used when preparing this slide deck.

  1. Margot
  2. Victor
  3. Karianne
  4. Anil