Conjugate Gradient Method

Minimization

The conjugate gradient method is the most popular iterative method for solving large systems of linear equations. It is effective for systems of the form $Ax = b$ where $x$ is an unknown vector, $b$ is a known vector, and $A$ is a known, square, symmetric, positive-definite matrix. Systems of this form arise frequently when finding the minimum of a cost function $f(x)$. Since $A$ is symmetric and positive-definite, $f(x)$ is shaped like a paraboloid bowl, so it has a minimum as mentioned in this note. $A$ might instead be singular, in which case the solution is not unique; the set of solutions is a line or hyperplane on which $f$ has a uniform value. If $A$ is indefinite, the gradient descent and conjugate gradient methods will likely fail.
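For reference, the connection to minimization is the usual quadratic form (the constant $c$ does not affect the minimizer):

$$\begin{aligned} f(x) = \frac{1}{2} x^t A x - b^t x + c, \qquad \nabla f(x) = Ax - b \end{aligned}$$

so setting the gradient to zero recovers exactly $Ax = b$.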

Eigenvectors & Eigenvalues

Consider an eigenvector $v$ and an eigenvalue $\lambda$ of a matrix $B$ such that $Bv = \lambda v$. Iterative methods often depend on applying $B$ to a vector over and over again. When $B$ is repeatedly applied to an eigenvector $v$, one of two things can happen. If $\vert \lambda \vert < 1$, then $B^i v = \lambda^i v$ will vanish as $i$ approaches infinity. If $\vert \lambda \vert > 1$, then $B^i v$ will grow to infinity. Each time $B$ is applied, the vector grows or shrinks according to the value of $\vert \lambda \vert$.
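As a quick illustration (a made-up $2 \times 2$ example, not from the original text), repeatedly applying a matrix to an eigenvector scales its norm by $\vert \lambda \vert$ each time:

```python
import numpy as np

# A symmetric matrix with eigenvalues 0.5 and 2.0 (eigenvectors along the axes).
B = np.diag([0.5, 2.0])
v = np.array([1.0, 0.0])   # eigenvector for lambda = 0.5

x = v.copy()
for i in range(5):
    x = B @ x
    print(i + 1, np.linalg.norm(x))   # the norm shrinks by a factor of 0.5 each time
```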

If $B$ is symmetric, then eigenvectors corresponding to distinct eigenvalues are perpendicular, and $B$ has a full set of $n$ orthogonal eigenvectors that form a basis for $\mathbb{R}^n$. As such, any $n$-dimensional vector can be expressed as a linear combination of these eigenvectors. Applying $B$ to a vector $x$ that is not an eigenvector is equivalent to applying $B$ to each eigenvector component and summing the results.
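Concretely, expanding $x$ in the eigenvector basis makes the effect of repeated application explicit:

$$\begin{aligned} x = \sum_{k} c_k v_k \implies B^i x = \sum_{k} c_k \lambda_k^i v_k \end{aligned}$$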

Conjugate Gradient Method

The conjugate gradient method is especially suitable for very large problems since it does not require explicit second derivatives. Indeed, it does not even store an approximation to the Hessian matrix. While the gradient descent method tends to search in the same directions repeatedly, leading to very slow convergence, the conjugate gradient method avoids this by modifying the new gradient at each step to remove components along previous search directions. The resulting sequence of conjugate search directions implicitly accumulates information about the Hessian matrix as iterations proceed. In contrast, the gradient descent method follows a zigzag path, which appears because each gradient is orthogonal to the previous gradient.

[Figure: orthogonal search directions]

Here is the idea behind this method. Suppose that $d_0, d_1, \cdots, d_{n-1}$ is a set of orthogonal search directions, each of which is used for exactly one step. Let $x_1, \cdots, x_n$ be the estimated solutions at each step starting from the initial guess $x_0$. Also, let the error at step $i$ be $e_i = x_i - x$, which indicates how far the estimate is from the solution $x$. Then, $e_{i+1}$ should be orthogonal to $d_i$ so that $d_i$ never needs to be used again. Using this fact, $x_i$ can be updated once per iteration along with $\alpha_i$, a line search parameter that indicates how big a step to take.

$$\begin{aligned} &x_{i+1} = x_i + \alpha_i d_i \\ \implies &d_i^t e_{i+1} = d_i^t (x_{i+1} - x) = d_i^t (x_i + \alpha_i d_i - x) = d_i^t (e_i + \alpha_i d_i) = 0 \\ \implies &\alpha_i = -\cfrac{d_i^t e_i}{d_i^t d_i} \end{aligned}$$

However, there is a contradiction: $\alpha_i$ requires $e_i$ to update $x_i$, but $e_i$ requires knowing the solution $x$. The workaround is to make the search directions mutually $A$-orthogonal instead. Vectors $y$ and $z$ are called $A$-orthogonal, or conjugate, if $y^t A z = 0$. That is, $y$ is orthogonal to $z$ after $z$ is transformed by $A$. As shown below, the main idea is to stretch the ellipses so that they appear circular. In this perspective, $e_{i+1}$ should be $A$-orthogonal to $d_i$ this time.

[Figure: stretching the ellipses so that A-orthogonal directions appear orthogonal]

Considering that the residual $r_i = b - Ax_i = Ax - Ax_i = -Ae_i$ indicates how far the estimate is from the correct value of $b$,

$$\begin{aligned} d_i^t A e_{i+1} &= d_i^t A (x_{i+1} - x) = d_i^t A(x_i + \alpha_i d_i - x) = d_i^t A(e_i + \alpha_i d_i) \\ &= d_i^t A e_i + \alpha_i d_i^t A d_i = -d_i^t r_i + \alpha_i d_i^t A d_i = 0 \\ &\implies \alpha_i = \cfrac{d_i^t r_i}{d_i^t A d_i} \end{aligned}$$

Gram-Schmidt Conjugation

Then how can the search directions be obtained? The conjugate Gram-Schmidt process makes it simple to generate them. Recall that the Gram-Schmidt process is a method of constructing an orthonormal basis from a linearly independent set of vectors. More specifically, $d_i$ can be constructed from a set of $n$ linearly independent vectors $u_0, \cdots, u_{n-1}$ as follows.

$$\begin{aligned} d_0 &= u_0, \\ d_1 &= u_1 - (u_1 \cdot d_0)d_0, \\ d_2 &= u_2 - (u_2 \cdot d_0)d_0 - (u_2 \cdot d_1)d_1, \\ &\vdots \\ d_i &= u_i - \sum_{k=0}^{i-1} (u_i \cdot d_k)d_k \end{aligned}$$

Note that these vectors should also be normalized, and the coefficients above assume each $d_k$ is already a unit vector. But how is this formula derived? It is based on orthogonality (the last equality below uses $d_j \cdot d_j = 1$).

$$\begin{aligned} &d_i = u_i + \sum_{k=0}^{i-1} \beta_{ik} d_k \\ \implies &0 = d_i \cdot d_j = u_i \cdot d_j + \sum_{k=0}^{i-1} \beta_{ik} (d_k \cdot d_j) = u_i \cdot d_j + \beta_{ij} (d_j \cdot d_j) = u_i \cdot d_j + \beta_{ij} \quad (i > j)\\ \implies &\beta_{ij} = -u_i \cdot d_j \end{aligned}$$
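As a rough sketch (function and variable names are mine, not from the text), the classical Gram-Schmidt process can be written as:

```python
import numpy as np

def gram_schmidt(U):
    """Orthonormalize the columns of U with classical Gram-Schmidt."""
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for i in range(n):
        d = U[:, i].copy()
        # Remove components along all previously constructed directions.
        for k in range(i):
            d -= (U[:, i] @ D[:, k]) * D[:, k]
        D[:, i] = d / np.linalg.norm(d)   # normalize so that d_k . d_k = 1
    return D
```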

The same logic is used for the conjugate Gram-Schmidt process, this time based on $A$-orthogonality.

$$\begin{aligned} &d_i = u_i + \sum_{k=0}^{i-1} \beta_{ik} d_k \\ \implies &0 = d_i^t A d_j = u_i^t A d_j + \sum_{k=0}^{i-1} \beta_{ik} d_k^t A d_j = u_i^t A d_j + \beta_{ij} d_j^t A d_j \quad (i > j)\\ \implies &\beta_{ij} = -\cfrac{u_i^t A d_j}{d_j^t A d_j} \end{aligned}$$
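The $A$-orthogonal version is a direct translation; here no normalization is needed because $\beta_{ij}$ already carries the scaling (again an illustrative sketch):

```python
import numpy as np

def conjugate_gram_schmidt(U, A):
    """Make the columns of U mutually A-orthogonal (conjugate)."""
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for i in range(n):
        d = U[:, i].copy()
        for k in range(i):
            # beta_ik = -(u_i^t A d_k) / (d_k^t A d_k)
            beta = -(U[:, i] @ A @ D[:, k]) / (D[:, k] @ A @ D[:, k])
            d += beta * D[:, k]
        D[:, i] = d
    return D
```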

Then how can the set $\{ u_0, \cdots, u_{n-1} \}$ be selected? It is possible by setting $u_i = r_i$, that is, by conjugation of the residuals. This choice makes sense because the residual has the nice property that it is orthogonal to the previous search directions. In other words, it is guaranteed to always produce a new, linearly independent search direction unless the residual is zero, in which case the problem is already solved.

$$\begin{aligned} d_j^t r_i &= -d_j^t Ae_i = -d_j^t A(x_i - x) = -d_j^t A(x_{i-1} + \alpha_{i-1} d_{i-1} - x) \\ &= -d_j^t A(e_{i-1} + \alpha_{i-1} d_{i-1}) = d_j^t r_{i-1} - \alpha_{i-1} d_j^t A d_{i-1} \\ &= \cdots = d_j^t r_{j} - \sum_{k=j}^{i-1} \alpha_{k} d_j^t A d_{k} = d_j^t r_{j} - \alpha_{j} d_j^t A d_{j} = d_j^t r_{j} - d_j^t r_{j} = 0 \quad (i > j) \end{aligned}$$

Note that the residual satisfies the recurrence $r_i = r_{i-1} - \alpha_{i-1} A d_{i-1}$, which was used above. Moreover, since each residual is orthogonal to the previous search directions, it is also orthogonal to the previous residuals; that is, $r_i^t r_j = 0$ when $i \ne j$.

$$\begin{aligned} d_j = u_j + \sum_{k=0}^{j-1} \beta_{jk} d_k \implies 0 = d_j^t r_i = u_j^t r_i + \sum_{k=0}^{j-1} \beta_{jk} d_k^t r_i = u_j^t r_i = r_j^t r_i \quad (i > j,\ u_j = r_j) \end{aligned}$$

As another important property, the following identity simplifies the formula for $\alpha_i$.

$$\begin{aligned} d_i^t r_i = u_i^t r_i + \sum_{k=0}^{i-1} \beta_{ik} d_k^t r_i = u_i^t r_i = r_i^t r_i \implies \alpha_i = \cfrac{d_i^t r_i}{d_i^t A d_i} = \cfrac{r_i^t r_i}{d_i^t A d_i} \end{aligned}$$

Only New Search Direction

Since the search directions can be obtained from the conjugate Gram-Schmidt process, the final update rule can be set up as well. From the recurrence of the residual checked previously, $r_{j+1} = r_j - \alpha_j A d_j$,

$$\begin{aligned} r_i^t r_{j+1} &= r_i^t r_j - \alpha_j r_i^t A d_j \implies \alpha_j r_i^t A d_j = r_i^t r_j - r_i^t r_{j+1} \\ r_i^t A d_j &= \begin{cases} \; \cfrac{r_i^t r_i}{\alpha_i} & (i = j) \\ \; -\cfrac{r_i^t r_i}{\alpha_{i-1}} & (i = j + 1) \\ \; 0 & \text{otherwise} \end{cases} \\ \beta_{ij} &= -\cfrac{u_i^t A d_j}{d_j^t A d_j} = -\cfrac{r_i^t A d_j}{d_j^t A d_j} = \begin{cases} \; \cfrac{1}{\alpha_{i-1}} \cfrac{r_i^t r_i}{d_{i-1}^t A d_{i-1}} & (i = j + 1) \\ \; 0 & (i > j + 1) \end{cases} \end{aligned}$$

Note that $\beta_{ij}$ is defined only for $i > j$. This result shows that most of the $\beta_{ij}$ terms have disappeared, so it is no longer necessary to store old search directions to ensure the $A$-orthogonality of new search directions. This major property reduces the space and time complexity per iteration to $O(m)$, where $m$ is the number of nonzero entries of $A$. Simplifying $\beta_{i, i-1}$ further,

$$\begin{aligned} \alpha_i = \cfrac{r_i^t r_i}{d_i^t A d_i} \implies \beta_{i, i-1} = \cfrac{1}{\alpha_{i-1}} \cfrac{r_i^t r_i}{d_{i-1}^t A d_{i-1}} = \cfrac{r_i^t r_i}{r_{i-1}^t r_{i-1}} \end{aligned}$$

As a result, the conjugate gradient method can be summed up as follows.

$$\begin{aligned} \begin{cases} \; d_0 = r_0 = b - A x_0 \\ \; \alpha_i = \cfrac{r_i^t r_i}{d_i^t A d_i} \\ \; x_{i+1} = x_i + \alpha_i d_i \\ \; r_{i+1} = r_i - \alpha_i A d_i \\ \; \beta_{i+1, i} = \cfrac{r_{i+1}^t r_{i+1}}{r_{i}^t r_{i}} \\ \; d_{i+1} = r_{i+1} + \beta_{i+1, i} d_i \end{cases} \end{aligned}$$

Then how to choose a starting point, and when to stop? If a rough estimate of the value of $x$ is available, it can be used as the starting point $x_0$. If not, set $x_0 = \mathbf{0}$. This will eventually converge when used to solve linear systems. Nonlinear minimization is trickier, though, because there may be several local minima. It is customary to stop when the norm of the residual falls below a specified threshold, such as $\Vert r_i \Vert < \epsilon \Vert r_0 \Vert$.
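Putting the recurrences and the stopping criterion together, a minimal NumPy sketch might look like the following (names, defaults, and the tolerance handling are illustrative, not a production implementation):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-8, max_iter=None):
    """Solve Ax = b for symmetric positive-definite A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                      # r_0 = b - A x_0
    d = r.copy()                       # d_0 = r_0
    rr = r @ r
    tol = eps * np.sqrt(rr)            # stop when ||r_i|| < eps ||r_0||
    for _ in range(max_iter or n):
        Ad = A @ d
        alpha = rr / (d @ Ad)          # alpha_i = r_i^t r_i / d_i^t A d_i
        x += alpha * d                 # x_{i+1} = x_i + alpha_i d_i
        r -= alpha * Ad                # r_{i+1} = r_i - alpha_i A d_i
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr             # beta_{i+1,i} = r_{i+1}^t r_{i+1} / r_i^t r_i
        d = r + beta * d               # d_{i+1} = r_{i+1} + beta_{i+1,i} d_i
        rr = rr_new
    return x
```

For a symmetric positive-definite $A$, this should reproduce `np.linalg.solve(A, b)` up to the chosen tolerance.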

Preconditioning

The conjugate gradient method can still converge very slowly if $A$ is ill-conditioned. Preconditioning is a technique for improving the condition number of a matrix. Suppose that $M$ is a symmetric, positive-definite matrix that approximates $A$ but is easier to invert. Then, $Ax = b$ can be solved indirectly by solving $M^{-1}Ax = M^{-1}b$ because $M^{-1}A$ is relatively well-conditioned. If the eigenvalues of $M^{-1}A$ are better clustered than those of $A$, this system can be solved iteratively more quickly than the original problem. However, $M^{-1}A$ is generally neither symmetric nor definite, even if $M$ and $A$ are. To avoid this difficulty, $E$ can be obtained such that $EE^t = M$ by Cholesky factorization since $M$ is positive-definite. Then, $M^{-1}A$ and $E^{-1}AE^{-t}$ have the same eigenvalues. Let $v$ be an eigenvector of $M^{-1}A$ with eigenvalue $\lambda$.

$$\begin{aligned} (E^{-1}AE^{-t})(E^t v) = E^{-1} Av = (E^t E^{-t})E^{-1} Av = E^t M^{-1} Av = \lambda E^t v \end{aligned}$$

With this fact, the system $Ax = b$ can be transformed into

$$\begin{aligned} Ax = b \implies E^{-1}Ax = E^{-1}b \implies (E^{-1}AE^{-t})E^t x = E^{-1}b \implies (E^{-1}AE^{-t})\hat{x} = E^{-1}b \end{aligned}$$

So, $x$ can be obtained by first solving for $\hat{x} = E^t x$. Note that $E^{-1}AE^{-t}$ is symmetric and positive-definite. The transformed preconditioned conjugate gradient method then works with the following quantities.

$$\begin{aligned} \hat{A} = E^{-1}AE^{-t}, \quad \hat{b} &= E^{-1}b, \quad \hat{x} = E^t x \\ \hat{r}_0 = E^{-1}b - (E^{-1}AE^{-t})E^t x_0 &= E^{-1}(b - Ax_0) \implies \hat{r}_0 = E^{-1}r_0 \\ \hat{d}_{i}^t \hat{A} \hat{d}_j = \hat{d}_{i}^t E^{-1}AE^{-t} \hat{d}_j = 0 &\implies d_i = E^{-t} \hat{d}_i \implies \hat{d}_i = E^t d_i \end{aligned}$$

The difficulty is that $E$ must be computed. However, there is a way to avoid this computation. Considering that $\hat{r}_i^t \hat{r}_i = r_i^t E^{-t}E^{-1} r_i = r_i^t M^{-1} r_i$ since $E^{-t}E^{-1} = M^{-1}$, the untransformed preconditioned conjugate gradient method is finally written as follows.

$$\begin{aligned} \begin{cases} \; r_0 = b - A x_0 \\ \; \hat{d}_0 = \hat{r}_0 \implies d_0 = E^{-t}E^{-1} r_0 = M^{-1} r_0 \\ \; \alpha_i = \cfrac{\hat{r}_i^t \hat{r}_i}{\hat{d}_i^t \hat{A} \hat{d}_i} = \cfrac{r_i^t M^{-1} r_i}{d_i^t E E^{-1} A E^{-t} E^t d_i} = \cfrac{r_i^t M^{-1} r_i}{d_i^t A d_i} \\ \; x_{i+1} = x_i + \alpha_i d_i \\ \; r_{i+1} = r_i - \alpha_i A d_i \\ \; \beta_{i+1, i} = \cfrac{\hat{r}_{i+1}^t \hat{r}_{i+1}}{\hat{r}_{i}^t \hat{r}_{i}} = \cfrac{r_{i+1}^t M^{-1} r_{i+1}}{r_{i}^t M^{-1} r_{i}} \\ \; d_{i+1} = M^{-1} r_{i+1} + \beta_{i+1, i} d_i \end{cases} \end{aligned}$$
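A corresponding sketch of the untransformed preconditioned method, where the preconditioner is supplied as a callable `M_solve` applying $M^{-1}$ (all names are illustrative):

```python
import numpy as np

def preconditioned_cg(A, b, M_solve, x0=None, eps=1e-8, max_iter=None):
    """Solve Ax = b with CG, preconditioned by a callable applying M^{-1}."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = M_solve(r)                 # z_i = M^{-1} r_i
    d = z.copy()                   # d_0 = M^{-1} r_0
    rz = r @ z
    tol = eps * np.linalg.norm(r)
    for _ in range(max_iter or n):
        Ad = A @ d
        alpha = rz / (d @ Ad)      # alpha_i = r_i^t M^{-1} r_i / d_i^t A d_i
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) < tol:
            break
        z = M_solve(r)
        rz_new = r @ z
        beta = rz_new / rz         # beta = r_{i+1}^t M^{-1} r_{i+1} / r_i^t M^{-1} r_i
        d = z + beta * d           # d_{i+1} = M^{-1} r_{i+1} + beta d_i
        rz = rz_new
    return x
```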

As shown above, $E$ disappears and only $M^{-1}$ is required. The effectiveness of a preconditioner $M$ is determined by the condition number of $M^{-1}A$, and occasionally by the clustering of its eigenvalues. The problem remains of finding a preconditioner that approximates $A$ well enough to improve convergence enough to make up for the cost of computing $M^{-1}r_i$ once per iteration. Intuitively, preconditioning is an attempt to stretch the quadratic form to make it appear more spherical so that the eigenvalues are close to each other. A perfect preconditioner is $M = A$, for which $M^{-1}A$ has a condition number of one and the quadratic form is perfectly spherical, so the solution takes only one iteration.

The simplest preconditioner is a diagonal matrix whose diagonal entries are identical to those of $A$, known as diagonal preconditioning or Jacobi preconditioning, which amounts to scaling the quadratic form along the coordinate axes. By comparison, the perfect preconditioner $M = A$ scales the quadratic form along its eigenvector axes.
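For example, with the `preconditioned_cg` sketch above, Jacobi preconditioning only requires an elementwise division by the diagonal of $A$ (a hypothetical usage example on made-up data):

```python
import numpy as np

# A small SPD test system (illustrative data).
rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 50.0 * np.eye(50)
b = rng.standard_normal(50)

# Jacobi preconditioning: M = diag(A), so applying M^{-1} is an elementwise division.
diag_A = np.diag(A)
x = preconditioned_cg(A, b, M_solve=lambda r: r / diag_A)
print(np.linalg.norm(A @ x - b))   # residual norm should be tiny
```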

A more elaborate preconditioner is incomplete Cholesky preconditioning. While the Cholesky factorization finds $L$ such that $A = LL^t$, incomplete Cholesky factorization finds $\hat{L}$ such that $A \approx \hat{L}\hat{L}^t$, where $\hat{L}$ might be restricted to have the same pattern of nonzero elements as $A$ while the other elements of $L$ are thrown away. To use $\hat{L}\hat{L}^t$ as a preconditioner, the solution to $\hat{L}\hat{L}^t w = z$ is computed by forward and back substitution. Unfortunately, incomplete Cholesky preconditioning is not always stable.

Normal Equation

What if $A$ is not square, not symmetric, or not positive-definite? In this case, the normal equation can be applied, which means that the solution is found from the linear system $A^tAx = A^tb$. If $A$ is square and nonsingular, the solution is the same as that of $Ax = b$. If $A$ is not square and $Ax = b$ is overconstrained, then there may or may not be a solution to $Ax = b$, but it is always possible to find the $x$ closest to a solution in the least-squares sense. Unless $Ax = b$ is underconstrained, $A^t A$ is symmetric and positive-definite, hence nonsingular, and the conjugate gradient method can be used to solve the normal equation. However, note that the condition number of $A^t A$ is the square of that of $A$, so convergence can be significantly slower.
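As a sketch reusing the `conjugate_gradient` function above (sizes and data are made up), the normal equation can be solved directly; in practice $A^tA$ would be applied as $A^t(Av)$ rather than formed explicitly:

```python
import numpy as np

# An overconstrained system: tall A, so Ax = b generally has no exact solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)

# Solve the normal equation A^t A x = A^t b with the CG sketch above.
x = conjugate_gradient(A.T @ A, A.T @ b)

# Compare against the least-squares solution.
print(np.abs(x - np.linalg.lstsq(A, b, rcond=None)[0]).max())   # should be tiny
```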

Nonlinear Conjugate Gradient Method

As explained in this note, suppose that $f$ is a function of $x \in \mathbb{R}^n$ and its Hessian matrix $H_f$ is positive-definite. By Taylor's theorem,

$$\begin{aligned} f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s \end{aligned}$$

That is, $x$ should be moved by $\Vert s \Vert$ in the direction of $s$. However, in the conjugate gradient method the direction to move in is already fixed as $d$. As such, what remains to be determined is how far $x$ should move; more specifically, $\alpha$ such that $s = \alpha d$ should be selected. To find $\alpha$, the approximation above should be differentiated with respect to $\alpha$, not $s$, unlike the original process mentioned in this note.

$$\begin{aligned} \alpha d^t H_f(x) d = - \nabla f(x)^t d \implies \alpha = -\cfrac{\nabla f(x)^t d}{d^t H_f(x) d} \end{aligned}$$
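A small sketch of this step-size rule, assuming the gradient and Hessian are available as callables (all names and data are illustrative); on a quadratic $f$ the local model is exact, so this amounts to an exact line search:

```python
import numpy as np

def newton_step_size(grad, hess, x, d):
    """Step size minimizing the local quadratic model of f along direction d."""
    g = grad(x)
    H = hess(x)
    return -(g @ d) / (d @ H @ d)    # alpha = -grad(x)^t d / (d^t H_f(x) d)

# Example on a quadratic f(x) = 1/2 x^t A x - b^t x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x0 = np.zeros(2)
d0 = -grad(x0)                       # first search direction: steepest descent
alpha0 = newton_step_size(grad, hess, x0, d0)
print(x0 + alpha0 * d0)
```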


