Normal Equation

For an m×nm \times n matrix AA and a vector bRmb \in \mathbb{R}^m, the linear system Ax=bAx = b is to find the optimal solution xRnx \in \mathbb{R}^n.

The Residual Minimization

The problem Ax=bAx = b can be replaced by the problem minimizing the residual, bAx22\left\| b - Ax \right\|_2^2. Let AiA_i be the ii-th row vector of AA. Then the cost function f(x)f(x) can be defined as follows. f(x)=bAx22=(b1A1txbmAmtx)22=k(bkAktx)2\begin{aligned} f(x) &= \left\| b - Ax \right\|_2^2 = \left\| \left(\begin{array}{c} b_1 - A^t_1 x \\ \vdots \\ b_m - A^t_m x \end{array}\right) \right\|_2^2 = \sum_{k} (b_k - A^t_k x)^2 \end{aligned}

A necessary condition for a minimum is that xx be a critical point of ff, where the gradient vector f(x)\nabla f(x) is zero. Let aija_{ij} be the (i,j)(i, j) element of AA. Then, fxi=k2(bkAktx)xi(bkAktx)=2k(bkAktx)(aki)=0    kakiAktx=kakibk=(a1iami)(A1tAmt)x=(a1iami)(b1bm)    (a11am1a1namn)(A1tAmt)x=(a11am1a1namn)(b1bm)    AtAx=Atb\begin{aligned} \frac{\partial f}{\partial x_i} &= \sum_{k} 2(b_k - A^t_k x) \frac{\partial}{\partial x_i} (b_k - A^t_k x) = 2 \sum_{k} (b_k - A^t_k x) (- a_{ki}) = 0 \\ &\implies \sum_{k} a_{ki} A^t_k x = \sum_{k} a_{ki} b_k = \left(\begin{array}{ccc} a_{1i} & \cdots & a_{mi} \end{array}\right) \left(\begin{array}{c} A^t_1 \\ \vdots \\ A^t_m \end{array}\right) x = \left(\begin{array}{ccc} a_{1i} & \cdots & a_{mi} \end{array}\right) \left(\begin{array}{c} b_1 \\ \vdots \\ b_m \end{array}\right) \\ &\implies \left(\begin{array}{ccc} a_{11} & \cdots & a_{m1} \\ & \vdots & \\ a_{1n} & \cdots & a_{mn} \end{array}\right) \left(\begin{array}{c} A^t_1 \\ \vdots \\ A^t_m \end{array}\right) x = \left(\begin{array}{ccc} a_{11} & \cdots & a_{m1} \\ & \vdots & \\ a_{1n} & \cdots & a_{mn} \end{array}\right) \left(\begin{array}{c} b_1 \\ \vdots \\ b_m \end{array}\right) \\ \\ &\implies \color{red}{A^tAx = A^t b} \end{aligned}

This equation is called normal equation. Moreover, AtAA^tA is positive (semi-)definite because xtAtAx=(Ax)t(Ax)=Ax220x^t A^t Ax = (Ax)^t (Ax) = \left\| Ax \right\|_2^2 \ge 0, so it minimizes ff. Especially, if Ax220\left\| Ax \right\|_2^2 \not = 0, it tells that there is no xx such that Ax=0Ax = \mathbf{0} except x=0x = \mathbf{0}. It also means that AA is nonsingular whose column vectors are linearly independent. Therefore, the solution of AtAx=AtbA^t Ax = A^t b is the same as the optimal solution of Ax=bAx = b when AA is nonsingular.

Another Approach When m>nm > n

In this case, AA maps the lower dimension to the higher dimension. For example, when m=3m = 3 and n=2n = 2, there may be a point bb in the objective space that AxAx cannot completely cover.


So, the linear system Ax=bAx = b cannot be solved exactly. However, finding the point closest to bb might be a good approximation. Let y=Axy = Ax be the closest point to bb.


When yy is closest to bb, the vector by=bAxb - y = b - Ax is parallel to the normal of the hyperplane ImAIm A. Besides, Ae1,,AenAe_1, \cdots, Ae_n are the basis of ImAIm A where e1,,ene_1, \cdots, e_n are the standard basis of the original space. In other words, it implies that column vectors of AA are the basis of ImAIm A, which means they are all perpendicular to bAxb - Ax. Therefore, At(bAx)=0A^t (b - Ax) = \mathbf{0}, and it yields the normal equation. Accordingly, the normal equation estimates the optimal solution of Ax=bAx = b even though there exists no the exact solution.

Another Approach Using Projector


A square matrix PP is called a projector if it is idempotent, which means P2=PP^2 = P. Such PP projects any vector to ImPIm P but leaves unchanged any vector that is already on ImPIm P. If PP is symmetric, it is called an orthogonal projector, and IPI - P is also an orthogonal projector to a hyperplane perpendicular to ImPIm P. Obviously, (IP)2=I2P+P2=IP(I - P)^2 = I - 2P + P^2 = I - P. More specifically, for any vector v=Pv+(IP)vv = Pv + (I - P)v, (Pv)t((IP)v)=vtPt(vPv)=vtP(vPv)=vtPvvtP2v=vtPvvtPv=0    Pv(IP)v\begin{aligned} (Pv)^t ((I - P)v) &= v^t P^t (v - Pv) = v^t P (v - Pv) \\ &= v^t P v - v^t P^2 v = v^t P v - v^t P v = 0 \\ \\ \implies &Pv \perp (I - P)v \end{aligned}


Now, assume that PP is an orthogonal projector onto span(A)\text{span}(A) such that PA=APA = A. Then the cost function f(x)f(x) can be derived as follows. f(x)=bAx22=P(bAx)+(IP)(bAx)22=P(bAx)22+(IP)(bAx)22=PbPAx22+(IP)b(IP)Ax22=PbAx22+(IP)b22\begin{aligned} f(x) = \left\| b - Ax \right\|^2_2 &= \left\| P(b - Ax) + (I - P)(b - Ax) \right\|^2_2 \\ &= \left\| P(b - Ax) \right\|^2_2 + \left\| (I - P)(b - Ax) \right\|^2_2 \\ &= \left\| Pb - PAx \right\|^2_2 + \left\| (I - P)b - (I - P)Ax \right\|^2_2 \\ &= \left\| Pb - Ax \right\|^2_2 + \left\| (I - P)b \right\|^2_2 \end{aligned}

Note that (IP)A=0(I - P)A = \mathbf{0} since PA=APA = A. The second term does not depend on xx, so the residual norm is minimized when xx is chosen so that the first term is zero. PbAx=0    At(PbAx)=AtPbAtAx=AtPtbAtAx=(PA)tbAtAx=AtbAtAx=0\begin{aligned} Pb - Ax = \mathbf{0} \implies A^t(Pb - Ax) &= A^t Pb - A^t Ax = A^t P^t b - A^t Ax \\ &= (PA)^t b - A^t Ax = A^t b - A^t Ax = \mathbf{0} \end{aligned}

Therefore, it produces the normal equation as well.


The solution from the normal equation can be evaluated by the condition number and angle between bb and AxAx. The condition number of AA tells how close AA is a singular matrix. However, if mnm \not = n, AA is not invertible, so the pseudoinverse A+=(AtA)1AtA^{+} = (A^t A)^{-1} A^t should be introduced. In this case, the condition number of AA is Cond2(A)=A2A+2\begin{aligned} Cond_2 (A) = \left\| A \right\|_2 \left\| A^{+} \right\|_2 \end{aligned}

Meanwhile, as the angle between bb and AxAx is smaller, bb is closer to AxAx, so it is a good measure to find whether the solution is well-conditioned. This angle can be obtained as cosθ=Ax2/b2\cos \theta = \left\| Ax \right\|_2 / \left\| b \right\|_2.



[1] Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.

© 2024. All rights reserved.