Normal Equation

For an m×nm \times n matrix AA and a vector bRmb \in \mathbb{R}^m, the linear system Ax=bAx = b is to find the optimal solution xRnx \in \mathbb{R}^n.

1. The Residual Minimization

The problem Ax=bAx = b can be replaced by the problem minimizing the residual, bAx22\left\| b - Ax \right\|_2^2. Let AiA_i be the ii-th row vector of AA. Then the cost function f(x)f(x) can be defined as follows: f(x)=bAx22=(b1A1txbmAmtx)22=k(bkAktx)2\begin{aligned} f(x) &= \left\| b - Ax \right\|_2^2 = \left\| \left(\begin{array}{c} b_1 - A^t_1 x \\ \vdots \\ b_m - A^t_m x \end{array}\right) \right\|_2^2 = \sum_{k} (b_k - A^t_k x)^2 \end{aligned}

For minimizing ff, finding the critical point is required. Let aija_{ij} be the (i,j)(i, j) element of AA, then fxi=k2(bkAktx)xi(bkAktx)=2k(bkAktx)(aki)=0    kakiAktx=kakibk=(a1iami)(A1tAmt)x=(a1iami)(b1bm)    (a11am1a1namn)(A1tAmt)x=(a11am1a1namn)(b1bm)    AtAx=Atb\begin{aligned} \frac{\partial f}{\partial x_i} &= \sum_{k} 2(b_k - A^t_k x) \frac{\partial}{\partial x_i} (b_k - A^t_k x) = 2 \sum_{k} (b_k - A^t_k x) (- a_{ki}) = 0 \\ &\implies \sum_{k} a_{ki} A^t_k x = \sum_{k} a_{ki} b_k = \left(\begin{array}{ccc} a_{1i} & \cdots & a_{mi} \end{array}\right) \left(\begin{array}{c} A^t_1 \\ \vdots \\ A^t_m \end{array}\right) x = \left(\begin{array}{ccc} a_{1i} & \cdots & a_{mi} \end{array}\right) \left(\begin{array}{c} b_1 \\ \vdots \\ b_m \end{array}\right) \\ &\implies \left(\begin{array}{ccc} a_{11} & \cdots & a_{m1} \\ & \vdots & \\ a_{1n} & \cdots & a_{mn} \end{array}\right) \left(\begin{array}{c} A^t_1 \\ \vdots \\ A^t_m \end{array}\right) x = \left(\begin{array}{ccc} a_{11} & \cdots & a_{m1} \\ & \vdots & \\ a_{1n} & \cdots & a_{mn} \end{array}\right) \left(\begin{array}{c} b_1 \\ \vdots \\ b_m \end{array}\right) \\ \\ &\implies \color{red}{A^tAx = A^t b} \end{aligned}

This equation is called normal equation. Moreover, AtAA^tA is positive (semi-)definite because xtAtAx=(Ax)t(Ax)=Ax220x^t A^t Ax = (Ax)^t (Ax) = \left\| Ax \right\|_2^2 \ge 0, so it minimizes ff. Especially, if Ax220\left\| Ax \right\|_2^2 \not = 0, it tells that there is no xx such that Ax=0Ax = 0 except x=0x = 0. It also means that AA is nonsingular whose column vectors are linearly independent. Therefore, the solution of AtAx=AtbA^t Ax = A^t b is the optimal solution of Ax=bAx = b when AA is nonsingular.

2. Another Approach When m>nm > n

In this case, AA maps the lower dimension to the higher dimension. For example, when m=3m = 3 and n=2n = 2, there may be a point bb in the objective space which AxAx cannot cover.


So, the linear system Ax=bAx = b cannot be solved exactly. However, we can still find the point which is closest to bb. Let y=Axy = Ax be the closest point to bb.


When yy is closest to bb, the vector by=bAxb - y = b - Ax is perpendicular to the normal of hyperplane ImAIm A. Suppose that e1,,ene_1, \cdots, e_n are the standard basis in the original space. Then Ae1,,AenAe_1, \cdots, Ae_n consist of the basis of the objective space. In other words, each column vector of AA consists of the hyperplane ImAIm A, which means it is perpendicular to bAxb - Ax. Therefore, At(bAx)=0A^t (b - Ax) = 0, it yields the normal equation. Accordingly, the normal equation estimates the optimal solution of Ax=bAx = b even though the exact solution cannot be found.

3. Another Approach Using Projector


A square matrix PP is idempotent if P2=PP^2 = P, and it is called projector. PP projects any vector to ImPIm P and if the vectors are already on ImPIm P, they would be still after projected. If PP is symmetric, it is called orthogonal projector, and IPI - P is also orthogonal projector to the hyperplane which is perpendicular to ImPIm P. Of course, (IP)2=I2P+P2=IP(I - P)^2 = I - 2P + P^2 = I - P. For a vector vv, (Pv)t((IP)v)=vtPt(vPv)=vtP(vPv)=vtPvvtP2v=vtPvvtPv=0    Pv(IP)v    v=Pv+(IP)v\begin{aligned} (Pv)^t ((I - P)v) &= v^t P^t (v - Pv) = v^t P (v - Pv) \\ &= v^t P v - v^t P^2 v = v^t P v - v^t P v = 0 \\ \\ &\implies Pv \perp (I - P)v \\ &\implies v = Pv + (I - P)v \end{aligned}


Now, to apply this to solving Ax=bAx = b, assume that PP is the orthogonal projector onto spanAspan A, which means PA=APA = A. Then the cost function f(x)f(x) can be rewritten as f(x)=bAx22=P(bAx)+(IP)(bAx)22=P(bAx)22+(IP)(bAx)22=PbPAx22+(IP)b(IP)Ax22=PbAx22+(IP)b22\begin{aligned} f(x) = \left\| b - Ax \right\|^2_2 &= \left\| P(b - Ax) + (I - P)(b - Ax) \right\|^2_2 \\ &= \left\| P(b - Ax) \right\|^2_2 + \left\| (I - P)(b - Ax) \right\|^2_2 \\ &= \left\| Pb - PAx \right\|^2_2 + \left\| (I - P)b - (I - P)Ax \right\|^2_2 \\ &= \left\| Pb - Ax \right\|^2_2 + \left\| (I - P)b \right\|^2_2 \end{aligned}

Therefore, for minimizing this cost function, PbAx22\left\| Pb - Ax \right\|^2_2 should be zero. PbAx=AtPbAtAx=AtPtbAtAx=(PA)tbAtAx=AtbAtAx=O\begin{aligned} Pb - Ax &= A^t Pb - A^t Ax = A^t P^t b - A^t Ax \\ &= (PA)^t b - A^t Ax = A^t b - A^t Ax = O \end{aligned}

Note that it produces the normal equation as well.

4. Evaluation

The solution from the normal equation can be evaluated by the condition number and angle between bb and AxAx. The condition number of AA tells how close to singular AA is. However, if mnm \not = n, AA is not invertible, so the pseudoinverse A+=(AtA)1AtA^{+} = (A^t A)^{-1} A^t should be introduced. In this case, the condition number of AA is Cond2(A)=A2A+2\begin{aligned} Cond_2 (A) = \left\| A \right\|_2 \left\| A^{+} \right\|_2 \end{aligned}

Meanwhile, as the angle between bb and AxAx is smaller, bb is closer to AxAx, so it is a good measure to find whether the solution is well-conditioned. This angle can be obtained as cosθ=Ax2b2\cos \theta = \frac{\left\| Ax \right\|_2}{\left\| b \right\|_2}.



[1] Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.

© 2022. All rights reserved.