Nonlinear Least Squares

Consider the residual function r:RnRmr: \mathbb{R}^n \to \mathbb{R}^m, for xRnx \in \mathbb{R}^n ri(x)=yif(tix),(i=1,2,,m)\begin{aligned} r_i(x) = y_i - f(t_i - x), \quad (i = 1, 2, \cdots, m) \end{aligned}

where yiy_i is the ground truth and tit_i is the observed value for a function f:Rn+1Rf: \mathbb{R}^{n+1} \to \mathbb{R}. Given rr, we want to minimize the cost function ϕ\phi, ϕ(x)=12r(x)tr(x)=12(r1(x)rm(x))(r1(x)rm(x))=12(r12(x)++rm2(x))\begin{aligned} \phi (x) &= \frac{1}{2} r(x)^t r(x) = \frac{1}{2} \left(\begin{array}{ccc} r_1(x) & \cdots & r_m(x) \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) \\ &= \frac{1}{2} (r_1^2(x) + \cdots + r_m^2(x)) \end{aligned}

For minimizing ϕ\phi, the gradient of ϕ\phi is required. ϕxi=r1(x)r1xi++rm(x)rmxi    ϕ(x)=(ϕx1)ϕxn))=(r1(x)r1x1++rm(x)rmx1r1(x)r1xn++rm(x)rmxn)=(r1x1rmx1r1xirmxir1xnrmxn)(r1(x)rm(x))=J(x)tr(x)\begin{aligned} \frac{\partial \phi}{\partial x_i} &= r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} \\ \\ \implies \nabla \phi (x) &= \left(\begin{array}{c} \frac{\partial \phi}{\partial x_1}) \\ \vdots \\ \frac{\partial \phi}{\partial x_n}) \end{array}\right) = \left(\begin{array}{c} r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \\ \vdots \\ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \\ \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) = \color{limegreen}{J(x)^t r(x)} \end{aligned}

where J(x)J(x) is the Jacobian matrix of r(x)r(x). In addition, the Hessian matrix Hϕ(x)H_{\phi}(x) of ϕ(x)\phi (x) is as follows: Hϕ(x)=(2ϕx122ϕx1xn2ϕxnx12ϕxn2)=(x1[r1(x)r1x1++rm(x)rmx1]xn[r1(x)r1x1++rm(x)rmx1]x1[r1(x)r1xn++rm(x)rmxn]xn[r1(x)r1xn++rm(x)rmxn])    (Hϕ)ij=xj[r1(x)r1xi++rm(x)rmxi]=(r1xjr1xi+r1(x)2r1xixj)++(rmxjrmxi+rm(x)2rmxixj)\begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial^2 \phi}{\partial x_1^2} & \cdots & \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ & \vdots & \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] \\ & \vdots & \\ \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] \end{array}\right) \\ \\ \implies (H_{\phi})_{ij} &= \frac{\partial}{\partial \color{orange}x_j} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_i} \right] \\ \\ &= \left( \frac{\partial r_1}{\partial \color{orange}x_j} \frac{\partial r_1}{\partial \color{limegreen}x_i} + r_1(x)\frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) + \cdots + \left( \frac{\partial r_m}{\partial \color{orange}x_j} \frac{\partial r_m}{\partial \color{limegreen}x_i} + r_m(x)\frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) \end{aligned}

where (Hϕ)ij(H_{\phi})_{ij} is the (i,j)(i, j) element of HϕH_{\phi}. So it can be rewritten as Hϕ(x)=(r1x1r1x1++rmx1rmx1r1xnr1x1++rmxnrmx1r1x1r1xn++rmx1rmxnr1xnr1xn++rmxnrmxn)+(r1(x)2r1x1x1++rm(x)2rmx1x1r1(x)2r1x1xn++rm(x)2rmx1xnr1(x)2r1xnx1++rm(x)2rmxnx1r1(x)2r1xnxn++rm(x)2rmxnxn)=(r1x1rmx1r1xnrmxn)(r1x1r1xnrmx1rmxn)+i=1mri(x)(2rix1x12rix1xn2rixnx12rixnxn)=J(x)tJ(x)+i=1mri(x)Hri(x)\begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_n} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_n} \end{array}\right) \\ \\ &+ \left(\begin{array}{ccc} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ & \vdots & \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\ & \vdots & \\ \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) + \sum_{i=1}^m r_i(x) \left(\begin{array}{ccc} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ & \vdots & \\ \frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \color{limegreen}{J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x)} \end{aligned}

Meanwhile, for a convex function f:RRf: \mathbb{R} \to \mathbb{R}, f(x+h)f(x)+f(x)h+12f(x)h2\begin{aligned} f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 \end{aligned}

by Taylor Theorem. It tells us that a good hh can make f(x)f(x) move to the direction ff is decreasing. For finding this value hh, we can define the function gg for hh and find the critical point of gg. g(h)=h2(12f(x))+hf(x)+f(x)g(h)=hf(x)+f(x)=0h=f(x)f(x)(f(x)>0)\begin{aligned} g(h) &= h^2 \left( \frac{1}{2} f''(x) \right) + hf'(x) + f(x) \\ g'(h) &= h f''(x) + f'(x) = 0 \\ h &= - \frac{f'(x)}{f''(x)} \quad (f''(x) > 0) \end{aligned}

In other words, xx should be moved by hh for decreasing ff. In a higher dimension, the minimum of ff can be found in the same manner. Suppose that ff is a function for xRnx \in \mathbb{R}^n and its Hessian matrix HfH_f is positive definite. By Taylor Theorem, f(x+s)f(x)+f(x)ts+12stHf(x)s\begin{aligned} f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s \end{aligned}

Following the process in one dimension, xx should be moved by ss, then the solution of Hf(x)s=f(x)\begin{aligned} H_f(x) s = - \nabla f(x) \end{aligned}

Returning to the original problem, ϕ(x)\phi(x) can be minimized by finding ss such that Hϕ(x)s=ϕ(x)H_{\phi}(x)s = - \nabla \phi(x). More specifically, this linear system is as follows: Hϕ(x)s=ϕ(x)    (J(x)tJ(x)+i=1mri(x)Hri(x))s=J(x)tr(x)\begin{aligned} H_{\phi}(x) s = - \nabla \phi(x) \iff \left( J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x) \right) s = - J(x)^t r(x) \end{aligned}

This linear system can be solved by some famous methods.

1. Gauss-Newton Method

This method assumes that i=1mri(x)Hri(x)\sum_{i=1}^m r_i(x) H_{r_i}(x) will be negligible when the residual function rr is small. It is possible when the model function ff fits the data well so that ri(x)=yif(ti,x)r_i(x) = y_i - f(t_i, x) can be small. In this case, J(x)tJ(x)sJ(x)tr(x)\begin{aligned} J(x)^t J(x) s \approx - J(x)^t r(x) \end{aligned}

It recalls the normal equation, so this problem can be viewed as solving the linear system J(x)s=r(x)J(x)s = -r(x). However, there are some drawbacks. This problem may fail to converge when it is not started close enough to the real solution. Moreover, if i=1mri(x)Hri(x)\sum_{i=1}^m r_i(x) H_{r_i}(x) is not small, it cannot be negligible. Finally, the solution could hardly find when Hϕ(x)H_{\phi}(x), Jt(x)J(x)J^t(x) J(x), or J(x)J(x) is ill-conditioned.

2. Levenberg-Marquardt Method

This is useful even when Hϕ(x)H_{\phi}(x), Jt(x)J(x)J^t(x) J(x), or J(x)J(x) is ill-conditioned, or rank-deficient. It does not ignore the term i=1mri(x)Hri(x)\sum_{i=1}^m r_i(x) H_{r_i}(x), but replace this by μI\mu I for μR\mu \in \mathbb{R}. (J(x)tJ(x)+μI)sJ(x)tr(x)\begin{aligned} \left( J(x)^t J(x) + \mu I \right) s \approx - J(x)^t r(x) \end{aligned}

This technique, which is called regularization, makes the pseudo-Hf(x)H_{f(x)} positive definite. Its characteristics is more obvious when it is considered as that of the lower dimension. f(x+h)f(x + h) is decreasing when its second order derivative is potisitive. This property is corresponding to the Hessian matrix of a function in a higher dimension. If the second order derivative of ff is not greater than zero, it can be biased by μ\mu as follows: {f(x)>0    h=f(x)f(x)f(x)0    h=f(x)f(x)+μwhere f(x)+μ>0\begin{aligned} \begin{cases} f''(x) > 0 \implies h = - \frac{f'(x)}{f''(x)} \\ \\ f''(x) \le 0 \implies h = - \frac{f'(x)}{f''(x) + \mu} \quad \text{where } f''(x) + \mu > 0 \end{cases} \end{aligned}

Considering this process and applying it to the higer dimension in the same manner, the following properties are induced. f(x)+μ>0where f(x)0J(x)tJ(x)+μI is positive definitewhere Hϕ(x)(J(x)tJ(x)) is NOT positive definite\begin{gather} f''(x) + \mu > 0 \quad \text{where } f''(x) \le 0 \\ \Updownarrow \\ J(x)^t J(x) + \mu I \text{ is positive definite} \\ \text{where } H_{\phi}(x) \left( \approx J(x)^t J(x) \right) \text{ is NOT positive definite} \end{gather}

Therefore, it can be said that the regularization term, μI\mu I makes the pseudo-Hf(x)H_{f(x)}, which is J(x)tJ(x)J(x)^t J(x), positive definite. Moreover, Levenberg-Marquardt method can be interpreted as the weighted combination of Gauss-Newton and gradient descent methods. {J(x)tJ(x)sJ(x)tr(x)(μI=O): Gauss-NewtonsJ(x)tr(x)=ϕ(x)(μI=IJ(x)tJ(x)): Gradient Descent\begin{aligned} \begin{cases} J(x)^t J(x) s \approx - J(x)^t r(x) \quad (\mu I = O) \quad \text{: Gauss-Newton} \\ \\ s \approx - J(x)^t r(x) = - \nabla \phi(x) \quad (\mu I = I - J(x)^t J(x)) \quad \text{: Gradient Descent} \end{cases} \end{aligned}

Meanwhile, there exists a singular value of J(x)J(x) close to zero when J(x)J(x) is ill-conditioned. Using the fact that J(x)J(x) can be diagonalizable, J(x)tJ(x)=(UΣVt)t(UΣVt)=VΣtUtUΣV=VΣ2V    Vt(J(x)tJ(x))V=Σ2=diag(σ12,,σn2)\begin{aligned} J(x)^t J(x) &= (U \Sigma V^t)^t (U \Sigma V^t) = V \Sigma^t U^t U \Sigma V = V \Sigma^2 V \\ \implies V^t (J(x)^t J(x)) V &= \Sigma^2 = diag(\sigma_1^2, \cdots, \sigma_n^2) \end{aligned}

where σi\sigma_i is the ii-th singular value of J(x)J(x). It means that the eigenvalues of J(x)tJ(x)J(x)^t J(x) are the square of the singular values of J(x)J(x). In case that J(x)J(x) is ill-conditioned, for the eigenvalue λ\lambda which is close to zero and the eigenvector vv corresponding to λ\lambda, J(x)tJ(x)v=λv    (J(x)tJ(x)+μI)v=λv+μv=(λ+μ)v\begin{aligned} J(x)^t J(x)v &= \lambda v \implies (J(x)^t J(x) + \mu I) v = \lambda v + \mu v = (\lambda + \mu)v \end{aligned}

Note that J(x)tJ(x)+μIJ(x)^t J(x) + \mu I is symmetric since J(x)tJ(x)J(x)^t J(x) is symmetric and it is also positive definite if μ\mu is selected well. Accordingly, the Cholesky factorization can be applied to it as follows: (J(x)tJ(x)+μI)v=(λ+μ)v=J^(x)tJ^(x)v\begin{aligned} (J(x)^t J(x) + \mu I) v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x) v \end{aligned}

where J^(x)tJ^(x)\hat{J}(x)^t \hat{J}(x) is positive definite and its all eigenvalues are positive. Then the eigenvalue of J^(x)tJ^(x)\hat{J}(x)^t \hat{J}(x) is λ+μ\lambda + \mu, so the singular value of J^(x)\hat{J}(x) is λ+μ\sqrt{\lambda + \mu}. Therefore, the original singular value of J(x)J(x), λ\sqrt{\lambda}, is increased to λ+μ\sqrt{\lambda + \mu}, and it helps the ill-conditioned J(x)J(x) to be far from this condition. The corresponding linear least squares problem to Levenberg-Marquardt method is, (J(x)μ I)s(r(x)0)\begin{aligned} \left(\begin{array}{c} J(x) \\ \sqrt{\mu} \ I \end{array}\right) s \cong \left(\begin{array}{c} -r(x) \\ 0 \end{array}\right) \end{aligned}


[1] Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.

© 2023. All rights reserved.