Nonlinear Least Squares

Overview

Least squares data fitting can be viewed as an optimization problem. Given data points (ti,yi)(t_i, y_i) for i=1,,mi = 1, \cdots, m, the main goal is to find the vector xRnx \in \mathbb{R}^n of parameters that gives the best fit in the least squares sense to the nonlinear model function f(t,x)f(t, x), where f:Rn+1Rf: \mathbb{R}^{n+1} \to \mathbb{R}. Note that ff is a parametric function where tt is the parameter. If ff is linear, then an example could be as follows. f(t,x)=x1+x2t+x3t2++xntn1\begin{aligned} f(t, x) = x_1 + x_2 t + x_3 t^2 + \cdots + x_n t^{n-1} \end{aligned}

If ff is nonlinear, an example could be as follows. f(t,x)=x1ex2t++xn1exnt\begin{aligned} f(t, x) = x_1 e^{x_2 t} + \cdots + x_{n-1} e^{x_n t} \end{aligned}

Derivation

First, define the residual function r:RnRmr: \mathbb{R}^n \to \mathbb{R}^m, ri(x)=yif(tix)(i=1,2,,m)\begin{aligned} r_i(x) = y_i - f(t_i - x) \quad (i = 1, 2, \cdots, m) \end{aligned}

where yiy_i is the ground truth and f(tix)f(t_i - x) is the estimated value for a function f:Rn+1Rf: \mathbb{R}^{n+1} \to \mathbb{R}. As such, this residual function can lead to the minimization problem of the cost function ϕ\phi defined by ϕ(x)=12r(x)tr(x)=12(r1(x)rm(x))(r1(x)rm(x))=12(r12(x)++rm2(x))\begin{aligned} \phi (x) &= \frac{1}{2} r(x)^t r(x) = \frac{1}{2} \left(\begin{array}{ccc} r_1(x) & \cdots & r_m(x) \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) \\ &= \frac{1}{2} (r_1^2(x) + \cdots + r_m^2(x)) \end{aligned}

Then, the gradient of ϕ\phi is obtained as follows. ϕxi=r1(x)r1xi++rm(x)rmxi    ϕ(x)=(ϕx1ϕxn)=(r1(x)r1x1++rm(x)rmx1r1(x)r1xn++rm(x)rmxn)=(r1x1rmx1r1xirmxir1xnrmxn)(r1(x)rm(x))=J(x)tr(x)\begin{aligned} \frac{\partial \phi}{\partial x_i} &= r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} \\ \\ \implies \nabla \phi (x) &= \left(\begin{array}{c} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{array}\right) = \left(\begin{array}{c} r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \\ \vdots \\ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \\ \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) = \color{limegreen}{J(x)^t r(x)} \end{aligned}

where J(x)J(x) is the Jacobian matrix of r(x)r(x). In addition, the Hessian matrix Hϕ(x)H_{\phi}(x) of ϕ(x)\phi (x) is as follows: Hϕ(x)=(2ϕx122ϕx1xn2ϕxnx12ϕxn2)=(x1[r1(x)r1x1++rm(x)rmx1]xn[r1(x)r1x1++rm(x)rmx1]x1[r1(x)r1xn++rm(x)rmxn]xn[r1(x)r1xn++rm(x)rmxn])    (Hϕ)ij=xj[r1(x)r1xi++rm(x)rmxi]=(r1xjr1xi+r1(x)2r1xixj)++(rmxjrmxi+rm(x)2rmxixj)\begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial^2 \phi}{\partial x_1^2} & \cdots & \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ & \vdots & \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] \\ & \vdots & \\ \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] \end{array}\right) \\ \\ \implies (H_{\phi})_{ij} &= \frac{\partial}{\partial \color{orange}x_j} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_i} \right] \\ \\ &= \left( \frac{\partial r_1}{\partial \color{orange}x_j} \frac{\partial r_1}{\partial \color{limegreen}x_i} + r_1(x)\frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) + \cdots + \left( \frac{\partial r_m}{\partial \color{orange}x_j} \frac{\partial r_m}{\partial \color{limegreen}x_i} + r_m(x)\frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) \end{aligned}

where (Hϕ)ij(H_{\phi})_{ij} is the (i,j)(i, j) element of HϕH_{\phi}. So it can be rewritten as Hϕ(x)=(r1x1r1x1++rmx1rmx1r1xnr1x1++rmxnrmx1r1x1r1xn++rmx1rmxnr1xnr1xn++rmxnrmxn)+(r1(x)2r1x1x1++rm(x)2rmx1x1r1(x)2r1x1xn++rm(x)2rmx1xnr1(x)2r1xnx1++rm(x)2rmxnx1r1(x)2r1xnxn++rm(x)2rmxnxn)=(r1x1rmx1r1xnrmxn)(r1x1r1xnrmx1rmxn)+i=1mri(x)(2rix1x12rix1xn2rixnx12rixnxn)=J(x)tJ(x)+i=1mri(x)Hri(x)\begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_n} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_n} \end{array}\right) \\ \\ &+ \left(\begin{array}{ccc} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ & \vdots & \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\ & \vdots & \\ \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) + \sum_{i=1}^m r_i(x) \left(\begin{array}{ccc} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ & \vdots & \\ \frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \color{limegreen}{J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x)} \end{aligned}

Before using this gradient and Hessian matrix, look into ff in the lower dimension to help an obvious understaning. For a while, let ff a convex function where f:RRf: \mathbb{R} \to \mathbb{R}. Then, by Taylor’s theorem, f(x+h)f(x)+f(x)h+12f(x)h2\begin{aligned} f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 \end{aligned}

It implies that a well-selected hh can make f(x)f(x) move to the direction ff is decreasing. For finding this value hh, this approximation can be viewed as a function gg for hh, and the critical point of gg can be searched for. g(h)=h2(12f(x))+hf(x)+f(x)g(h)=hf(x)+f(x)=0h=f(x)f(x)(f(x)>0)\begin{aligned} g(h) &= h^2 \left( \frac{1}{2} f''(x) \right) + hf'(x) + f(x) \\ g'(h) &= h f''(x) + f'(x) = 0 \\ h &= - \frac{f'(x)}{f''(x)} \quad (f''(x) > 0) \end{aligned}

Note that f(x)>0f''(x) > 0 since ff is convex. In other words, xx should be moved by hh for decreasing ff. More specifically, xx should be moved by hh in the direction (1,0)(1, 0). In the same manner, the minimum of ff in a higher dimension can be obtained as well. Suppose that ff is a function for xRnx \in \mathbb{R}^n and its Hessian matrix HfH_f is positive definite, which corresponds to the previous convex function according to this note. By Taylor’s theorem, f(x+s)f(x)+f(x)ts+12stHf(x)s\begin{aligned} f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s \end{aligned}

Following the process in the previous one-dimension, xx should be moved by ss. More specifically, xx should be moved by s\Vert s \Vert in direction ss. This ss can be obtained by solving the following linear system. Hf(x)s=f(x)\begin{aligned} H_f(x) s = - \nabla f(x) \end{aligned}

Returning to the original problem, this result implies that ϕ(x)\phi(x) can be minimized by finding ss such that Hϕ(x)s=ϕ(x)H_{\phi}(x)s = - \nabla \phi(x). More specifically, Hϕ(x)s=ϕ(x)    (J(x)tJ(x)+i=1mri(x)Hri(x))s=J(x)tr(x)\begin{aligned} H_{\phi}(x) s = - \nabla \phi(x) \iff \left( J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x) \right) s = - J(x)^t r(x) \end{aligned}

This linear system can be solved by some famous methods.

Gauss-Newton Method

This method assumes that i=1mri(x)Hri(x)\sum_{i=1}^m r_i(x) H_{r_i}(x) will be negligible when the residual function rr is small. It is possible when the model function ff fits the data well so that ri(x)=yif(ti,x)r_i(x) = y_i - f(t_i, x) can be small. In this case, J(x)tJ(x)sJ(x)tr(x)\begin{aligned} J(x)^t J(x) s \approx - J(x)^t r(x) \end{aligned}

It reminds the normal equation. In this perspective, this problem can be viewed as solving the linear system J(x)s=r(x)J(x)s = -r(x). In effect, the Gauss-Newton method replaces a nonlinear squares problem by a sequence of linear least squares problems. However, there are some drawbacks. This problem may fail to converge when it is not started close enough to the solution. Moreover, if i=1mri(x)Hri(x)\sum_{i=1}^m r_i(x) H_{r_i}(x) is not negligible, the approximation may be inaccurate. Finally, the solution could hardly find when Hϕ(x)H_{\phi}(x), Jt(x)J(x)J^t(x) J(x), or J(x)J(x) is ill-conditioned.

Levenberg-Marquardt Method

This is useful even when Hϕ(x)H_{\phi}(x), Jt(x)J(x)J^t(x) J(x), or J(x)J(x) is ill-conditioned, or rank-deficient. It does not ignore the term i=1mri(x)Hri(x)\sum_{i=1}^m r_i(x) H_{r_i}(x), but replace this by μI\mu I for μR\mu \in \mathbb{R}. (J(x)tJ(x)+μI)sJ(x)tr(x)\begin{aligned} \left( J(x)^t J(x) + \mu I \right) s \approx - J(x)^t r(x) \end{aligned}

This technique, which is called regularization, makes the pseudo-Hf(x)H_{f(x)} positive definite. Its characteristic is more obvious when it is considered as that of the lower dimension. f(x+h)f(x + h) is decreasing when its second order derivative is potisitive. This property is corresponding to the Hessian matrix of a function in a higher dimension. If the second order derivative of ff is not greater than zero, it can be biased by μ\mu as follows. {f(x)>0    h=f(x)f(x)f(x)0    h=f(x)f(x)+μ(f(x)+μ>0)\begin{aligned} \begin{cases} f''(x) > 0 \implies h = - \frac{f'(x)}{f''(x)} \\ \\ f''(x) \le 0 \implies h = - \frac{f'(x)}{f''(x) + \mu} \quad (f''(x) + \mu > 0) \end{cases} \end{aligned}

Considering this process and applying it to the higer dimension in the same manner, the following properties are induced. f(x)+μ>0(f(x)0)J(x)tJ(x)+μI is positive definite (J(x)tJ(x) is NOT positive definite)\begin{aligned} f''(x) + \mu > 0 \quad &(f''(x) \leq 0) \\ \Updownarrow &\quad \\ J(x)^t J(x) + \mu I \text{ is positive definite } \quad &(J(x)^t J(x) \text{ is NOT positive definite}) \end{aligned}

Therefore, it can be said that the regularization term μI\mu I makes the pseudo-HϕH_{\phi} positive definite. If this regularization term is adjusted, the Levenberg-Marquardt method can be interpreted as the weighted combination of the Gauss-Newton and gradient descent methods. {J(x)tJ(x)sJ(x)tr(x)(μI=O)Gauss-NewtonsJ(x)tr(x)=ϕ(x)(μI=IJ(x)tJ(x))Gradient Descent\begin{aligned} \begin{cases} J(x)^t J(x) s \approx - J(x)^t r(x) \quad (\mu I = O) \quad &\text{Gauss-Newton} \\\\ s \approx - J(x)^t r(x) = - \nabla \phi(x) \quad (\mu I = I - J(x)^t J(x)) \quad &\text{Gradient Descent} \end{cases} \end{aligned}

Meanwhile, there exists a singular value of J(x)J(x) close to zero when J(x)J(x) is ill-conditioned. Using the fact that J(x)J(x) is diagonalizable since it is symmetric, J(x)tJ(x)=(UΣVt)t(UΣVt)=VΣtUtUΣV=VΣ2V    Vt(J(x)tJ(x))V=Σ2=diag(σ12,,σn2)\begin{aligned} &J(x)^t J(x) = (U \Sigma V^t)^t (U \Sigma V^t) = V \Sigma^t U^t U \Sigma V = V \Sigma^2 V \\\\ &\implies V^t (J(x)^t J(x)) V = \Sigma^2 = diag(\sigma_1^2, \cdots, \sigma_n^2) \end{aligned}

where UU is an m×mm \times m orthogonal matrix, VV is an n×nn \times n orthogonal matrix, Σ\Sigma is an m×nm \times n diagonal matrix, and σi\sigma_i is the ii-th singular value of J(x)J(x). It means that the eigenvalues of J(x)tJ(x)J(x)^t J(x) are the square of the singular values of J(x)J(x). In case that J(x)J(x) is ill-conditioned, for the eigenvalue λ\lambda which is close to zero and the eigenvector vv corresponding to λ\lambda, J(x)tJ(x)v=λv    (J(x)tJ(x)+μI)v=λv+μv=(λ+μ)v\begin{aligned} J(x)^t J(x)v &= \lambda v \implies (J(x)^t J(x) + \mu I) v = \lambda v + \mu v = (\lambda + \mu)v \end{aligned}

Note that J(x)tJ(x)+μIJ(x)^t J(x) + \mu I is symmetric since J(x)tJ(x)J(x)^t J(x) is symmetric and it is also positive definite if μ\mu is selected well. Accordingly, the Cholesky factorization can be applied to it. (J(x)tJ(x)+μI)v=(λ+μ)v=J^(x)tJ^(x)v\begin{aligned} (J(x)^t J(x) + \mu I) v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x) v \end{aligned}

where J^(x)tJ^(x)\hat{J}(x)^t \hat{J}(x) is positive definite and its all eigenvalues are positive. Then the eigenvalue of J^(x)tJ^(x)\hat{J}(x)^t \hat{J}(x) is λ+μ\lambda + \mu, so the singular value of J^(x)\hat{J}(x) is λ+μ\sqrt{\lambda + \mu}. Therefore, the original singular value of J(x)J(x), λ\sqrt{\lambda}, is increased to λ+μ\sqrt{\lambda + \mu}, and it helps the ill-conditioned J(x)J(x) to be far from this condition. The corresponding linear least squares problem to Levenberg-Marquardt method is, (J(x)μ I)s(r(x)0)\begin{aligned} \left(\begin{array}{c} J(x) \\ \sqrt{\mu} \ I \end{array}\right) s \cong \left(\begin{array}{c} -r(x) \\ \mathbf{0} \end{array}\right) \end{aligned}

Reference

[1] Michael T. Heath, Scientific Computing: An Introductory Survey. 2nd Edition, McGraw-Hill Higher Education.


© 2025. All rights reserved.