Least squares data fitting can be viewed as an optimization problem. Given data points $(t_i, y_i)$ for $i = 1, \cdots, m$, the goal is to find the vector $x \in \mathbb{R}^n$ of parameters that gives the best fit in the least squares sense to the model function $f(t, x)$, where $f: \mathbb{R}^{n+1} \to \mathbb{R}$. Note that $t$ is the independent variable, while $x$ collects the model parameters. If $f$ is linear in $x$, an example could be as follows.

$$ f(t, x) = x_1 + x_2 t + x_3 t^2 + \cdots + x_n t^{n-1} $$
If $f$ is nonlinear in $x$, an example could be as follows.

$$ f(t, x) = x_1 e^{x_2 t} + \cdots + x_{n-1} e^{x_n t} $$
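As a running illustration, the sketch below sets up a two-parameter instance of the exponential model above, $f(t, x) = x_1 e^{x_2 t}$, and generates noisy synthetic data from it. The function names (`model`, `generate_data`) and the chosen true parameters are assumptions made for this example only, not part of the original formulation.

```python
import numpy as np

def model(t, x):
    # Two-parameter nonlinear model f(t, x) = x1 * exp(x2 * t).
    return x[0] * np.exp(x[1] * t)

def generate_data(m=50, x_true=(2.0, -1.0), noise=0.05, seed=0):
    # Sample m points t_i on [0, 2] and add Gaussian noise to f(t_i, x_true).
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0, m)
    y = model(t, np.asarray(x_true)) + noise * rng.standard_normal(m)
    return t, y

t, y = generate_data()
```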
Derivation
First, define the residual function $r: \mathbb{R}^n \to \mathbb{R}^m$,

$$ r_i(x) = y_i - f(t_i, x) \quad (i = 1, 2, \cdots, m) $$
where $y_i$ is the observed value and $f(t_i, x)$ is the value predicted by the model $f: \mathbb{R}^{n+1} \to \mathbb{R}$. This residual function leads to the minimization of the cost function $\phi$ defined by

$$ \phi(x) = \frac{1}{2} r(x)^t r(x) = \frac{1}{2} \begin{pmatrix} r_1(x) & \cdots & r_m(x) \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} = \frac{1}{2} \left( r_1^2(x) + \cdots + r_m^2(x) \right) $$
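These definitions translate directly into code. The snippet below continues the synthetic example above, so `model`, `t`, and `y` are the assumed names from the earlier sketch.

```python
def residual(x, t, y):
    # r_i(x) = y_i - f(t_i, x), stacked into a vector of length m.
    return y - model(t, x)

def cost(x, t, y):
    # phi(x) = 1/2 * r(x)^t r(x)
    r = residual(x, t, y)
    return 0.5 * r @ r
```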
Then, the gradient of $\phi$ is obtained as follows.

$$ \frac{\partial \phi}{\partial x_i} = r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} $$

$$ \Longrightarrow \nabla \phi(x) = \begin{pmatrix} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{pmatrix} = \begin{pmatrix} r_1(x) \frac{\partial r_1}{\partial x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_1} \\ \vdots \\ r_1(x) \frac{\partial r_1}{\partial x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_n} \end{pmatrix} = \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} = J(x)^t r(x) $$
where $J(x)$ is the Jacobian matrix of $r(x)$. In addition, the Hessian matrix $H_\phi(x)$ of $\phi(x)$ is as follows:

$$ H_\phi(x) = \begin{pmatrix} \frac{\partial^2 \phi}{\partial x_1^2} & \cdots & \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{pmatrix} = \begin{pmatrix} \frac{\partial}{\partial x_1}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_1} \right] & \cdots & \frac{\partial}{\partial x_n}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_1} \right] \\ \vdots & & \vdots \\ \frac{\partial}{\partial x_1}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_n} \right] & \cdots & \frac{\partial}{\partial x_n}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_n} \right] \end{pmatrix} $$

$$ \Longrightarrow (H_\phi)_{ij} = \frac{\partial}{\partial x_j}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} \right] = \left( \frac{\partial r_1}{\partial x_j} \frac{\partial r_1}{\partial x_i} + r_1(x) \frac{\partial^2 r_1}{\partial x_i \partial x_j} \right) + \cdots + \left( \frac{\partial r_m}{\partial x_j} \frac{\partial r_m}{\partial x_i} + r_m(x) \frac{\partial^2 r_m}{\partial x_i \partial x_j} \right) $$
where $(H_\phi)_{ij}$ is the $(i, j)$ element of $H_\phi$. So it can be rewritten as

$$ H_\phi(x) = \begin{pmatrix} \frac{\partial r_1}{\partial x_1}\frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_1}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1}\frac{\partial r_m}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n}\frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n}\frac{\partial r_m}{\partial x_n} \end{pmatrix} + \begin{pmatrix} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{pmatrix} $$

$$ = \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} + \sum_{i=1}^m r_i(x) \begin{pmatrix} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ \frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{pmatrix} = J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x) $$
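For the two-parameter exponential example, the Jacobian of $r$ can be written in closed form, which gives the gradient $J(x)^t r(x)$ and the first term $J(x)^t J(x)$ of the Hessian directly. The sketch below again assumes the `model` and `residual` helpers defined above; the analytic derivatives hold only for that particular $f$.

```python
def jacobian(x, t):
    # J_{ij} = dr_i/dx_j with r_i = y_i - x1 * exp(x2 * t_i), so
    # dr_i/dx1 = -exp(x2 * t_i) and dr_i/dx2 = -x1 * t_i * exp(x2 * t_i).
    e = np.exp(x[1] * t)
    return np.column_stack((-e, -x[0] * t * e))

def gradient(x, t, y):
    # grad phi(x) = J(x)^t r(x)
    return jacobian(x, t).T @ residual(x, t, y)

def gauss_newton_hessian(x, t):
    # First term of H_phi(x); the sum of r_i(x) H_{r_i}(x) is omitted here.
    J = jacobian(x, t)
    return J.T @ J
```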
Before using this gradient and Hessian matrix, it helps to look at $f$ in a lower dimension first. For a moment, let $f$ be a convex function with $f: \mathbb{R} \to \mathbb{R}$. Then, by Taylor's theorem,

$$ f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 $$
This implies that a well-selected $h$ can move $x$ in a direction where $f$ decreases. To find such an $h$, the approximation can be viewed as a function $g$ of $h$, and the critical point of $g$ can be found.

$$ g(h) = h^2 \left( \frac{1}{2} f''(x) \right) + h f'(x) + f(x) $$

$$ g'(h) = h f''(x) + f'(x) = 0 \implies h = -\frac{f'(x)}{f''(x)} \quad (f''(x) > 0) $$
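As a quick sanity check, which is an illustrative example rather than part of the original derivation, take the convex function $f(x) = (x - 1)^2$ at the point $x = 3$:

$$ f'(3) = 2(3 - 1) = 4, \qquad f''(3) = 2, \qquad h = -\frac{f'(3)}{f''(3)} = -2, $$

so the step moves $x = 3$ to $x + h = 1$, which is exactly the minimizer; for a quadratic the Taylor expansion is an identity, so a single step suffices.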
Note that $f''(x) > 0$ since $f$ is convex. In other words, $x$ should be moved by $h$ to decrease $f$; more precisely, $x$ is shifted by $h$ along the $x$-axis, i.e., in the direction $(1, 0)$ of the $(x, f(x))$-plane. In the same manner, the minimum of $f$ in a higher dimension can be obtained as well. Suppose that $f$ is a function of $x \in \mathbb{R}^n$ and its Hessian matrix $H_f$ is positive definite, which corresponds to the convexity of the one-dimensional case according to this note. By Taylor's theorem,

$$ f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s $$
Following the same process as in one dimension, $x$ should be moved by $s$; more specifically, $x$ should be moved by $\|s\|$ in the direction of $s$. This $s$ can be obtained by solving the following linear system.

$$ H_f(x) s = -\nabla f(x) $$
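The sketch below takes one such Newton step on a small convex quadratic, $f(x) = \frac{1}{2} x^t A x - b^t x$ with a positive definite $A$; the matrices and the helper name `newton_step` are assumptions chosen only for illustration.

```python
import numpy as np

# Convex quadratic f(x) = 1/2 x^t A x - b^t x, so grad f = A x - b and H_f = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])

def newton_step(x):
    # Solve H_f(x) s = -grad f(x) and move to x + s.
    grad = A @ x - b
    s = np.linalg.solve(A, -grad)
    return x + s

x0 = np.array([5.0, -3.0])
x1 = newton_step(x0)  # for a quadratic, one step lands on the minimizer A^{-1} b
```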
Returning to the original problem, this result implies that $\phi(x)$ can be minimized by finding $s$ such that $H_\phi(x) s = -\nabla \phi(x)$. More specifically,

$$ H_\phi(x) s = -\nabla \phi(x) \iff \left( J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x) \right) s = -J(x)^t r(x) $$
This linear system can be solved by the well-known methods described below.
Gauss-Newton Method
This method assumes that $\sum_{i=1}^m r_i(x) H_{r_i}(x)$ is negligible when the residual function $r$ is small, which happens when the model function $f$ fits the data well so that $r_i(x) = y_i - f(t_i, x)$ is small. In this case,

$$ J(x)^t J(x) s \approx -J(x)^t r(x) $$
This is reminiscent of the normal equations. From this perspective, the problem can be viewed as solving the linear least squares problem $J(x) s \cong -r(x)$. In effect, the Gauss-Newton method replaces a nonlinear least squares problem by a sequence of linear least squares problems. However, there are some drawbacks. The iteration may fail to converge when it is not started close enough to the solution. Moreover, if $\sum_{i=1}^m r_i(x) H_{r_i}(x)$ is not negligible, the approximation may be inaccurate. Finally, the solution can hardly be found when $H_\phi(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned.
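A minimal Gauss-Newton loop for the running example, assuming the `residual` and `jacobian` helpers sketched earlier; each iteration solves the linear least squares problem $J(x) s \cong -r(x)$ with `np.linalg.lstsq`.

```python
def gauss_newton(x0, t, y, max_iter=50, tol=1e-10):
    # Iteratively solve J(x) s ~= -r(x) in the least squares sense and update x.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x, t, y)
        J = jacobian(x, t)
        s, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + s
        if np.linalg.norm(s) < tol * (1.0 + np.linalg.norm(x)):
            break
    return x

# x_hat = gauss_newton([1.0, 0.0], t, y)  # convergence depends on the starting guess
```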
Levenberg-Marquardt Method
This method is useful even when $H_\phi(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned or rank-deficient. It does not simply ignore the term $\sum_{i=1}^m r_i(x) H_{r_i}(x)$, but replaces it by $\mu I$ for $\mu \in \mathbb{R}$.

$$ \left( J(x)^t J(x) + \mu I \right) s \approx -J(x)^t r(x) $$
This technique, which is called regularization, makes the pseudo-$H_\phi(x)$ positive definite. Its effect is easier to see in the one-dimensional case: with the step $h$ above, $f(x + h)$ decreases only when the second-order derivative is positive, and this property corresponds to the positive definiteness of the Hessian matrix in higher dimensions. If the second-order derivative of $f$ is not greater than zero, it can be biased by $\mu$ as follows.

$$ \begin{cases} f''(x) > 0 \implies h = -\dfrac{f'(x)}{f''(x)} \\[2ex] f''(x) \le 0 \implies h = -\dfrac{f'(x)}{f''(x) + \mu} \quad (f''(x) + \mu > 0) \end{cases} $$
Considering this process and applying it to higher dimensions in the same manner, the following correspondence is induced.

$$ \begin{array}{c} f''(x) + \mu > 0 \quad (f''(x) \le 0) \\ \Updownarrow \\ J(x)^t J(x) + \mu I \text{ is positive definite} \quad (J(x)^t J(x) \text{ is NOT positive definite}) \end{array} $$
Therefore, it can be said that the regularization term $\mu I$ makes the pseudo-$H_\phi$ positive definite. If this regularization term is adjusted, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods.

$$ \begin{cases} J(x)^t J(x) s \approx -J(x)^t r(x) & (\mu I = O) & \text{Gauss-Newton} \\ s \approx -J(x)^t r(x) = -\nabla \phi(x) & (\mu I = I - J(x)^t J(x)) & \text{Gradient Descent} \end{cases} $$
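A compact Levenberg-Marquardt loop in that spirit, again assuming the `residual`, `jacobian`, and `cost` helpers from the earlier sketches; the specific update rule for $\mu$ (multiply or divide by 10 depending on whether the step reduces the cost) is one common heuristic, not the only choice.

```python
def levenberg_marquardt(x0, t, y, mu=1e-2, max_iter=100, tol=1e-10):
    # Solve (J^t J + mu*I) s = -J^t r at each step, adapting mu to the outcome.
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_iter):
        r = residual(x, t, y)
        J = jacobian(x, t)
        s = np.linalg.solve(J.T @ J + mu * np.eye(n), -J.T @ r)
        if cost(x + s, t, y) < cost(x, t, y):
            x = x + s       # step accepted: behave more like Gauss-Newton
            mu *= 0.1
            if np.linalg.norm(s) < tol * (1.0 + np.linalg.norm(x)):
                break
        else:
            mu *= 10.0      # step rejected: lean toward gradient descent
    return x

# x_hat = levenberg_marquardt([1.0, 0.0], t, y)
```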
Meanwhile, there exists a singular value of $J(x)$ close to zero when $J(x)$ is ill-conditioned. Using the singular value decomposition $J(x) = U \Sigma V^t$,

$$ J(x)^t J(x) = (U \Sigma V^t)^t (U \Sigma V^t) = V \Sigma^t U^t U \Sigma V^t = V \Sigma^t \Sigma V^t \implies V^t \left( J(x)^t J(x) \right) V = \Sigma^t \Sigma = \text{diag}(\sigma_1^2, \cdots, \sigma_n^2) $$
where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $\sigma_i$ is the $i$-th singular value of $J(x)$. This means that the eigenvalues of $J(x)^t J(x)$ are the squares of the singular values of $J(x)$. In case $J(x)$ is ill-conditioned, for an eigenvalue $\lambda$ close to zero and the corresponding eigenvector $v$,

$$ J(x)^t J(x) v = \lambda v \implies \left( J(x)^t J(x) + \mu I \right) v = \lambda v + \mu v = (\lambda + \mu) v $$
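This eigenvalue shift is easy to verify numerically; the snippet below is an illustrative check rather than part of the derivation, comparing the eigenvalues of $J^t J + \mu I$ with $\sigma_i^2 + \mu$ for a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((50, 3))
mu = 0.1

sigma = np.linalg.svd(J, compute_uv=False)          # singular values of J
eig = np.linalg.eigvalsh(J.T @ J + mu * np.eye(3))  # eigenvalues of J^t J + mu*I

# Both arrays should agree up to rounding: each eigenvalue equals sigma_i^2 + mu.
print(np.sort(sigma**2 + mu), np.sort(eig))
```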
Note that $J(x)^t J(x) + \mu I$ is symmetric since $J(x)^t J(x)$ is symmetric, and it is also positive definite if $\mu$ is selected well. Accordingly, the Cholesky factorization $J(x)^t J(x) + \mu I = \hat{J}(x)^t \hat{J}(x)$ can be applied to it, so that

$$ \left( J(x)^t J(x) + \mu I \right) v = (\lambda + \mu) v = \hat{J}(x)^t \hat{J}(x) v $$
where $\hat{J}(x)^t \hat{J}(x)$ is positive definite and all its eigenvalues are positive. Then $\lambda + \mu$ is an eigenvalue of $\hat{J}(x)^t \hat{J}(x)$, so the corresponding singular value of $\hat{J}(x)$ is $\sqrt{\lambda + \mu}$. Therefore, the original singular value $\sqrt{\lambda}$ of $J(x)$ is increased to $\sqrt{\lambda + \mu}$, which moves the ill-conditioned $J(x)$ away from ill-conditioning. The linear least squares problem corresponding to the Levenberg-Marquardt method is

$$ \begin{pmatrix} J(x) \\ \sqrt{\mu}\, I \end{pmatrix} s \cong \begin{pmatrix} -r(x) \\ 0 \end{pmatrix} $$
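One way to compute the Levenberg-Marquardt step in this augmented form is to stack the blocks and call a standard least squares solver, which avoids forming $J(x)^t J(x)$ explicitly; the helper name `lm_step_augmented` is an assumption for this sketch.

```python
def lm_step_augmented(J, r, mu):
    # Solve the stacked least squares problem [J; sqrt(mu) I] s ~= [-r; 0].
    n = J.shape[1]
    A = np.vstack((J, np.sqrt(mu) * np.eye(n)))
    b = np.concatenate((-r, np.zeros(n)))
    s, *_ = np.linalg.lstsq(A, b, rcond=None)
    return s
```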