Consider the residual function $r: \mathbb{R}^n \to \mathbb{R}^m$ defined, for $x \in \mathbb{R}^n$, by

$$r_i(x) = y_i - f(t_i, x), \qquad (i = 1, 2, \cdots, m)$$
where $y_i$ is the ground truth and $t_i$ is the observed value for a function $f: \mathbb{R}^{n+1} \to \mathbb{R}$. Given $r$, we want to minimize the cost function $\phi$,

$$\phi(x) = \frac{1}{2} r(x)^t r(x) = \frac{1}{2} \begin{pmatrix} r_1(x) & \cdots & r_m(x) \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} = \frac{1}{2} \left( r_1^2(x) + \cdots + r_m^2(x) \right)$$
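To make the setup concrete, here is a minimal sketch in Python (NumPy), assuming a hypothetical two-parameter model $f(t, x) = x_1 e^{x_2 t}$ and made-up data; `residual` and `cost` directly implement $r$ and $\phi$ above, and they are reused in the later sketches.

```python
import numpy as np

# Hypothetical model for illustration: f(t, x) = x[0] * exp(x[1] * t), so n = 2 parameters.
def residual(x, t, y):
    """r_i(x) = y_i - f(t_i, x), stacked into a vector of length m."""
    return y - x[0] * np.exp(x[1] * t)

def cost(x, t, y):
    """phi(x) = (1/2) * r(x)^t r(x)."""
    r = residual(x, t, y)
    return 0.5 * r @ r

# Made-up data points (t_i, y_i), for illustration only.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 0.7, 0.3, 0.1])
print(cost(np.array([2.0, -1.0]), t, y))
```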
For minimizing $\phi$, the gradient of $\phi$ is required.

$$\frac{\partial \phi}{\partial x_i} = r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i}$$

$$\Longrightarrow \nabla \phi(x) = \begin{pmatrix} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{pmatrix} = \begin{pmatrix} r_1(x) \frac{\partial r_1}{\partial x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_1} \\ \vdots \\ r_1(x) \frac{\partial r_1}{\partial x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_n} \end{pmatrix} = \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} r_1(x) \\ \vdots \\ r_m(x) \end{pmatrix} = J(x)^t r(x)$$
where $J(x)$ is the Jacobian matrix of $r(x)$. In addition, the Hessian matrix $H_\phi(x)$ of $\phi(x)$ is as follows:

$$H_\phi(x) = \begin{pmatrix} \frac{\partial^2 \phi}{\partial x_1^2} & \cdots & \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{pmatrix} = \begin{pmatrix} \frac{\partial}{\partial x_1}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_1} \right] & \cdots & \frac{\partial}{\partial x_n}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_1} \right] \\ \vdots & & \vdots \\ \frac{\partial}{\partial x_1}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_n} \right] & \cdots & \frac{\partial}{\partial x_n}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_n} \right] \end{pmatrix}$$

$$\Longrightarrow (H_\phi)_{ij} = \frac{\partial}{\partial x_j}\!\left[ r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} \right] = \left( \frac{\partial r_1}{\partial x_j} \frac{\partial r_1}{\partial x_i} + r_1(x) \frac{\partial^2 r_1}{\partial x_i \partial x_j} \right) + \cdots + \left( \frac{\partial r_m}{\partial x_j} \frac{\partial r_m}{\partial x_i} + r_m(x) \frac{\partial^2 r_m}{\partial x_i \partial x_j} \right)$$
where $(H_\phi)_{ij}$ is the $(i, j)$ element of $H_\phi$. So it can be rewritten as

$$H_\phi(x) = \begin{pmatrix} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_n} \end{pmatrix} + \begin{pmatrix} r_1(x) \frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x) \frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x) \frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x) \frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ r_1(x) \frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x) \frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x) \frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x) \frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{pmatrix}$$

$$= \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n} \end{pmatrix} + \sum_{i=1}^{m} r_i(x) \begin{pmatrix} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ \frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{pmatrix} = J(x)^t J(x) + \sum_{i=1}^{m} r_i(x) H_{r_i}(x)$$
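The identity $\nabla \phi(x) = J(x)^t r(x)$ can be checked numerically. The sketch below reuses the hypothetical `residual`, `cost`, and data from the snippet above, and builds a finite-difference Jacobian `jacobian` (also reused later) for the check.

```python
import numpy as np

def jacobian(x, t, y, eps=1e-6):
    """Finite-difference Jacobian of the residual: J[i, j] = d r_i / d x_j."""
    r0 = residual(x, t, y)
    J = np.zeros((r0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (residual(xp, t, y) - r0) / eps
    return J

x = np.array([2.0, -1.0])
grad_from_jacobian = jacobian(x, t, y).T @ residual(x, t, y)   # J(x)^t r(x)
grad_fd = np.array([(cost(x + 1e-6 * e, t, y) - cost(x, t, y)) / 1e-6
                    for e in np.eye(x.size)])                  # direct d phi / d x_j
print(grad_from_jacobian, grad_fd)   # the two should agree to several digits
```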
Meanwhile, for a convex function $f: \mathbb{R} \to \mathbb{R}$,

$$f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2$$
by Taylor's theorem. This tells us that a well-chosen $h$ can move $x$ in a direction where $f$ is decreasing. To find this value of $h$, define the function $g$ of $h$ and find its critical point.

$$g(h) = \frac{1}{2} f''(x) h^2 + f'(x) h + f(x)$$
$$g'(h) = f''(x) h + f'(x) = 0$$
$$\Longrightarrow h = -\frac{f'(x)}{f''(x)} \qquad (f''(x) > 0)$$
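In code this is just the one-dimensional Newton iteration. A minimal sketch for the hypothetical convex function $f(x) = x^2 + e^x$:

```python
import math

# Newton's method in one dimension: repeatedly apply h = -f'(x) / f''(x).
# Hypothetical convex example: f(x) = x**2 + exp(x).
def fp(x):  return 2 * x + math.exp(x)   # f'(x)
def fpp(x): return 2 + math.exp(x)       # f''(x), always positive here

x = 0.0
for _ in range(6):
    x += -fp(x) / fpp(x)    # the critical point of the quadratic model g(h)
print(x, fp(x))             # f'(x) should be near zero at the minimizer
```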
In other words, $x$ should be moved by $h$ to decrease $f$. In higher dimensions, the minimum of $f$ can be found in the same manner. Suppose that $f$ is a function of $x \in \mathbb{R}^n$ and its Hessian matrix $H_f$ is positive definite. By Taylor's theorem,

$$f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s$$
Following the same process as in one dimension, $x$ should be moved by $s$, where $s$ is the solution of

$$H_f(x) s = -\nabla f(x)$$
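A sketch of the resulting Newton iteration in $\mathbb{R}^n$, using a hypothetical function $f(x) = x_1^2 + 2x_2^2 + e^{x_1 + x_2}$ whose Hessian is positive definite everywhere:

```python
import numpy as np

# Newton step in R^n: solve H_f(x) s = -grad f(x), then move x by s.
# Hypothetical example: f(x) = x1^2 + 2*x2^2 + exp(x1 + x2).
def grad_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([2 * x[0] + e, 4 * x[1] + e])

def hess_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([[2 + e, e], [e, 4 + e]])   # positive definite for all x

x = np.zeros(2)
for _ in range(8):
    s = np.linalg.solve(hess_f(x), -grad_f(x))
    x = x + s
print(x, grad_f(x))   # the gradient approaches zero
```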
Returning to the original problem, $\phi(x)$ can be minimized by finding $s$ such that $H_\phi(x) s = -\nabla \phi(x)$. More specifically, this linear system is as follows:

$$H_\phi(x) s = -\nabla \phi(x) \iff \left( J(x)^t J(x) + \sum_{i=1}^{m} r_i(x) H_{r_i}(x) \right) s = -J(x)^t r(x)$$
This linear system can be solved by several well-known methods.
1. Gauss-Newton Method
This method assumes that $\sum_{i=1}^{m} r_i(x) H_{r_i}(x)$ is negligible when the residuals are small. This happens when the model function $f$ fits the data well, so that $r_i(x) = y_i - f(t_i, x)$ is small. In this case,

$$J(x)^t J(x) s \approx -J(x)^t r(x)$$
This recalls the normal equations, so the problem can be viewed as solving the linear least squares problem $J(x) s \cong -r(x)$. However, there are some drawbacks. The iteration may fail to converge when it is not started close enough to the solution. Moreover, if $\sum_{i=1}^{m} r_i(x) H_{r_i}(x)$ is not small, it cannot be neglected. Finally, the solution can hardly be found when $H_\phi(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned.
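A minimal Gauss-Newton sketch, reusing the hypothetical `residual`, `cost`, and `jacobian` helpers from the earlier snippets. Each iteration solves $J(x)s \cong -r(x)$ by linear least squares instead of forming the normal equations explicitly; there is no line search or convergence test, so this is for illustration only.

```python
import numpy as np

def gauss_newton(x0, t, y, iters=20):
    """Plain Gauss-Newton: x <- x + s with J(x) s ≅ -r(x) at each step."""
    x = x0.copy()
    for _ in range(iters):
        r = residual(x, t, y)
        J = jacobian(x, t, y)
        s, *_ = np.linalg.lstsq(J, -r, rcond=None)   # avoids forming J^t J explicitly
        x = x + s
    return x

x_fit = gauss_newton(np.array([1.0, -0.5]), t, y)
print(x_fit, cost(x_fit, t, y))
```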
2. Levenberg-Marquardt Method
This method is useful even when $H_\phi(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned or rank-deficient. It does not ignore the term $\sum_{i=1}^{m} r_i(x) H_{r_i}(x)$, but replaces it with $\mu I$ for some $\mu \in \mathbb{R}$.

$$\left( J(x)^t J(x) + \mu I \right) s \approx -J(x)^t r(x)$$
This technique, which is called regularization, makes the pseudo-$H_f(x)$ positive definite. Its effect is easier to see in the one-dimensional case. The step $h$ decreases $f(x + h)$ when the second derivative is positive, and this property corresponds to the positive definiteness of the Hessian matrix in higher dimensions. If the second derivative of $f$ is not greater than zero, it can be biased by $\mu$ as follows:

$$\begin{cases} f''(x) > 0 \Longrightarrow h = -\dfrac{f'(x)}{f''(x)} \\[2ex] f''(x) \le 0 \Longrightarrow h = -\dfrac{f'(x)}{f''(x) + \mu}, \quad \text{where } f''(x) + \mu > 0 \end{cases}$$
Considering this process and applying it to higher dimensions in the same manner, the following correspondence is induced.

$$f''(x) + \mu > 0 \quad \text{where } f''(x) \le 0$$
$$\Updownarrow$$
$$J(x)^t J(x) + \mu I \text{ is positive definite} \quad \text{where } H_\phi(x) \left( \approx J(x)^t J(x) \right) \text{ is NOT positive definite}$$
Therefore, it can be said that the regularization term $\mu I$ makes the pseudo-$H_f(x)$, which is $J(x)^t J(x)$, positive definite. Moreover, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods.

$$\begin{cases} J(x)^t J(x) s \approx -J(x)^t r(x) & (\mu I = O): \text{ Gauss-Newton} \\ s \approx -J(x)^t r(x) = -\nabla \phi(x) & (\mu I = I - J(x)^t J(x)): \text{ Gradient Descent} \end{cases}$$
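A sketch of one common Levenberg-Marquardt variant, again reusing the hypothetical helpers above: solve $(J^t J + \mu I)s = -J^t r$, accept the step only if it reduces $\phi$, and adjust $\mu$ accordingly. Real implementations use more careful damping and stopping rules.

```python
import numpy as np

def levenberg_marquardt(x0, t, y, mu=1e-2, iters=50):
    """Simple damped LM: shrink mu after good steps, grow it after bad ones."""
    x = x0.copy()
    for _ in range(iters):
        r, J = residual(x, t, y), jacobian(x, t, y)
        s = np.linalg.solve(J.T @ J + mu * np.eye(x.size), -J.T @ r)
        if cost(x + s, t, y) < cost(x, t, y):
            x, mu = x + s, mu * 0.5   # good step: accept, lean toward Gauss-Newton
        else:
            mu *= 2.0                  # bad step: reject, lean toward gradient descent
    return x

print(levenberg_marquardt(np.array([1.0, -0.5]), t, y))
```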
Meanwhile, when $J(x)$ is ill-conditioned, there exists a singular value of $J(x)$ close to zero. Using the singular value decomposition $J(x) = U \Sigma V^t$,

$$J(x)^t J(x) = (U \Sigma V^t)^t (U \Sigma V^t) = V \Sigma^t U^t U \Sigma V^t = V \Sigma^2 V^t$$
$$\Longrightarrow V^t \left( J(x)^t J(x) \right) V = \Sigma^2 = \operatorname{diag}(\sigma_1^2, \cdots, \sigma_n^2)$$
where $\sigma_i$ is the $i$-th singular value of $J(x)$. This means that the eigenvalues of $J(x)^t J(x)$ are the squares of the singular values of $J(x)$. In the case that $J(x)$ is ill-conditioned, for an eigenvalue $\lambda$ close to zero and the eigenvector $v$ corresponding to $\lambda$,

$$J(x)^t J(x) v = \lambda v \Longrightarrow \left( J(x)^t J(x) + \mu I \right) v = \lambda v + \mu v = (\lambda + \mu) v$$
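This eigenvalue shift is easy to verify numerically. The sketch below builds a hypothetical ill-conditioned $J$ with nearly dependent columns and shows that adding $\mu I$ raises the near-zero eigenvalue of $J^t J$ by $\mu$ and sharply reduces the condition number.

```python
import numpy as np

# Hypothetical ill-conditioned J: its two columns are nearly linearly dependent,
# so one singular value (and hence one eigenvalue of J^t J) is close to zero.
J = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-6],
              [1.0, 1.0 - 1e-6]])
mu = 1e-3
A = J.T @ J

print(np.linalg.eigvalsh(A))                    # one eigenvalue close to zero, one near 6
print(np.linalg.eigvalsh(A + mu * np.eye(2)))   # each eigenvalue shifted up by mu
print(np.linalg.cond(A), np.linalg.cond(A + mu * np.eye(2)))   # condition number drops sharply
```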
Note that $J(x)^t J(x) + \mu I$ is symmetric since $J(x)^t J(x)$ is symmetric, and it is also positive definite if $\mu$ is chosen appropriately. Accordingly, the Cholesky factorization can be applied to it as follows:

$$\left( J(x)^t J(x) + \mu I \right) v = (\lambda + \mu) v = \hat{J}(x)^t \hat{J}(x) v$$
where $\hat{J}(x)^t \hat{J}(x)$ is positive definite and all of its eigenvalues are positive. Then the eigenvalue of $\hat{J}(x)^t \hat{J}(x)$ is $\lambda + \mu$, so the singular value of $\hat{J}(x)$ is $\sqrt{\lambda + \mu}$. Therefore, the original singular value of $J(x)$, $\sqrt{\lambda}$, is increased to $\sqrt{\lambda + \mu}$, which moves the ill-conditioned $J(x)$ away from ill-conditioning. The linear least squares problem corresponding to the Levenberg-Marquardt method is

$$\begin{pmatrix} J(x) \\ \sqrt{\mu}\, I \end{pmatrix} s \cong \begin{pmatrix} -r(x) \\ 0 \end{pmatrix}$$
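As a sanity check, the sketch below (with a hypothetical small $J$ and $r$) solves this augmented least squares problem and confirms that its solution matches the step from the damped normal equations $(J^t J + \mu I)s = -J^t r$.

```python
import numpy as np

# The augmented least squares form  [ J; sqrt(mu) I ] s ≅ [ -r; 0 ]
# has exactly the normal equations (J^t J + mu I) s = -J^t r, so both give the same step.
J = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-6],
              [1.0, 1.0 - 1e-6]])   # hypothetical, nearly dependent columns
r = np.array([0.3, -0.1, 0.2])      # hypothetical residual vector
mu = 1e-3
n = J.shape[1]

J_aug = np.vstack([J, np.sqrt(mu) * np.eye(n)])
rhs = np.concatenate([-r, np.zeros(n)])
s_aug, *_ = np.linalg.lstsq(J_aug, rhs, rcond=None)

s_ne = np.linalg.solve(J.T @ J + mu * np.eye(n), -J.T @ r)
print(np.allclose(s_aug, s_ne))   # True: the two formulations agree
```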