Consider the residual function $r_i(x) = y_i - f(t_i, x)$, for $i = 1, \dots, m$,
where $y_i$ is the observed value and $f(t_i, x)$ is the prediction of a model function $f$ with the parameter vector $x \in \mathbb{R}^n$. Given $r(x) = (r_1(x), \dots, r_m(x))^T$, we want to minimize the cost function $\phi: \mathbb{R}^n \to \mathbb{R}$,

$$\phi(x) = \frac{1}{2} r(x)^T r(x).$$
For minimizing $\phi$, the gradient of $\phi$ is required:

$$\nabla \phi(x) = J(x)^T r(x),$$

where $J(x)$ is the Jacobian matrix of $r(x)$. In addition, the Hessian matrix of $\phi$ is as follows:

$$\frac{\partial^2 \phi(x)}{\partial x_j \partial x_k} = \sum_{i=1}^{m} \left( \frac{\partial r_i(x)}{\partial x_j} \frac{\partial r_i(x)}{\partial x_k} + r_i(x) \frac{\partial^2 r_i(x)}{\partial x_j \partial x_k} \right),$$

where $r_i(x)$ is the $i$-th element of $r(x)$. So it can be rewritten as

$$H_\phi(x) = J(x)^T J(x) + \sum_{i=1}^{m} r_i(x) H_{r_i}(x),$$

where $H_{r_i}(x)$ is the Hessian matrix of $r_i$.
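To make these quantities concrete, the following is a minimal NumPy sketch for a hypothetical model $f(t, x) = x_0 e^{x_1 t}$; the model, data, and helper names are illustrative assumptions, not taken from the text above.

```python
import numpy as np

# Hypothetical model f(t, x) = x0 * exp(x1 * t), used only for illustration.

def residual(x, t, y):
    # r_i(x) = y_i - f(t_i, x)
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x, t):
    # J[i, j] = d r_i / d x_j for the model above
    return np.column_stack((-np.exp(x[1] * t),
                            -x[0] * t * np.exp(x[1] * t)))

t = np.linspace(0.0, 1.0, 5)
y = 2.0 * np.exp(0.5 * t)      # synthetic observations
x = np.array([1.5, 0.3])       # current parameter guess

r = residual(x, t, y)
J = jacobian(x, t)
grad = J.T @ r                 # gradient of phi: J(x)^T r(x)
H_first = J.T @ J              # first term of the Hessian of phi
```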
Meanwhile, for a convex function $f: \mathbb{R} \to \mathbb{R}$,

$$f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2$$

by Taylor's theorem. It tells us that a good $h$ can make $x$ move in the direction in which $f$ is decreasing. For finding this value $h$, we can define the function $q(h) = f(x) + f'(x) h + \frac{1}{2} f''(x) h^2$ for a fixed $x$ and find the critical point of $q$:

$$q'(h) = f'(x) + f''(x) h = 0 \iff h = -\frac{f'(x)}{f''(x)}.$$
In other words, $x$ should be moved by $-f'(x)/f''(x)$ for decreasing $f$. In a higher dimension, the minimum of $f$ can be found in the same manner. Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is a function whose Hessian matrix $H_f(x)$ is positive definite. By Taylor's theorem,

$$f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T H_f(x) h.$$
Following the process in one dimension, $x$ should be moved by $h = -H_f(x)^{-1} \nabla f(x)$, the solution of the linear system

$$H_f(x) h = -\nabla f(x).$$
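As a sketch of this step, the snippet below performs one Newton step on an illustrative convex quadratic (an assumed stand-in, not from the text); for a quadratic, a single step lands exactly on the minimizer.

```python
import numpy as np

# One Newton step h = -H_f(x)^{-1} grad f(x) on the assumed quadratic
# f(x) = 1/2 x^T A x - b^T x, whose Hessian A is positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_f(x):
    return A @ x - b

def hess_f(x):
    return A

x = np.array([5.0, -3.0])
h = np.linalg.solve(hess_f(x), -grad_f(x))  # H_f(x) h = -grad f(x)
x = x + h                                   # equals np.linalg.solve(A, b)
```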
Returning to the original problem, $\phi$ can be minimized by finding $h$ such that $H_\phi(x) h = -\nabla \phi(x)$. More specifically, this linear system is as follows:

$$\left( J(x)^T J(x) + \sum_{i=1}^{m} r_i(x) H_{r_i}(x) \right) h = -J(x)^T r(x).$$
This linear system can be solved by the following well-known methods.
1. Gauss-Newton Method
This method assumes that the term $\sum_{i=1}^{m} r_i(x) H_{r_i}(x)$ will be negligible when the residual is small. This is possible when the model function fits the data well so that each $r_i(x)$ can be small. In this case,

$$J(x)^T J(x) h = -J(x)^T r(x).$$
This recalls the normal equations, so the problem can be viewed as solving the linear least squares problem $J(x) h \cong -r(x)$. However, there are some drawbacks. The iteration may fail to converge when it is not started close enough to the real solution. Moreover, if $\sum_{i=1}^{m} r_i(x) H_{r_i}(x)$ is not small, it cannot be neglected. Finally, the solution can hardly be found when $J(x)$, and hence $J(x)^T J(x)$, is ill-conditioned.
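A minimal sketch of such an iteration is below, assuming `residual` and `jacobian` callables like the hypothetical ones sketched earlier; it solves the least squares problem $J(x) h \cong -r(x)$ with `np.linalg.lstsq` rather than forming $J(x)^T J(x)$ explicitly.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, t, y, iters=20, tol=1e-10):
    # Each step solves the linear least squares problem J(x) h ~= -r(x).
    x = x0.copy()
    for _ in range(iters):
        r = residual(x, t, y)
        J = jacobian(x, t)
        h, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + h
        if np.linalg.norm(h) < tol:   # stop once the step is tiny
            break
    return x

# Usage with the earlier hypothetical helpers:
# x_fit = gauss_newton(residual, jacobian, np.array([1.5, 0.3]), t, y)
```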
2. Levenberg-Marquardt Method
This method is useful even when $J(x)$ or $J(x)^T J(x)$ is ill-conditioned or rank-deficient. It does not ignore the term $\sum_{i=1}^{m} r_i(x) H_{r_i}(x)$, but replaces it by $\mu I$ for some $\mu > 0$:

$$\left( J(x)^T J(x) + \mu I \right) h = -J(x)^T r(x).$$
This technique, which is called regularization, makes the pseudo-Hessian $J(x)^T J(x) + \mu I$ positive definite. Its characteristic is more obvious when it is considered in one dimension: the Newton step decreases $f$ when the second-order derivative $f''(x)$ is positive, and this property corresponds to the positive definiteness of the Hessian matrix of a function in a higher dimension. If the second-order derivative of $f$ is not greater than zero, it can be biased by $\mu > 0$ as follows:

$$h = -\frac{f'(x)}{f''(x) + \mu}, \qquad f''(x) + \mu > 0.$$
Considering this process and applying it to the higher dimension in the same manner, the following property is induced: for any $h \neq 0$,

$$h^T \left( J(x)^T J(x) + \mu I \right) h = \| J(x) h \|_2^2 + \mu \| h \|_2^2 > 0.$$
Therefore, it can be said that the regularization term $\mu I$ makes the pseudo-Hessian, which is $J(x)^T J(x) + \mu I$, positive definite. Moreover, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods: as $\mu \to 0$ the step approaches the Gauss-Newton step, while for large $\mu$ it approaches a short step along the negative gradient direction $-J(x)^T r(x)$.
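A minimal Levenberg-Marquardt loop under the same assumptions is sketched below, with a simple accept/reject rule for $\mu$; practical implementations adjust $\mu$ more carefully.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, t, y, mu=1e-2, iters=50):
    # Each step solves (J^T J + mu I) h = -J^T r.  Shrinking mu moves the
    # step toward Gauss-Newton; growing it moves toward gradient descent.
    x = x0.copy()
    r = residual(x, t, y)
    cost = 0.5 * r @ r
    for _ in range(iters):
        J = jacobian(x, t)
        h = np.linalg.solve(J.T @ J + mu * np.eye(x.size), -J.T @ r)
        r_new = residual(x + h, t, y)
        new_cost = 0.5 * r_new @ r_new
        if new_cost < cost:            # accept the step, trust the model more
            x, r, cost = x + h, r_new, new_cost
            mu *= 0.5
        else:                          # reject the step, regularize harder
            mu *= 2.0
    return x
```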
Meanwhile, there exists a singular value of $J(x)$ close to zero when $J(x)$ is ill-conditioned. Using the fact that $J(x)^T J(x)$ is diagonalizable, with the singular value decomposition $J(x) = U \Sigma V^T$,

$$J(x)^T J(x) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T = V \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2) V^T,$$

where $\sigma_i$ is the $i$-th singular value of $J(x)$. It means that the eigenvalues of $J(x)^T J(x)$ are the squares of the singular values of $J(x)$. In case that $J(x)$ is ill-conditioned, for the eigenvalue $\lambda$ which is close to zero and the eigenvector $v$ corresponding to $\lambda$,

$$\left( J(x)^T J(x) + \mu I \right) v = \lambda v + \mu v = (\lambda + \mu) v,$$

so the eigenvalue near zero is shifted away from zero by $\mu$.
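This eigenvalue shift is easy to check numerically; in the sketch below a random matrix stands in for $J(x)$.

```python
import numpy as np

# Check: eigenvalues of J^T J are the squared singular values of J,
# and adding mu * I shifts every eigenvalue up by mu.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 3))
mu = 0.1

sigma = np.linalg.svd(J, compute_uv=False)              # descending order
eig = np.sort(np.linalg.eigvalsh(J.T @ J))[::-1]        # match the order
print(np.allclose(eig, sigma**2))                       # True
eig_reg = np.sort(np.linalg.eigvalsh(J.T @ J + mu * np.eye(3)))[::-1]
print(np.allclose(eig_reg, sigma**2 + mu))              # True
```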
Note that $J(x)^T J(x) + \mu I$ is symmetric since $J(x)^T J(x)$ is symmetric, and it is also positive definite if $\mu$ is selected well. Accordingly, the Cholesky factorization can be applied to it as follows:

$$J(x)^T J(x) + \mu I = L L^T,$$

where $L$ is lower triangular and all its eigenvalues, its diagonal entries, are positive. Then the eigenvalues of $L L^T$ are $\sigma_i^2 + \mu$, so the singular values of $L$ are $\sqrt{\sigma_i^2 + \mu}$. Therefore, the original singular value $\sigma_i$ of $J(x)$ is increased to $\sqrt{\sigma_i^2 + \mu}$, and this helps the ill-conditioned $J(x)$ move away from that condition. The corresponding linear least squares problem to the Levenberg-Marquardt method is

$$\begin{pmatrix} J(x) \\ \sqrt{\mu} I \end{pmatrix} h \cong \begin{pmatrix} -r(x) \\ 0 \end{pmatrix}.$$
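The equivalence between this augmented least squares problem and the regularized normal equations can be checked numerically as well; the matrix and residual below are random stand-ins.

```python
import numpy as np

# Stacking sqrt(mu) * I under J reproduces the Levenberg-Marquardt step
# (J^T J + mu I) h = -J^T r without forming J^T J explicitly.
rng = np.random.default_rng(1)
J = rng.standard_normal((6, 3))
r = rng.standard_normal(6)
mu = 0.1

J_aug = np.vstack((J, np.sqrt(mu) * np.eye(3)))
rhs = np.concatenate((-r, np.zeros(3)))
h_lstsq, *_ = np.linalg.lstsq(J_aug, rhs, rcond=None)
h_normal = np.linalg.solve(J.T @ J + mu * np.eye(3), -J.T @ r)
print(np.allclose(h_lstsq, h_normal))                   # True
```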
Michael T. Heath, Scientific Computing: An Introductory Survey, 2nd Edition, McGraw-Hill Higher Education.