# Nonlinear Least Squares

Consider the residual function $r: \mathbb{R}^n \to \mathbb{R}^m$ for $x \in \mathbb{R}^n$, \begin{aligned} r_i(x) = y_i - f(t_i, x), \quad (i = 1, 2, \cdots, m) \end{aligned}

where $y_i$ is the observed value at the data point $t_i$, and $f: \mathbb{R}^{n+1} \to \mathbb{R}$ is the model function with parameters $x$. Given $r$, we want to minimize the cost function $\phi$, \begin{aligned} \phi (x) &= \frac{1}{2} r(x)^t r(x) = \frac{1}{2} \left(\begin{array}{ccc} r_1(x) & \cdots & r_m(x) \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) \\ &= \frac{1}{2} (r_1^2(x) + \cdots + r_m^2(x)) \end{aligned}
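As a quick sanity check, the residual and cost can be sketched in Python with a hypothetical exponential model $f(t, x) = x_0 e^{x_1 t}$ (the model, data, and names here are illustrative, not from the text):

```python
import numpy as np

# Hypothetical model f(t, x) = x[0] * exp(x[1] * t) with parameters x in R^2.
def f(t, x):
    return x[0] * np.exp(x[1] * t)

def residual(x, t, y):
    # r_i(x) = y_i - f(t_i, x)
    return y - f(t, x)

def phi(x, t, y):
    # phi(x) = 1/2 * r(x)^t r(x)
    r = residual(x, t, y)
    return 0.5 * r @ r

t = np.linspace(0.0, 1.0, 5)
x_true = np.array([2.0, -1.0])
y = f(t, x_true)            # noiseless data, so phi(x_true) = 0
print(phi(x_true, t, y))    # 0.0
```

With noiseless data the cost vanishes exactly at the true parameters and is positive elsewhere.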

To minimize $\phi$, its gradient is required. \begin{aligned} \frac{\partial \phi}{\partial x_i} &= r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} \\ \\ \implies \nabla \phi (x) &= \left(\begin{array}{c} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{array}\right) = \left(\begin{array}{c} r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \\ \vdots \\ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \\ \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) = \color{limegreen}{J(x)^t r(x)} \end{aligned}
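The identity $\nabla \phi(x) = J(x)^t r(x)$ can be verified numerically. This sketch uses the same hypothetical exponential model as above and compares the analytic gradient against central finite differences:

```python
import numpy as np

# Hypothetical model r_i(x) = y_i - x[0]*exp(x[1]*t_i); names are illustrative.
def residual(x, t, y):
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x, t):
    # J_ij = dr_i/dx_j: dr_i/dx0 = -exp(x1*t_i), dr_i/dx1 = -x0*t_i*exp(x1*t_i)
    e = np.exp(x[1] * t)
    return np.column_stack([-e, -x[0] * t * e])

def grad_phi(x, t, y):
    # grad phi(x) = J(x)^t r(x)
    return jacobian(x, t).T @ residual(x, t, y)

t = np.linspace(0.0, 1.0, 6)
y = 2.0 * np.exp(-1.0 * t) + 0.05
x = np.array([1.5, -0.5])

def phi(x):
    r = residual(x, t, y)
    return 0.5 * r @ r

# central finite-difference gradient
h = 1e-6
fd = np.array([(phi(x + h * np.eye(2)[j]) - phi(x - h * np.eye(2)[j])) / (2 * h)
               for j in range(2)])
print(np.allclose(grad_phi(x, t, y), fd, atol=1e-5))  # True
```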

where $J(x)$ is the Jacobian matrix of $r(x)$. In addition, the Hessian matrix $H_{\phi}(x)$ of $\phi (x)$ is as follows: \begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial^2 \phi}{\partial x_1^2} & \cdots & \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ & \vdots & \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] \\ & \vdots & \\ \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] \end{array}\right) \\ \\ \implies (H_{\phi})_{ij} &= \frac{\partial}{\partial \color{orange}x_j} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_i} \right] \\ \\ &= \left( \frac{\partial r_1}{\partial \color{orange}x_j} \frac{\partial r_1}{\partial \color{limegreen}x_i} + r_1(x)\frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) + \cdots + \left( \frac{\partial r_m}{\partial \color{orange}x_j} \frac{\partial r_m}{\partial \color{limegreen}x_i} + r_m(x)\frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) \end{aligned}

where $(H_{\phi})_{ij}$ is the $(i, j)$ element of $H_{\phi}$. So it can be rewritten as \begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_n} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_n} \end{array}\right) \\ \\ &+ \left(\begin{array}{ccc} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ & \vdots & \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\ & \vdots & \\ \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) + \sum_{i=1}^m r_i(x) \left(\begin{array}{ccc} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ & \vdots & \\ 
\frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \color{limegreen}{J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x)} \end{aligned}
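The decomposition $H_{\phi}(x) = J(x)^t J(x) + \sum_i r_i(x) H_{r_i}(x)$ can also be checked numerically. This sketch assembles both terms for the hypothetical model $r_i(x) = y_i - x_0 e^{x_1 t_i}$ (illustrative, not from the text) and compares against a finite-difference Hessian:

```python
import numpy as np

def hessian_phi(x, t, y):
    e = np.exp(x[1] * t)
    r = y - x[0] * e
    J = np.column_stack([-e, -x[0] * t * e])      # Jacobian of r
    H = J.T @ J                                   # first (Gauss-Newton) term
    for ri, ti, ei in zip(r, t, e):
        # Hessian of r_i(x) = y_i - x0*exp(x1*t_i)
        Hri = np.array([[0.0,      -ti * ei],
                        [-ti * ei, -x[0] * ti**2 * ei]])
        H += ri * Hri                             # second-order correction
    return H

def phi(x, t, y):
    r = y - x[0] * np.exp(x[1] * t)
    return 0.5 * r @ r

t = np.linspace(0.0, 1.0, 6)
y = 2.0 * np.exp(-1.0 * t) + 0.1
x = np.array([1.5, -0.5])

# second-order central-difference Hessian
h = 1e-4
H_fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i], np.eye(2)[j]
        H_fd[i, j] = (phi(x + h*ei + h*ej, t, y) - phi(x + h*ei - h*ej, t, y)
                      - phi(x - h*ei + h*ej, t, y) + phi(x - h*ei - h*ej, t, y)) / (4*h*h)
print(np.allclose(hessian_phi(x, t, y), H_fd, atol=1e-5))  # True
```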

Meanwhile, for a twice-differentiable function $f: \mathbb{R} \to \mathbb{R}$, \begin{aligned} f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 \end{aligned}

by Taylor's theorem. This suggests that a well-chosen $h$ moves $x$ in a direction along which $f$ decreases. To find such an $h$, we can view the approximation as a function $g$ of $h$ and find its critical point. \begin{aligned} g(h) &= h^2 \left( \frac{1}{2} f''(x) \right) + hf'(x) + f(x) \\ g'(h) &= h f''(x) + f'(x) = 0 \\ h &= - \frac{f'(x)}{f''(x)} \quad (f''(x) > 0) \end{aligned}
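A minimal sketch of this one-dimensional Newton step, using an illustrative quadratic $f(x) = (x - 3)^2 + 1$ whose minimum is at $x = 3$:

```python
# Newton step h = -f'(x)/f''(x) for the sketch function f(x) = (x - 3)^2 + 1.
def fp(x):   # f'(x) = 2(x - 3)
    return 2.0 * (x - 3.0)

def fpp(x):  # f''(x) = 2 > 0
    return 2.0

x = 0.0
h = -fp(x) / fpp(x)   # Newton step
x = x + h
print(x)  # 3.0 -- a quadratic is minimized in a single step
```

Because the quadratic Taylor model is exact here, one step lands on the minimizer.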

In other words, $x$ should be moved by $h$ to decrease $f$. In higher dimensions, the minimum of $f$ can be found in the same manner. Suppose that $f$ is a function of $x \in \mathbb{R}^n$ and its Hessian matrix $H_f$ is positive definite. By Taylor's theorem, \begin{aligned} f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s \end{aligned}

Following the one-dimensional process, $x$ should be moved by $s$, where $s$ is the solution of \begin{aligned} H_f(x) s = - \nabla f(x) \end{aligned}
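This Newton system can be sketched on an illustrative quadratic $f(x) = \frac{1}{2} x^t A x - b^t x$, for which $H_f = A$ and $\nabla f(x) = Ax - b$, so one step reaches the minimizer $A^{-1} b$ from any starting point:

```python
import numpy as np

# Sketch: Newton step H_f(x) s = -grad f(x) on a quadratic (assumed A, b below).
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # positive definite Hessian
b = np.array([1.0, 2.0])

x = np.zeros(2)
grad = A @ x - b                          # grad f(x) = A x - b
s = np.linalg.solve(A, -grad)             # solve H_f(x) s = -grad f(x)
x = x + s
print(np.allclose(x, np.linalg.solve(A, b)))  # True: lands on the minimizer
```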

Returning to the original problem, $\phi(x)$ can be minimized by finding $s$ such that $H_{\phi}(x)s = - \nabla \phi(x)$. More specifically, this linear system is as follows: \begin{aligned} H_{\phi}(x) s = - \nabla \phi(x) \iff \left( J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x) \right) s = - J(x)^t r(x) \end{aligned}

This linear system can be solved by several well-known methods.

## 1. Gauss-Newton Method

This method assumes that $\sum_{i=1}^m r_i(x) H_{r_i}(x)$ is negligible when the residuals are small. This holds when the model function $f$ fits the data well, so that $r_i(x) = y_i - f(t_i, x)$ is small. In this case, \begin{aligned} J(x)^t J(x) s \approx - J(x)^t r(x) \end{aligned}
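A minimal Gauss-Newton iteration for the hypothetical exponential model used in the earlier sketches (the model, starting point, and iteration count are illustrative assumptions). Rather than forming $J^t J$ explicitly, each step solves $J(x)s \cong -r(x)$ in the least squares sense:

```python
import numpy as np

def gauss_newton(t, y, x0, iters=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        e = np.exp(x[1] * t)
        r = y - x[0] * e                          # residual
        J = np.column_stack([-e, -x[0] * t * e])  # Jacobian of r
        # solve J s = -r in the least-squares sense
        s, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + s
    return x

t = np.linspace(0.0, 1.0, 20)
x_true = np.array([2.0, -1.0])
y = x_true[0] * np.exp(x_true[1] * t)     # zero-residual problem
x_hat = gauss_newton(t, y, [1.0, 0.0])
print(np.allclose(x_hat, x_true, atol=1e-6))
```

On this zero-residual problem the neglected term vanishes at the solution, which is exactly the regime where Gauss-Newton works best.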

This recalls the normal equations, so the problem can be viewed as solving the linear least squares problem $J(x)s \cong -r(x)$. However, there are some drawbacks. The iteration may fail to converge when it is not started close enough to the solution. Moreover, if $\sum_{i=1}^m r_i(x) H_{r_i}(x)$ is not small, it cannot be neglected. Finally, the solution is hard to find when $H_{\phi}(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned.

## 2. Levenberg-Marquardt Method

This method is useful even when $H_{\phi}(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned or rank-deficient. It does not ignore the term $\sum_{i=1}^m r_i(x) H_{r_i}(x)$, but replaces it with $\mu I$ for $\mu > 0$. \begin{aligned} \left( J(x)^t J(x) + \mu I \right) s \approx - J(x)^t r(x) \end{aligned}
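A single Levenberg-Marquardt step for the same hypothetical exponential model (the choice of $\mu$ and the starting point are illustrative). For any $\mu > 0$ the damped matrix $J^t J + \mu I$ is positive definite, so the solve is always well-posed:

```python
import numpy as np

def lm_step(x, t, y, mu):
    e = np.exp(x[1] * t)
    r = y - x[0] * e                          # residual
    J = np.column_stack([-e, -x[0] * t * e])  # Jacobian of r
    A = J.T @ J + mu * np.eye(J.shape[1])     # damped normal matrix
    return np.linalg.solve(A, -J.T @ r)       # (J^t J + mu I) s = -J^t r

t = np.linspace(0.0, 1.0, 10)
y = 2.0 * np.exp(-1.0 * t)
x = np.array([1.0, 0.0])
s = lm_step(x, t, y, mu=1e-2)

# the damped matrix is positive definite for mu > 0
e = np.exp(x[1] * t)
J = np.column_stack([-e, -x[0] * t * e])
print(np.all(np.linalg.eigvalsh(J.T @ J + 1e-2 * np.eye(2)) > 0))  # True
```

In a full solver, $\mu$ is adapted between iterations: increased when a step fails to decrease $\phi$, decreased when it succeeds.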

This technique, called regularization, makes the pseudo-Hessian positive definite. Its effect is clearer in one dimension: the Newton step decreases $f(x + h)$ when the second derivative is positive, and this property corresponds to positive definiteness of the Hessian matrix in higher dimensions. If the second derivative of $f$ is not positive, it can be biased by $\mu$ as follows: \begin{aligned} \begin{cases} f''(x) > 0 \implies h = - \frac{f'(x)}{f''(x)} \\ \\ f''(x) \le 0 \implies h = - \frac{f'(x)}{f''(x) + \mu} \quad \text{where } f''(x) + \mu > 0 \end{cases} \end{aligned}

Carrying this process over to higher dimensions in the same manner, the following correspondence is induced. $\begin{gather} f''(x) + \mu > 0 \quad \text{where } f''(x) \le 0 \\ \Updownarrow \\ J(x)^t J(x) + \mu I \text{ is positive definite} \\ \text{where } H_{\phi}(x) \left( \approx J(x)^t J(x) \right) \text{ is NOT positive definite} \end{gather}$

Therefore, the regularization term $\mu I$ makes the pseudo-Hessian $J(x)^t J(x)$ positive definite. Moreover, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods. \begin{aligned} \begin{cases} J(x)^t J(x) s \approx - J(x)^t r(x) \quad (\mu I = O) \quad \text{: Gauss-Newton} \\ \\ s \approx - J(x)^t r(x) = - \nabla \phi(x) \quad (\mu I = I - J(x)^t J(x)) \quad \text{: Gradient Descent} \end{cases} \end{aligned}

Meanwhile, when $J(x)$ is ill-conditioned, it has a singular value close to zero. Using the singular value decomposition $J(x) = U \Sigma V^t$, \begin{aligned} J(x)^t J(x) &= (U \Sigma V^t)^t (U \Sigma V^t) = V \Sigma^t U^t U \Sigma V^t = V \Sigma^2 V^t \\ \implies V^t (J(x)^t J(x)) V &= \Sigma^2 = \mathrm{diag}(\sigma_1^2, \cdots, \sigma_n^2) \end{aligned}

where $\sigma_i$ is the $i$-th singular value of $J(x)$. It means that the eigenvalues of $J(x)^t J(x)$ are the squares of the singular values of $J(x)$. When $J(x)$ is ill-conditioned, for an eigenvalue $\lambda$ close to zero and its corresponding eigenvector $v$, \begin{aligned} J(x)^t J(x)v &= \lambda v \implies (J(x)^t J(x) + \mu I) v = \lambda v + \mu v = (\lambda + \mu)v \end{aligned}
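This eigenvalue shift is easy to demonstrate numerically. The sketch below uses an illustrative nearly singular symmetric matrix standing in for $J(x)^t J(x)$:

```python
import numpy as np

# Adding mu*I shifts every eigenvalue of a symmetric matrix up by mu,
# pushing near-zero eigenvalues (the source of ill-conditioning) away from zero.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])   # nearly singular stand-in for J^t J
mu = 1e-2

lam = np.linalg.eigvalsh(A)                       # original eigenvalues
lam_shifted = np.linalg.eigvalsh(A + mu * np.eye(2))
print(np.allclose(lam_shifted, lam + mu))  # True
```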

Note that $J(x)^t J(x) + \mu I$ is symmetric since $J(x)^t J(x)$ is symmetric, and it is also positive definite if $\mu$ is chosen well. Accordingly, the Cholesky factorization $J(x)^t J(x) + \mu I = \hat{J}(x)^t \hat{J}(x)$ can be applied, so that \begin{aligned} (J(x)^t J(x) + \mu I) v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x) v \end{aligned}

where $\hat{J}(x)^t \hat{J}(x)$ is positive definite and all its eigenvalues are positive. Then $\lambda + \mu$ is an eigenvalue of $\hat{J}(x)^t \hat{J}(x)$, so $\sqrt{\lambda + \mu}$ is a singular value of $\hat{J}(x)$. Therefore, the original singular value $\sqrt{\lambda}$ of $J(x)$ is increased to $\sqrt{\lambda + \mu}$, which moves the ill-conditioned $J(x)$ away from singularity. The linear least squares problem corresponding to the Levenberg-Marquardt method is \begin{aligned} \left(\begin{array}{c} J(x) \\ \sqrt{\mu} \ I \end{array}\right) s \cong \left(\begin{array}{c} -r(x) \\ 0 \end{array}\right) \end{aligned}
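The equivalence between this augmented least squares problem and the damped normal equations can be checked directly (the random $J$, $r$, and $\mu$ below are illustrative):

```python
import numpy as np

# The normal equations of the augmented system [J; sqrt(mu) I] s ~= [-r; 0]
# are exactly (J^t J + mu I) s = -J^t r, so both routes give the same step.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 3))
r = rng.standard_normal(8)
mu = 0.1

# damped normal equations
s_normal = np.linalg.solve(J.T @ J + mu * np.eye(3), -J.T @ r)

# augmented least squares problem
J_aug = np.vstack([J, np.sqrt(mu) * np.eye(3)])
rhs = np.concatenate([-r, np.zeros(3)])
s_lsq, *_ = np.linalg.lstsq(J_aug, rhs, rcond=None)

print(np.allclose(s_normal, s_lsq))  # True
```

The augmented formulation is usually preferred in practice: solving it by QR avoids forming $J^t J$, whose condition number is the square of that of $J$.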

## Reference

Michael T. Heath, *Scientific Computing: An Introductory Survey*, 2nd Edition, McGraw-Hill Higher Education.