# Nonlinear Least Squares

### Overview

Least squares data fitting can be viewed as an optimization problem. Given data points $(t_i, y_i)$ for $i = 1, \cdots, m$, the goal is to find the vector $x \in \mathbb{R}^n$ of parameters that gives the best fit in the least squares sense to the model function $f(t, x)$, where $f: \mathbb{R}^{n+1} \to \mathbb{R}$. Note that $t$ is the independent variable and $x$ is the vector of parameters to be determined. If $f$ is linear in $x$, an example could be as follows. \begin{aligned} f(t, x) = x_1 + x_2 t + x_3 t^2 + \cdots + x_n t^{n-1} \end{aligned}

If $f$ is nonlinear in $x$, an example could be as follows. \begin{aligned} f(t, x) = x_1 e^{x_2 t} + \cdots + x_{n-1} e^{x_n t} \end{aligned}
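As a small illustration, here is a sketch of the two example model families above; the function names and the evaluation points are assumptions for demonstration only.

```python
import math

def f_linear(t, x):
    # Polynomial model x1 + x2*t + ... + xn*t^(n-1): linear in the parameters x.
    return sum(xj * t**j for j, xj in enumerate(x))

def f_nonlinear(t, x):
    # Sum-of-exponentials model x1*exp(x2*t) + x3*exp(x4*t) + ...:
    # nonlinear in the parameters x (coefficient/rate pairs).
    return sum(x[2*k] * math.exp(x[2*k + 1] * t) for k in range(len(x) // 2))

print(f_linear(2.0, [1.0, 2.0, 3.0]))           # 1 + 2*2 + 3*4 = 17.0
print(f_nonlinear(0.0, [2.0, 5.0, 3.0, -1.0]))  # 2*e^0 + 3*e^0 = 5.0
```

Both models take the same data $(t_i, y_i)$; only the dependence on $x$ differs, which is what separates linear from nonlinear least squares.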

### Derivation

First, define the residual function $r: \mathbb{R}^n \to \mathbb{R}^m$, \begin{aligned} r_i(x) = y_i - f(t_i, x) \quad (i = 1, 2, \cdots, m) \end{aligned}

where $y_i$ is the observed value and $f(t_i, x)$ is the value predicted by the model $f: \mathbb{R}^{n+1} \to \mathbb{R}$. This residual function leads to the minimization of the cost function $\phi$ defined by \begin{aligned} \phi (x) &= \frac{1}{2} r(x)^t r(x) = \frac{1}{2} \left(\begin{array}{ccc} r_1(x) & \cdots & r_m(x) \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) \\ &= \frac{1}{2} (r_1^2(x) + \cdots + r_m^2(x)) \end{aligned}

Then, the gradient of $\phi$ is obtained as follows. \begin{aligned} \frac{\partial \phi}{\partial x_i} &= r_1(x) \frac{\partial r_1}{\partial x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial x_i} \\ \\ \implies \nabla \phi (x) &= \left(\begin{array}{c} \frac{\partial \phi}{\partial x_1} \\ \vdots \\ \frac{\partial \phi}{\partial x_n} \end{array}\right) = \left(\begin{array}{c} r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \\ \vdots \\ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_i} & \cdots & \frac{\partial r_m}{\partial x_i} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \\ \end{array}\right) \left(\begin{array}{c} r_1(x) \\ \vdots \\ r_m(x) \end{array}\right) = \color{limegreen}{J(x)^t r(x)} \end{aligned}

where $J(x)$ is the Jacobian matrix of $r(x)$. In addition, the Hessian matrix $H_{\phi}(x)$ of $\phi (x)$ is as follows: \begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial^2 \phi}{\partial x_1^2} & \cdots & \frac{\partial^2 \phi}{\partial x_1 \partial x_n} \\ & \vdots & \\ \frac{\partial^2 \phi}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 \phi}{\partial x_n^2} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_1} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_1} \right] \\ & \vdots & \\ \frac{\partial}{\partial \color{orange}x_1} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] & \cdots & \frac{\partial}{\partial \color{orange}x_n} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_n} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_n} \right] \end{array}\right) \\ \\ \implies (H_{\phi})_{ij} &= \frac{\partial}{\partial \color{orange}x_j} \left[ r_1(x) \frac{\partial r_1}{\partial \color{limegreen}x_i} + \cdots + r_m(x) \frac{\partial r_m}{\partial \color{limegreen}x_i} \right] \\ \\ &= \left( \frac{\partial r_1}{\partial \color{orange}x_j} \frac{\partial r_1}{\partial \color{limegreen}x_i} + r_1(x)\frac{\partial^2 r_1}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) + \cdots + \left( \frac{\partial r_m}{\partial \color{orange}x_j} \frac{\partial r_m}{\partial \color{limegreen}x_i} + r_m(x)\frac{\partial^2 r_m}{\partial \color{limegreen}{x_i} \color{black}\partial \color{orange}{x_j}} \right) \end{aligned}

where $(H_{\phi})_{ij}$ is the $(i, j)$ element of $H_{\phi}$. So it can be rewritten as \begin{aligned} H_{\phi}(x) &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_1} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_1} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_1} \frac{\partial r_m}{\partial x_n} & \cdots & \frac{\partial r_1}{\partial x_n} \frac{\partial r_1}{\partial x_n} + \cdots + \frac{\partial r_m}{\partial x_n} \frac{\partial r_m}{\partial x_n} \end{array}\right) \\ \\ &+ \left(\begin{array}{ccc} r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_1 \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_1 \partial x_n} \\ & \vdots & \\ r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_1} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_1} & \cdots & r_1(x)\frac{\partial^2 r_1}{\partial x_n \partial x_n} + \cdots + r_m(x)\frac{\partial^2 r_m}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_1} \\ & \vdots & \\ \frac{\partial r_1}{\partial x_n} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) \left(\begin{array}{ccc} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_n} \\ & \vdots & \\ \frac{\partial r_m}{\partial x_1} & \cdots & \frac{\partial r_m}{\partial x_n} \end{array}\right) + \sum_{i=1}^m r_i(x) \left(\begin{array}{ccc} \frac{\partial^2 r_i}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_1 \partial x_n} \\ & \vdots & \\ 
\frac{\partial^2 r_i}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 r_i}{\partial x_n \partial x_n} \end{array}\right) \\ \\ &= \color{limegreen}{J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x)} \end{aligned}
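As a numerical sanity check of the identity $\nabla \phi(x) = J(x)^t r(x)$, the sketch below uses a made-up exponential model $f(t, x) = x_1 e^{x_2 t}$ and invented data, and compares the analytic gradient against central finite differences of $\phi$.

```python
import numpy as np

# Illustrative model f(t, x) = x1 * exp(x2 * t) with made-up data.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.7, 0.3])

def r(x):
    return y - x[0] * np.exp(x[1] * t)  # residuals r_i = y_i - f(t_i, x)

def phi(x):
    return 0.5 * r(x) @ r(x)            # phi = (1/2) r^t r

def J(x):
    # Jacobian of r: dr_i/dx1 = -exp(x2 t_i), dr_i/dx2 = -x1 t_i exp(x2 t_i)
    e = np.exp(x[1] * t)
    return np.column_stack([-e, -x[0] * t * e])

x = np.array([1.5, -0.5])
analytic = J(x).T @ r(x)                # the claimed gradient J^t r

# Central finite differences of phi as an independent check.
h = 1e-6
numeric = np.array([(phi(x + h * np.eye(2)[i]) - phi(x - h * np.eye(2)[i])) / (2 * h)
                    for i in range(2)])
print(np.max(np.abs(analytic - numeric)))   # should be tiny
```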

Before using this gradient and Hessian matrix, it helps to build intuition by first looking at $f$ in a lower dimension. For a moment, let $f$ be a convex function where $f: \mathbb{R} \to \mathbb{R}$. Then, by Taylor’s theorem, \begin{aligned} f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 \end{aligned}

This implies that a well-chosen step $h$ moves $x$ in a direction along which $f$ decreases. To find such an $h$, view this approximation as a function $g$ of $h$ and search for its critical point. \begin{aligned} g(h) &= h^2 \left( \frac{1}{2} f''(x) \right) + hf'(x) + f(x) \\ g'(h) &= h f''(x) + f'(x) = 0 \\ h &= - \frac{f'(x)}{f''(x)} \quad (f''(x) > 0) \end{aligned}
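The step $h = -f'(x)/f''(x)$ can be sketched directly. On a convex quadratic (chosen here purely as an illustration) a single Newton step lands exactly on the minimizer.

```python
# One Newton step h = -f'(x)/f''(x). For f(x) = (x - 3)^2,
# f'(x) = 2(x - 3) and f''(x) = 2, so any start reaches the minimizer at once.
def newton_step(x, df, d2f):
    return -df(x) / d2f(x)

x = 10.0
h = newton_step(x, lambda x: 2 * (x - 3), lambda x: 2.0)
print(x + h)   # 3.0, the minimizer
```

For non-quadratic $f$, the same step is repeated until convergence, since the Taylor expansion is only a local approximation.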

Note that $f''(x) > 0$ since $f$ is convex, so this critical point minimizes $g$. In other words, moving $x$ by $h$ decreases $f$. In the same manner, the minimum of $f$ in a higher dimension can be sought as well. Suppose that $f$ is a function of $x \in \mathbb{R}^n$ and its Hessian matrix $H_f$ is positive definite, which is the higher-dimensional counterpart of the convexity assumed above. By Taylor’s theorem, \begin{aligned} f(x + s) \approx f(x) + \nabla f(x)^t s + \frac{1}{2} s^t H_f(x) s \end{aligned}

Following the one-dimensional argument, $x$ should be moved by the step $s$, that is, by the distance $\Vert s \Vert$ in the direction of $s$. This $s$ can be obtained by solving the following linear system. \begin{aligned} H_f(x) s = - \nabla f(x) \end{aligned}
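The $n$-dimensional step can be sketched the same way. For a convex quadratic $f(x) = \frac{1}{2} x^t A x - b^t x$ (an illustrative choice with a constant Hessian $A$), one Newton step from any starting point lands on the minimizer $A^{-1} b$.

```python
import numpy as np

# Newton step in R^n: solve H_f(x) s = -grad f(x).
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite Hessian (constant here)
b = np.array([1.0, 1.0])

x = np.array([5.0, -4.0])
grad = A @ x - b                          # gradient of (1/2) x^t A x - b^t x
s = np.linalg.solve(A, -grad)             # Newton step
print(x + s)                              # equals np.linalg.solve(A, b)
```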

Returning to the original problem, this result implies that $\phi(x)$ can be minimized by finding $s$ such that $H_{\phi}(x)s = - \nabla \phi(x)$. More specifically, \begin{aligned} H_{\phi}(x) s = - \nabla \phi(x) \iff \left( J(x)^t J(x) + \sum_{i=1}^m r_i(x) H_{r_i}(x) \right) s = - J(x)^t r(x) \end{aligned}

This linear system can be solved by the well-known methods described below.

### Gauss-Newton Method

This method assumes that the term $\sum_{i=1}^m r_i(x) H_{r_i}(x)$ is negligible when the residual function $r$ is small. This holds when the model function $f$ fits the data well, so that $r_i(x) = y_i - f(t_i, x)$ is small. In this case, \begin{aligned} J(x)^t J(x) s \approx - J(x)^t r(x) \end{aligned}

This is precisely the normal equation for the linear least squares problem $J(x)s \cong -r(x)$. In effect, the Gauss-Newton method replaces a nonlinear least squares problem by a sequence of linear least squares problems. However, there are some drawbacks. The iteration may fail to converge when it is not started close enough to the solution. Moreover, if $\sum_{i=1}^m r_i(x) H_{r_i}(x)$ is not negligible, the approximation may be inaccurate. Finally, the solution is hard to compute when $H_{\phi}(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned.
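A minimal Gauss-Newton sketch, under illustrative assumptions: the model $f(t, x) = x_1 e^{x_2 t}$, synthetic data generated with true parameters $(2, -1)$, and a starting guess close enough for convergence. Each iteration solves the linear least squares problem $J(x) s \cong -r(x)$.

```python
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 2.0 * np.exp(-1.0 * t)            # synthetic data; true x = (2, -1)

def residual(x):
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([-e, -x[0] * t * e])

x = np.array([1.0, 0.0])              # starting guess (must be close enough)
for _ in range(20):
    # Linear least squares subproblem J(x) s ≈ -r(x)
    s, *_ = np.linalg.lstsq(jacobian(x), -residual(x), rcond=None)
    x = x + s
print(x)                              # approaches (2, -1)
```

Because the synthetic data are noise-free, the residual at the solution is zero, so the neglected term $\sum_i r_i H_{r_i}$ vanishes and the method converges quickly.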

### Levenberg-Marquardt Method

This method is useful even when $H_{\phi}(x)$, $J(x)^t J(x)$, or $J(x)$ is ill-conditioned or rank-deficient. It does not ignore the term $\sum_{i=1}^m r_i(x) H_{r_i}(x)$, but replaces it with $\mu I$ for a scalar $\mu > 0$. \begin{aligned} \left( J(x)^t J(x) + \mu I \right) s \approx - J(x)^t r(x) \end{aligned}

This technique, which is called regularization, makes the approximate $H_{\phi}(x)$ positive definite. Its effect is easier to see in the one-dimensional setting. There, the Newton step decreases $f$ when the second derivative is positive, which corresponds to positive definiteness of the Hessian matrix in higher dimensions. If the second derivative of $f$ is not greater than zero, it can be biased by $\mu$ as follows. \begin{aligned} \begin{cases} f''(x) > 0 \implies h = - \frac{f'(x)}{f''(x)} \\ \\ f''(x) \le 0 \implies h = - \frac{f'(x)}{f''(x) + \mu} \quad (f''(x) + \mu > 0) \end{cases} \end{aligned}

Applying this idea to the higher dimension in the same manner, the following correspondence is induced. \begin{aligned} f''(x) + \mu > 0 \quad &(f''(x) \leq 0) \\ \Updownarrow &\quad \\ J(x)^t J(x) + \mu I \text{ is positive definite } \quad &(J(x)^t J(x) \text{ is NOT positive definite}) \end{aligned}

Therefore, it can be said that the regularization term $\mu I$ makes the approximate $H_{\phi}$ positive definite. Depending on how this regularization term is adjusted, the Levenberg-Marquardt method can be interpreted as a weighted combination of the Gauss-Newton and gradient descent methods. \begin{aligned} \begin{cases} J(x)^t J(x) s \approx - J(x)^t r(x) \quad (\mu I = O) \quad &\text{Gauss-Newton} \\\\ s \approx - J(x)^t r(x) = - \nabla \phi(x) \quad (\mu I = I - J(x)^t J(x)) \quad &\text{Gradient Descent} \end{cases} \end{aligned}
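A minimal Levenberg-Marquardt sketch, using the same illustrative model $f(t, x) = x_1 e^{x_2 t}$ and synthetic data as assumptions. The simple $\mu$-update rule here (halve on success, double on failure) is one common heuristic, not the only one.

```python
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 2.0 * np.exp(-1.0 * t)            # synthetic data; true x = (2, -1)

def residual(x):
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([-e, -x[0] * t * e])

x, mu = np.array([1.0, 0.0]), 1.0
for _ in range(50):
    r, J = residual(x), jacobian(x)
    # Damped step: (J^t J + mu I) s = -J^t r
    s = np.linalg.solve(J.T @ J + mu * np.eye(2), -J.T @ r)
    if residual(x + s) @ residual(x + s) < r @ r:
        x, mu = x + s, mu / 2         # accept step, trust Gauss-Newton more
    else:
        mu = mu * 2                   # reject step, lean toward gradient descent
print(x)                              # approaches (2, -1)
```

As $\mu \to 0$ the step tends to the Gauss-Newton step; as $\mu$ grows, the step direction tends to $-J^t r = -\nabla \phi$, matching the interpretation above.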

Meanwhile, when $J(x)$ is ill-conditioned, it has a singular value close to zero. Using the singular value decomposition $J(x) = U \Sigma V^t$, \begin{aligned} &J(x)^t J(x) = (U \Sigma V^t)^t (U \Sigma V^t) = V \Sigma^t U^t U \Sigma V^t = V \Sigma^t \Sigma V^t \\\\ &\implies V^t (J(x)^t J(x)) V = \Sigma^t \Sigma = diag(\sigma_1^2, \cdots, \sigma_n^2) \end{aligned}

where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $\sigma_i$ is the $i$-th singular value of $J(x)$. It means that the eigenvalues of $J(x)^t J(x)$ are the squares of the singular values of $J(x)$. When $J(x)$ is ill-conditioned, for an eigenvalue $\lambda$ close to zero and the eigenvector $v$ corresponding to $\lambda$, \begin{aligned} J(x)^t J(x)v &= \lambda v \implies (J(x)^t J(x) + \mu I) v = \lambda v + \mu v = (\lambda + \mu)v \end{aligned}
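The eigenvalue shift can be checked numerically; the matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# Adding mu*I shifts every eigenvalue of J^t J by exactly mu,
# so each singular value sigma of J effectively becomes sqrt(sigma^2 + mu).
J = np.array([[1.0, 1.0], [1.0, 1.0001], [0.0, 1.0]])   # nearly dependent columns
mu = 0.1
eig_plain = np.sort(np.linalg.eigvalsh(J.T @ J))
eig_shift = np.sort(np.linalg.eigvalsh(J.T @ J + mu * np.eye(2)))
print(eig_shift - eig_plain)    # both entries equal mu
```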

Note that $J(x)^t J(x) + \mu I$ is symmetric since $J(x)^t J(x)$ is symmetric, and it is also positive definite if $\mu$ is chosen well. Accordingly, the Cholesky factorization can be applied to it, writing $J(x)^t J(x) + \mu I = \hat{J}(x)^t \hat{J}(x)$. \begin{aligned} (J(x)^t J(x) + \mu I) v = (\lambda + \mu)v = \hat{J}(x)^t \hat{J}(x) v \end{aligned}

where $\hat{J}(x)^t \hat{J}(x)$ is positive definite and all its eigenvalues are positive. Then $\lambda + \mu$ is an eigenvalue of $\hat{J}(x)^t \hat{J}(x)$, so $\sqrt{\lambda + \mu}$ is a singular value of $\hat{J}(x)$. Therefore, the original singular value $\sqrt{\lambda}$ of $J(x)$ is increased to $\sqrt{\lambda + \mu}$, which moves the ill-conditioned $J(x)$ away from this condition. The linear least squares problem corresponding to the Levenberg-Marquardt method is \begin{aligned} \left(\begin{array}{c} J(x) \\ \sqrt{\mu} \ I \end{array}\right) s \cong \left(\begin{array}{c} -r(x) \\ \mathbf{0} \end{array}\right) \end{aligned}
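The equivalence between this augmented linear least squares problem and the damped normal equations $(J^t J + \mu I) s = -J^t r$ can be verified numerically; the matrices below are arbitrary illustrative data.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((5, 3))
r = rng.standard_normal(5)
mu = 0.25

# Augmented problem [J; sqrt(mu) I] s ≈ [-r; 0]
A = np.vstack([J, np.sqrt(mu) * np.eye(3)])
b = np.concatenate([-r, np.zeros(3)])
s_stacked, *_ = np.linalg.lstsq(A, b, rcond=None)

# Damped normal equations (J^t J + mu I) s = -J^t r
s_normal = np.linalg.solve(J.T @ J + mu * np.eye(3), -J.T @ r)
print(np.max(np.abs(s_stacked - s_normal)))   # essentially zero
```

Solving the stacked system directly (e.g. by QR) avoids forming $J^t J$ explicitly, which is preferable when $J$ is ill-conditioned.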
