# Floating-Point Arithmetic

• When adding two floating-point numbers, the operand with the smaller exponent is first shifted so that its exponent matches the larger one. Consider two decimal numbers with $6$ digits of precision.
\begin{aligned} x &= 1.92403 \times 10^2, \quad y = 6.35782 \times 10^{-1} \\ x + y &= (1.92403 + 0.00635782) \times 10^2 \\ &= (1.92403 + 0.00636) \times 10^2 \quad \text{round-to-nearest} \\ &= 1.93039 \times 10^2 \end{aligned}
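The same six-digit addition can be reproduced with Python's standard `decimal` module. This is only a sketch: the precision of 6 and the round-half-even mode are set here to mirror the example above, and `decimal` rounds the exact sum once instead of rounding the shifted operand first, which happens to give the same result for these inputs.

```python
from decimal import Decimal, getcontext, ROUND_HALF_EVEN

# Six significant decimal digits with round-to-nearest (ties to even),
# chosen to match the worked example.
getcontext().prec = 6
getcontext().rounding = ROUND_HALF_EVEN

x = Decimal("1.92403E+2")
y = Decimal("6.35782E-1")

# The exact sum 193.038782 does not fit in six digits, so it is rounded.
total = x + y
print(total)  # 193.039
```

The trailing digits `782` of the exact sum are lost, which is the rounding error that the cancellation examples amplify.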
• Cancellation: the loss of significant digits caused by subtracting nearly equal numbers, by adding or subtracting a relatively large number and a relatively small number, or by dividing by a small number.
1. For a positive $\epsilon$ well below the machine epsilon, both $1 + \epsilon$ and $1 - \epsilon$ round to $1$, so $(1 + \epsilon) - (1 - \epsilon) = 1 - 1 = 0$, although the exact result is $2\epsilon$.
2. For the quadratic formula $\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$, when $b > 0$ and $b^2 \gg ac$, $\sqrt{b^2 - 4ac} \approx b$, so the $-b + \sqrt{b^2 - 4ac}$ part subtracts nearly equal numbers and is numerically unstable.
3. For performance, the standard deviation $\sigma$
\begin{aligned} \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2} \end{aligned}

where $\overline{x}$ is the mean of the $n$ points $x_1, \cdots, x_n$, can be replaced by the one-pass form
\begin{aligned} \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right)} \end{aligned}

However, the $\left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right)$ part is numerically unstable because it subtracts two large, nearly equal numbers, and the computed value can even be negative.
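The three cases above can be tried in ordinary double precision. The following is a minimal sketch, not a reference implementation: the value `2**-55` for $\epsilon$, the coefficients `a, b, c`, and the four data points are all illustrative choices made here, and the rewritten root formula (dividing $c$ by the larger-magnitude quantity) is one common way to avoid the unstable subtraction.

```python
import math

# Case 1: an epsilon far below the double-precision machine epsilon (2**-52).
eps = 2.0 ** -55
diff = (1.0 + eps) - (1.0 - eps)   # both operands round to 1.0
print(diff, "instead of", 2 * eps)

# Case 2: small root of a*t^2 + b*t + c with b > 0 and b^2 >> a*c.
a, b, c = 1.0, 1e8, 1.0
sqrt_disc = math.sqrt(b * b - 4 * a * c)
naive_small_root = (-b + sqrt_disc) / (2 * a)   # subtracts nearly equal numbers
q = -(b + sqrt_disc) / 2                        # same signs: no cancellation
stable_small_root = c / q                       # product of the roots is c / a
print(naive_small_root, stable_small_root)

# Case 3: sample variance of large values with a small spread.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(data)
mean = sum(data) / n

# Two-pass form: subtract the mean first, then square.
two_pass_var = sum((v - mean) ** 2 for v in data) / (n - 1)

# One-pass form: accumulate squares with a plain loop, subtract n * mean^2.
sum_sq = 0.0
for v in data:
    sum_sq += v * v
one_pass_var = (sum_sq - n * mean * mean) / (n - 1)

print(two_pass_var)   # 30.0
print(one_pass_var)   # negative for this data, so its square root is undefined
```

For this data the deviations from the mean are $-6, -3, 3, 6$, so the exact variance is $30$. The one-pass form loses it entirely because each $x_i^2$ is near $10^{18}$, where adjacent doubles are more than $100$ apart.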

4. For $a = 1.1$ and $x = 123456.789$ stored in single precision, $(x+a)-x$ is not the same as $a$.
\begin{aligned} a = 1.1 &\implies \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} \textcolor{orange}{\quad : (1. f_1 \mathellipsis f_{23})2^0 \ form} \\ x = 123456.789 &\implies \underbrace{\color{plum}0 \color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}1 \color{black}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{0110}_{6} \ \underbrace{0101}_{5} \textcolor{orange}{\quad : (1. f_1 \mathellipsis f_{23})2^{16} \ form} \\ \\ x + a &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ &+ \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}\underbrace{ \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1}_{\text{appeared 16-bit number}} \color{black}{000} \ 1100 \ \color{orangered}\underbrace{ 1100 \ 1100 \ 1100 \ 1101}_{\text{loss}} \\ &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ &+ \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1 \color{black}{000} \ 110 \color{red}1 \quad \color{black}\text{(round-to-nearest)} \\ &= \underbrace{\color{plum}0 \color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}1 \color{black}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{1111}_{F} \ \underbrace{0010}_{2} \\ \\ (x + a) - x &= \color{plum}0 \color{limegreen}{100} \ 
\color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 1111 \ 0010 \\ &- \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ \\ &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}\underbrace{ \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1}_{\text{should be shifted}} \color{black}{000} \ 1101 \\ \\ &= \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1101}_{D} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \\ &\color{red}{\not =} \ \color{black} \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} = a \end{aligned}

In this example, $(x + a) - x \neq a$ for two reasons: the addition of a relatively large number and a relatively small number (the $x+a$ part) discards the low-order bits of $a$, and the subtraction of nearly equal numbers (the $(x + a) - x$ part) leaves only the bits that survived. Therefore, this calculation is numerically unstable.
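The bit patterns above can be checked without third-party packages. This sketch uses the standard `struct` module to round Python floats (binary64) to single precision (binary32). The helper names `to_f32` and `f32_hex` are made up for this illustration. Rounding the double-precision result of a single addition or subtraction back to binary32 gives the same value as performing that operation directly in binary32.

```python
import struct

def to_f32(value: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def f32_hex(value: float) -> str:
    """Return the binary32 bit pattern of value as 8 hex digits."""
    return struct.pack(">f", value).hex().upper()

a = to_f32(1.1)
x = to_f32(123456.789)

s = to_f32(x + a)   # single-precision x + a
d = to_f32(s - x)   # single-precision (x + a) - x

print(f32_hex(a))   # 3F8CCCCD
print(f32_hex(x))   # 47F12065
print(f32_hex(s))   # 47F120F2
print(f32_hex(d))   # 3F8D0000
print(d)            # 1.1015625
```

The result `1.1015625` keeps only the leading bits of $a$ that fit beside $x$ in the 24-bit significand. Computing $a + (x - x)$ instead would return $a$ unchanged.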