Floating-Point Arithmetic

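When adding two floating-point numbers, the operand with the smaller exponent is first rewritten with the larger exponent (its significand is shifted right), and the result is then rounded to the working precision. Consider two decimal numbers with 6 significant digits of precision.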

\begin{aligned} x &= 1.92403 \times 10^2, \quad y = 6.35782 \times 10^{-1} \\ x + y &= (1.92403 + 0.00635782) \times 10^2 \\ &= (1.92403 + 0.00636) \times 10^2 \quad \text{(round-to-nearest)} \\ &= 1.93039 \times 10^2 \end{aligned}
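The same sum can be checked with Python's `decimal` module, as a rough sketch. The 6-digit context `ctx` is an assumption chosen to mirror the example. Note that `decimal` computes the exact sum and rounds once, whereas the hand calculation rounds $y$ before adding; the two agree for these inputs.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# Hypothetical 6-digit decimal format with round-to-nearest (ties to even).
ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)

x = Decimal("1.92403E+2")
y = Decimal("6.35782E-1")

# The exact sum is 193.038782; the context rounds it to 6 significant digits.
total = ctx.add(x, y)
print(total)  # 193.039
```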

Cancellation: a loss of significant digits that arises in the subtraction of nearly equal numbers, the addition or subtraction of a relatively large number and a relatively small number, and division by a small number.

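(a) For a positive $\epsilon$ so small that both $1 + \epsilon$ and $1 - \epsilon$ round to $1$ (in binary round-to-nearest arithmetic, any $\epsilon$ below a quarter of the machine epsilon suffices), $(1 + \epsilon) - (1 - \epsilon) = 1 - 1 = 0$, although it should be $2\epsilon$ in exact arithmetic.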
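A small sketch of (a) in Python, whose `float` is IEEE 754 double precision. Here `sys.float_info.epsilon` is the gap between $1$ and the next larger float ($2^{-52}$), and `tiny` is a name introduced only for this illustration. With the gap itself the difference happens to be exact; the collapse to $0$ appears once the perturbation is small enough to be rounded away on both sides.

```python
import sys

eps = sys.float_info.epsilon   # 2**-52, gap between 1.0 and the next float
tiny = eps / 8                 # 2**-55, too small to survive next to 1.0

# With eps itself, both 1 + eps and 1 - eps are representable.
exact_case = (1.0 + eps) - (1.0 - eps)

# With tiny, both operands round to 1.0 before the subtraction happens.
lost_case = (1.0 + tiny) - (1.0 - tiny)

print(exact_case == 2 * eps)  # True
print(lost_case)              # 0.0, although the real value is 2 * tiny
```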

(b) For the quadratic formula $\cfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$ , when $b > 0$ and $b^2 \gg |ac|$ , we have $\sqrt{b^2 - 4ac} \approx b$ , so the numerator $-b + \sqrt{b^2 - 4ac}$ subtracts two nearly equal numbers and is numerically unstable.
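One common workaround, sketched below, is to compute the root that involves no cancellation first and recover the other one from the product of the roots, $x_1 x_2 = c/a$. The function names `roots_naive` and `roots_stable` and the test coefficients are assumptions made for this illustration; the sketch assumes real roots, $a \neq 0$, and $c \neq 0$.

```python
import math

def roots_naive(a, b, c):
    """Textbook formula; the '+' branch cancels when b > 0 and b*b >> |a*c|."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Add b and the square root with matching signs, so nothing cancels."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    # q / a is the large-magnitude root; c / q follows from x1 * x2 = c / a.
    return c / q, q / a

# x^2 + 1e8 x + 1 = 0 has roots close to -1e-8 and -1e8.
small_naive, _ = roots_naive(1.0, 1e8, 1.0)
small_stable, large_stable = roots_stable(1.0, 1e8, 1.0)
print(small_naive, small_stable)
```

On IEEE 754 doubles the naive small root is off by a large relative error, while the stable variant agrees with $-10^{-8}$ to roughly full precision.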

(c) The sample standard deviation $\begin{aligned} \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2} \end{aligned}$

where $\overline{x}$ is the mean of the $n$ points $x_1, \cdots, x_n$ , requires two passes over the data: one for the mean and one for the deviations. For performance, it can be replaced by the algebraically equivalent one-pass form $\begin{aligned} \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right)} \end{aligned}$

However, the subtraction $\left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right)$ is numerically unstable: when the mean is large relative to the spread, the two terms are nearly equal, and the rounded difference can even be negative.
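The following sketch compares the two formulas on data with a large mean and a small spread. The function names and the sample `data` are assumptions chosen for illustration; the exact sample variance of this data is $30$. Only the variance is computed, so that a negative one-pass result does not raise an error inside a square root.

```python
def variance_two_pass(xs):
    """Mean first, then the sum of squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((v - mean) ** 2 for v in xs) / (n - 1)

def variance_one_pass(xs):
    """Accumulate sum and sum of squares in a single loop, then subtract."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in xs:
        n += 1
        total += v
        total_sq += v * v
    mean = total / n
    # Two terms near 4e18 are subtracted; their low-order digits are gone.
    return (total_sq - n * mean * mean) / (n - 1)

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
two_pass = variance_two_pass(data)
one_pass = variance_one_pass(data)
print(two_pass)  # 30.0
print(one_pass)  # far from 30 in double precision
```

In double precision the squares are already rounded to multiples of $128$, so the one-pass difference cannot recover the true value $90$ of the sum of squared deviations.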

(d) For $a = 1.1$ and $x = 123456.789$ , $(x+a)-x$ may not be the same as $a$ . $\begin{aligned} a = 1.1 &\implies \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} \textcolor{orange}{\quad : (1. f_1 \mathellipsis f_{23})2^0 \ \text{form}} \\ x = 123456.789 &\implies \underbrace{\color{plum}0 \color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}1 \color{black}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{0110}_{6} \ \underbrace{0101}_{5} \textcolor{orange}{\quad : (1. f_1 \mathellipsis f_{23})2^{16} \ \text{form}} \\ \\ x + a &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ &+ \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}\underbrace{ \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1}_{\text{appeared 16-bit number}} \color{black}{000} \ 1100 \ \color{orangered}\underbrace{ 1100 \ 1100 \ 1100 \ 1101}_{\text{loss}} \\ &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ &+ \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1 \color{black}{000} \ 110 \color{red}1 \quad \color{black}\text{(round-to-nearest)} \\ &= \underbrace{\color{plum}0 \color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}1 \color{black}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{1111}_{F} \ 
\underbrace{0010}_{2} \\ \\ (x + a) - x &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 1111 \ 0010 \\ &- \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\ \\ &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}\underbrace{ \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1}_{\text{should be shifted}} \color{black}{000} \ 1101 \\ \\ &= \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1101}_{D} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \\ &\color{red}{\not =} \ \color{black} \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} = a \end{aligned}$

In this example, $(x + a) - x \not = a$ : the addition $x + a$ of a relatively large number and a relatively small number discards the low-order bits of $a$ , and the subtraction $(x + a) - x$ of two nearly equal numbers then exposes that loss. Therefore, this calculation is numerically unstable.
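The bit patterns above can be reproduced with a short sketch that emulates single precision using only the standard library. The helpers `to_f32` and `f32_hex` are names introduced here; rounding each double-precision result back to binary32 is assumed to match true single-precision arithmetic for a single addition or subtraction.

```python
import struct

def to_f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def f32_hex(value):
    """Big-endian hex digits of the binary32 encoding of value."""
    return struct.pack(">f", value).hex().upper()

a = to_f32(1.1)
x = to_f32(123456.789)
s = to_f32(x + a)   # x + a, rounded to single precision
d = to_f32(s - x)   # (x + a) - x, rounded to single precision

print(f32_hex(a))  # 3F8CCCCD
print(f32_hex(x))  # 47F12065
print(f32_hex(s))  # 47F120F2
print(f32_hex(d))  # 3F8D0000
print(d)           # 1.1015625
```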


Jeesun Kim
