Floating-Point Arithmetic

  • When adding two floating-point numbers, the operand with the smaller exponent must first be shifted so that its exponent matches the larger one. Consider two decimal numbers with 6 digits of precision.
$$
\begin{aligned}
x &= 1.92403 \times 10^2, \quad y = 6.35782 \times 10^{-1} \\
x + y &= (1.92403 + 0.00635782) \times 10^2 \\
&= (1.92403 + 0.00636) \times 10^2 \quad \text{(round-to-nearest)} \\
&= 1.93039 \times 10^2
\end{aligned}
$$
  • Cancellation: significant digits are lost when subtracting nearly equal numbers, when adding or subtracting a relatively large number and a relatively small number, and when dividing by a small number.

(a) For a positive $\epsilon$ so small that both $1 + \epsilon$ and $1 - \epsilon$ round to $1$, $(1 + \epsilon) - (1 - \epsilon) = 1 - 1 = 0$, although it should be $2\epsilon$ in real arithmetic.

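(b) For the quadratic formula $\cfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$, when $b > 0$ and $b^2 \gg ac$, we have $\sqrt{b^2 - 4ac} \approx b$, so the $-b + \sqrt{b^2 - 4ac}$ branch of $-b \pm \sqrt{b^2 - 4ac}$ subtracts two nearly equal numbers and is numerically unstable.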
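The cancellation in (b) can be sidestepped by computing the larger-magnitude root first and recovering the other one from the product of the roots, $x_1 x_2 = c/a$. This rearrangement is a common remedy and is not taken from the notes above. The function names `roots_naive` and `roots_stable` are made up for this sketch, which assumes $a \neq 0$, real roots, and a nonzero intermediate `q`.

```python
import math

def roots_naive(a, b, c):
    """Textbook formula; assumes a != 0 and b*b - 4*a*c >= 0."""
    sqrt_disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b + sqrt_disc) / (2.0 * a), (-b - sqrt_disc) / (2.0 * a)

def roots_stable(a, b, c):
    """Rearranged formula (an assumption of this sketch, not from the notes)."""
    sqrt_disc = math.sqrt(b * b - 4.0 * a * c)
    # b and copysign(sqrt_disc, b) have the same sign, so this sum never cancels.
    q = -0.5 * (b + math.copysign(sqrt_disc, b))
    # q / a is the larger-magnitude root; c / q follows from x1 * x2 = c / a.
    # The order of the returned pair may differ from roots_naive when b < 0.
    return c / q, q / a

# b > 0 and b^2 >> ac: the roots are close to -1e-8 and -1e8.
a, b, c = 1.0, 1e8, 1.0
small_naive, large_naive = roots_naive(a, b, c)
small_stable, large_stable = roots_stable(a, b, c)

print("naive  small root:", small_naive)
print("stable small root:", small_stable)
```

In this run only the small root is affected: the naive version loses most of its significant digits, while the large root agrees in both versions.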

(c) For performance, the standard deviation $\sigma$

$$
\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2}
$$

where $\overline{x}$ is the mean of the $n$ points $x_1, \cdots, x_n$, can be replaced by the one-pass form

$$
\sqrt{\frac{1}{n-1} \left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right)}
$$

However, the $\left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right)$ part can be numerically unstable because it subtracts two large, nearly equal numbers, and the computed value can even be negative.
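A small experiment makes the problem in (c) concrete. The sketch below compares the two expressions under the square root on four points with a large mean. The data set and the names `sum_sq_two_pass` and `sum_sq_one_pass` are chosen for illustration only, and the behaviour described assumes IEEE 754 double precision. Explicit loops are used instead of the built-in `sum` so that the accumulation is plain left-to-right addition.

```python
import math

def sum_sq_two_pass(xs):
    """Sum of (x_i - mean)^2, computed after the mean is known."""
    n = len(xs)
    total = 0.0
    for x in xs:
        total += x
    mean = total / n
    acc = 0.0
    for x in xs:
        acc += (x - mean) ** 2
    return acc

def sum_sq_one_pass(xs):
    """Sum of x_i^2 minus n * mean^2, accumulated in a single pass."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq - n * mean * mean

# Illustrative data: small spread around a large mean.
# In exact arithmetic the sum of squared deviations is 90.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

two_pass = sum_sq_two_pass(data)
one_pass = sum_sq_one_pass(data)

print("two-pass sum of squares:", two_pass)
print("one-pass sum of squares:", one_pass)
print("two-pass sigma:", math.sqrt(two_pass / (len(data) - 1)))
if one_pass < 0:
    print("one-pass value is negative, so its square root is undefined")
```

The squares are of order $10^{18}$, where neighbouring doubles are more than 100 apart, so the rounding error in each square is comparable to the quantity being computed.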

(d) For $a = 1.1$ and $x = 123456.789$, $(x+a)-x$ may not be the same as $a$.

$$
\begin{aligned}
a = 1.1 &\implies \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} \textcolor{orange}{\quad : (1. f_1 \mathellipsis f_{23})2^0 \ \text{form}} \\
x = 123456.789 &\implies \underbrace{\color{plum}0 \color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}1 \color{black}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{0110}_{6} \ \underbrace{0101}_{5} \textcolor{orange}{\quad : (1. f_1 \mathellipsis f_{23})2^{16} \ \text{form}} \\
\\
x + a &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\
&+ \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}\underbrace{ \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1}_{\text{appeared 16-bit number}} \color{black}{000} \ 1100 \ \color{orangered}\underbrace{ 1100 \ 1100 \ 1100 \ 1101}_{\text{loss}} \\
&= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\
&+ \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1 \color{black}{000} \ 110 \color{red}1 \quad \color{black}\text{(round-to-nearest)} \\
&= \underbrace{\color{plum}0 \color{limegreen}{100}}_{4} \ \underbrace{\color{limegreen}{0111}}_{7} \ \underbrace{\color{limegreen}1 \color{black}111}_{F} \ \underbrace{0001}_{1} \ \underbrace{0010}_{2} \ \underbrace{0000}_{0} \ \underbrace{1111}_{F} \ \underbrace{0010}_{2} \\
\\
(x + a) - x &= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 1111 \ 0010 \\
&- \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}111 \ 0001 \ 0010 \ 0000 \ 0110 \ 0101 \\
\\
&= \color{plum}0 \color{limegreen}{100} \ \color{limegreen}{0111} \ \color{limegreen}1 \color{black}\underbrace{ \color{red}{000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}{0000} \ \color{red}1}_{\text{should be shifted}} \color{black}{000} \ 1101 \\
\\
&= \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1101}_{D} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \ \underbrace{0000}_{0} \\
&\color{red}{\not =} \ \color{black} \underbrace{\color{plum}0 \color{limegreen}{011}}_{3} \ \underbrace{\color{limegreen}{1111}}_{F} \ \underbrace{\color{limegreen}1 \color{black}000}_{8} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1100}_{C} \ \underbrace{1101}_{D} = a
\end{aligned}
$$

In this example, $(x + a) - x \neq a$. The addition of a relatively large number and a relatively small number, $x + a$, rounds away the low-order bits of $a$, and the subtraction of the two similar numbers, $(x + a) - x$, exposes that loss. Therefore, this calculation is numerically unstable.
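The single-precision bit patterns in (d) can be checked with a short script. Python floats are double precision, so this sketch emulates binary32 by rounding after every operation with the standard `struct` module. The helpers `f32` and `hex32` are names introduced here for illustration. Rounding the double-precision result of one addition or subtraction gives the correctly rounded binary32 result for these operands, because their exact sum and difference fit in a double.

```python
import struct

def f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

def hex32(value):
    """Big-endian hexadecimal bit pattern of a value stored as binary32."""
    return struct.pack(">f", value).hex().upper()

a = f32(1.1)
x = f32(123456.789)
total = f32(x + a)     # low-order bits of a are rounded away here
diff = f32(total - x)  # the subtraction exposes the loss

print("a         =", hex32(a), a)
print("x         =", hex32(x), x)
print("x + a     =", hex32(total), total)
print("(x+a) - x =", hex32(diff), diff)
print("equal to a?", diff == a)
```

The printed patterns correspond to the hexadecimal digits under the braces in the derivation: `3F8CCCCD` for $a$, `47F12065` for $x$, `47F120F2` for $x + a$, and `3F8D0000` for $(x + a) - x$.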


© 2025. All rights reserved.