Floating-Point Number

1. System Format

Suppose that $\beta$ is the radix, or base, $p$ is the precision, and $[L, U]$ is the range of the exponent $E$. Then for $x \in \mathbb{R}$,

$$x = \pm (d_0.d_1 d_2 \dots d_{p-1})_{\beta} \, \beta^E = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_{p-1}}{\beta^{p-1}} \right) \beta^E$$

where each $d_i$ is an integer in $[0, \beta - 1]$.

  • The $p$-digit base-$\beta$ number $d_0 d_1 \dots d_{p-1}$: the mantissa, or significand
  • $d_1 \dots d_{p-1}$ of the mantissa: the fraction
  • $E$: the exponent, or characteristic
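As a concrete illustration (a sketch not in the original text), `std::frexp` decomposes a `double` into a mantissa and a binary exponent; note that `frexp` normalizes the mantissa into $[0.5, 1)$ rather than $[1, \beta)$:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double x = 6.25;               // 6.25 = +(1.1001)_2 * 2^2 in the format above
    int e = 0;
    double m = std::frexp(x, &e);  // decomposes x == m * 2^e with m in [0.5, 1)
    std::printf("%g = %g * 2^%d\n", x, m, e);  // prints: 6.25 = 0.78125 * 2^3
    return 0;
}
```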

2. Normalization

Any $x \neq 0 \in \mathbb{R}$ can be normalized so that $d_0 \neq 0$ and the mantissa $m$ lies in $[1, \beta)$. This normalization is unique and avoids wasting space on leading zeros. In particular, $d_0$ is always $1$ when $\beta = 2$, so it does not have to be stored, which in turn saves one more bit.

  • The total number of normalized floating-point numbers is
$$\underbrace{2}_{\pm} \times \underbrace{(\beta - 1)}_{d_0 \neq 0} \times \underbrace{\beta^{p-1}}_{d_1 \dots d_{p-1}} \times \underbrace{(U - L + 1)}_{E} + \underbrace{1}_{\text{zero}}$$
  • The smallest positive $x$ is $(1.0\dots0)_{\beta} \, \beta^L = \beta^L$.
  • The largest $x$ is
$$[(\beta - 1).(\beta - 1) \dots (\beta - 1)]_{\beta} \, \beta^U = (\beta - 1)\left(1 + \beta^{-1} + \cdots + \beta^{1-p}\right) \beta^U = \beta^{U+1} \left(1 - \beta^{-p}\right)$$
  • In general, floating-point numbers are not uniformly distributed. However, they are uniformly spaced within each interval $[\beta^E, \beta^{E+1})$, that is, for a fixed exponent $E \in \mathbb{Z}$. In this range, the minimal difference between representable numbers is $(0.0\dots1)_{\beta} \, \beta^E = \beta^{1-p} \beta^E = \beta^{E-p+1}$. Moving up to $[\beta^{E+1}, \beta^{E+2})$ multiplies this spacing by $\beta$, as the sketch below demonstrates.
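A minimal check of this doubling, assuming IEEE-754 single precision ($\beta = 2$, $p = 24$) and using `std::nextafterf` to step to the adjacent representable `float`:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Gap between 1.0f and the next float (exponent E = 0): 2^-23.
    std::printf("%g\n", std::nextafterf(1.0f, 2.0f) - 1.0f);  // ~1.19209e-07
    // One exponent interval higher (E = 1), the gap doubles: 2^-22.
    std::printf("%g\n", std::nextafterf(2.0f, 3.0f) - 2.0f);  // ~2.38419e-07
    return 0;
}
```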

[Figure: Interval]

Let $\epsilon$ be the minimal difference between numbers the floating-point system can represent in the first exponent interval $[\beta^L, \beta^{L+1})$. Then the following shows the entire distribution of floating-point numbers.

[Figure: EntireInterval]

The negative part is symmetric to the positive one. Note that there can be integers the floating-point system cannot represent: this happens in $[\beta^{L+k}, \beta^{L+k+1})$ once the spacing satisfies $\epsilon \beta^k > 1$, as the sketch below shows.
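For IEEE-754 single precision ($p = 24$, detailed later), the spacing in $[2^{24}, 2^{25})$ is $2$, so odd integers there are skipped; a minimal check:

```cpp
#include <cstdio>

int main() {
    // Above 2^24 the spacing between floats is 2, so odd integers are skipped.
    float a = 16777216.0f;  // 2^24, exactly representable
    float b = a + 1.0f;     // true value 16777217 rounds back to 16777216
    std::printf("%s\n", a == b ? "equal" : "different");  // prints: equal
    return 0;
}
```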

3. Subnormal (Denormalized) Numbers

Looking at the set of numbers the floating-point system represents, there is an empty gap in $[0, \beta^L]$. This range can also be divided with spacing $\epsilon$, the interval width in $[\beta^L, \beta^{L+1})$. A number in this range can then be represented with $d_0 = 0$ and a nonzero fraction, that is, $\pm (0.d_1 \dots d_{p-1})_{\beta} \, \beta^L$, if some conditions are satisfied which will come later.
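For IEEE-754 single precision (detailed below), both endpoints of this subnormal range are exposed by `std::numeric_limits`; a minimal sketch:

```cpp
#include <cstdio>
#include <limits>

int main() {
    // Smallest positive normalized float: 2^-126.
    std::printf("%g\n", std::numeric_limits<float>::min());         // ~1.17549e-38
    // Smallest positive subnormal float: 2^-23 * 2^-126 = 2^-149.
    std::printf("%g\n", std::numeric_limits<float>::denorm_min());  // ~1.4013e-45
    return 0;
}
```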

4. Rounding

A number the floating-point system can represent exactly is called a machine number. A number it cannot represent exactly must be rounded. Common rounding rules include chopping and round-to-nearest. Here are some examples of both rules when $p = 2$:

$$\begin{array}{ccc} \text{number} & \text{chop} & \text{round-to-nearest} \\ \hline 1.649 & 1.6 & 1.6 \\ 1.650 & 1.6 & 1.6 \\ 1.651 & 1.6 & 1.7 \\ 1.699 & 1.6 & 1.7 \end{array} \qquad \begin{array}{ccc} \text{number} & \text{chop} & \text{round-to-nearest} \\ \hline 1.749 & 1.7 & 1.7 \\ 1.750 & 1.7 & 1.8 \\ 1.751 & 1.7 & 1.8 \\ 1.799 & 1.7 & 1.8 \end{array}$$

Round-to-nearest is also known as round-to-even because, in case of a tie, it rounds to the number whose last digit is even. This rule is the most accurate and unbiased, but also the most expensive. The IEEE standard uses round-to-nearest as its default rounding rule.
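Ties-to-even can be observed directly with `std::rint`, which rounds according to the current rounding mode. This sketch assumes the default round-to-nearest mode is in effect; the halfway values below are exactly representable in binary, so the ties are genuine:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Under round-to-nearest (ties-to-even), halfway cases go to the even neighbor.
    std::printf("%g %g %g %g\n",
                std::rint(0.5), std::rint(1.5),
                std::rint(2.5), std::rint(3.5));  // prints: 0 2 2 4
    return 0;
}
```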

5. Machine Epsilon (Machine Precision)

The floating-point system can be measured by the machine epsilon, machine precision, or unit roundoff, denoted by $\epsilon_{\text{mach}}$. It is the minimal number such that $1 + \epsilon_{\text{mach}} > 1$. Considering that the interval between the exactly representable floating-point numbers in $[1, \beta)$ is $\beta^{1-p}$ because $E = 0$,

[Figure: MachineEpsilon]

$\epsilon_{\text{mach}} = \beta^{1-p}$ with rounding by chopping, and $\epsilon_{\text{mach}} = \frac{\beta^{1-p}}{2}$ with round-to-nearest. Now, consider a floating-point number $x$ that can be exactly represented. Then there are many numbers that are rounded to $x$.

[Figure: RelativeErrors]

Therefore, the relative errors can be calculated as follows:

$$\vert\text{relative error}\vert \le \begin{cases} \left\vert \cfrac{\beta^{E-p+1}}{x} \right\vert = \cfrac{\beta^{E-p+1}}{(d_0.d_1 \dots d_{p-1})_{\beta} \, \beta^E} \le \beta^{1-p} & \text{(chopping)} \\\\ \left\vert \cfrac{\frac{1}{2} \beta^{E-p+1}}{x} \right\vert = \cfrac{\frac{1}{2} \beta^{E-p+1}}{(d_0.d_1 \dots d_{p-1})_{\beta} \, \beta^E} \le \frac{1}{2} \beta^{1-p} & \text{(round-to-nearest)} \end{cases}$$

This means that $\vert \text{relative error} \vert \le \epsilon_{\text{mach}}$.
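The defining property $1 + \epsilon_{\text{mach}} > 1$ can be probed with the classic halving loop (a sketch; it assumes intermediates stay in `float` precision rather than extended precision). Restricted to powers of two, it lands on $2^{-23}$ rather than $2^{-24}$, because the tie $1 + 2^{-24}$ rounds back to $1$ under ties-to-even:

```cpp
#include <cstdio>

int main() {
    // Halve eps until 1 + eps/2 is rounded back to 1; the last eps is 2^-23.
    float eps = 1.0f;
    while (1.0f + eps / 2.0f > 1.0f)
        eps /= 2.0f;
    std::printf("%g\n", eps);  // prints: ~1.19209e-07 = 2^-23
    return 0;
}
```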

6. IEEE Floating-Point Format

This system has $\beta = 2$, $p = 24$, $L = -126$, and $U = 127$ for $32$-bit floating-point numbers.

[Figure: IEEEFormat]

Note that $d_0$ is always $1$ for a normalized number since $\beta = 2$, so the $23$-bit mantissa field stores only $d_1 \dots d_{23}$ even though $p = 24$. The exponent field is $8$ bits wide, so its stored value $E$ lies in $[0, 255]$, but it is biased by $-127$. Requiring $L \le E - 127 \le U$ gives $1 \le E \le 254$, which leaves $E = 0$ and $E = 255$ free to represent special values.

$$\begin{cases} 1 \le E \le 254 \implies \pm (1.d_1 \dots d_{23})_2 \, 2^{E-127} \quad \color{green}\text{normalized} \\\\ E = 0 \quad \begin{cases} \text{mantissa} \neq 0 \implies \pm (0.d_1 \dots d_{23})_2 \, 2^{-126} \quad \color{red}\text{subnormal} \\ \text{mantissa} = 0 \implies \pm 0 \end{cases} \\\\ E = 255 \quad \begin{cases} \text{mantissa} \neq 0 \implies \text{NaN} \\ \text{mantissa} = 0 \implies \pm\infty \end{cases} \end{cases}$$
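To make the layout concrete, here is a small sketch (not from the original text) that unpacks the three fields with C++20's `std::bit_cast`, assuming `float` is a 32-bit IEEE-754 value:

```cpp
#include <bit>      // std::bit_cast (C++20)
#include <cstdint>
#include <cstdio>

int main() {
    float x = -6.25f;  // = -(1.1001)_2 * 2^2
    auto bits = std::bit_cast<std::uint32_t>(x);
    unsigned sign     = bits >> 31;            // 1 sign bit
    unsigned exponent = (bits >> 23) & 0xFFu;  // 8 exponent bits, biased by 127
    unsigned mantissa = bits & 0x7FFFFFu;      // 23 mantissa bits, d1...d23
    std::printf("sign=%u E=%u (E-127=%d) mantissa=0x%06X\n",
                sign, exponent, static_cast<int>(exponent) - 127, mantissa);
    // prints: sign=1 E=129 (E-127=2) mantissa=0x480000
    return 0;
}
```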

  • The smallest positive number is
$$\begin{cases} (1.0 \dots 0)_2 \, 2^{-126} \approx 1.18 \times 10^{-38} & \color{green}\text{normalized} \\\\ (0.0 \dots 1)_2 \, 2^{-126} = 2^{-23} \cdot 2^{-126} \approx 1.4 \times 10^{-45} & \color{red}\text{subnormal} \end{cases}$$
  • The largest number is $(1.1 \dots 1)_2 \, 2^{127} = (1 - 2^{-24}) \, 2^{128} \approx 3.4 \times 10^{38}$.
  • Theoretically, the machine epsilon $\epsilon_{\text{mach}}$ is
$$\epsilon_{\text{mach}} = \begin{cases} \beta^{1-p} = 2^{1-24} = 2^{-23} \approx 10^{-7} & \text{(chopping)} \\\\ \frac{1}{2} \beta^{1-p} = \frac{1}{2} \cdot 2^{1-24} = 2^{-24} \approx 0.6 \times 10^{-7} & \text{(round-to-nearest)} \end{cases}$$

These results can be verified in code as well. Even though the IEEE standard uses round-to-nearest as the default rounding rule, `std::numeric_limits<float>::epsilon()` returns $2^{-23}$, because it is defined as the gap between $1.0$ and the next representable value rather than as the worst-case rounding error. So, in general, $\epsilon_{\text{mach}} = 2^{-23}$ is used. It corresponds to about $7$ significant decimal digits:

$$\log_{10} \epsilon_{\text{mach}} = \log_{10} 2^{-23} \approx -23 \times 0.3010 = -6.923 = -7 + \alpha, \quad \alpha \in [0, 1) \implies \epsilon_{\text{mach}} = 2^{-23} = 10^{-7+\alpha}$$
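A quick verification (the hexadecimal float literal `0x1p-23f` requires C++17):

```cpp
#include <cstdio>
#include <limits>

int main() {
    float eps = std::numeric_limits<float>::epsilon();
    std::printf("%g\n", eps);              // prints: ~1.19209e-07
    std::printf("%d\n", eps == 0x1p-23f);  // prints: 1, i.e. eps is exactly 2^-23
    return 0;
}
```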

7. ULP (Units in the Last Place)

Consider two floating-point numbers that are identical in all respects except for the value of the least significant bit in their mantissas. These two values are said to differ by $1$ ULP. The actual value of $1$ ULP changes depending on the exponent. `1.0f` has an unbiased exponent of zero and a mantissa in which all bits are zero (except for the implicit leading $1$). For this value, $1$ ULP is equal to $\epsilon_{\text{mach}} = 2^{-23}$. In general, if a floating-point value's unbiased exponent is $x$, then $1$ ULP $= 2^x \epsilon_{\text{mach}}$.

Mathematically, the condition $a \geq b$ is equivalent to the condition $a + 1 \text{ ULP} > b$. As a little trick, it is possible to implement $\leq$ and $\geq$ using only $<$ and $>$ by adding or subtracting $1$ ULP to or from the value being compared.
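A sketch of the trick for $\geq$, stepping up by $1$ ULP with `std::nextafterf` (the helper name `greater_equal` is just for illustration):

```cpp
#include <cmath>
#include <cstdio>

// a >= b is implemented using only > by stepping a up by 1 ULP.
bool greater_equal(float a, float b) {
    return std::nextafterf(a, INFINITY) > b;  // a + 1 ULP
}

int main() {
    std::printf("%d %d %d\n",
                greater_equal(1.0f, 1.0f),   // 1
                greater_equal(2.0f, 1.0f),   // 1
                greater_equal(1.0f, 2.0f));  // 0
    return 0;
}
```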


