Floating-Point Number

1. System Format

Suppose that $\beta$ is the radix, or base, $p$ is the precision, and $[L, U]$ is the range of the exponent $E$. Then for $x \in \mathbb{R}$,

$$x = \pm (d_0 . d_1 d_2 \ldots d_{p-1})_{\beta} \, \beta^E = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_{p-1}}{\beta^{p-1}} \right) \beta^E$$

where $d_i$ is an integer in $[0, \beta - 1]$.

  • The $p$-digit base-$\beta$ number $d_0 d_1 \ldots d_{p-1}$: the mantissa, or significand
  • $d_1 \ldots d_{p-1}$ of the mantissa: the fraction
  • $E$: the exponent, or characteristic
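
To make the expansion concrete, here is a toy Python sketch; the helper `value` is only for illustration and not part of any library.

```python
# Evaluate x = ±(d_0.d_1...d_{p-1})_beta * beta**E from its parts.
def value(sign, digits, beta, E):
    # d_0 + d_1/beta + d_2/beta**2 + ...
    m = sum(d * beta**-i for i, d in enumerate(digits))
    return sign * m * beta**E

print(value(+1, [1, 0, 1], 2, 1))  # (1.01)_2 * 2**1 = 2.5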

2. Normalization

For nonzero $x \in \mathbb{R}$, $x$ can be normalized so that $d_0 \neq 0$ and the mantissa $m$ is in $[1, \beta)$. This normalization is unique and saves the space wasted on leading zeros. In particular, $d_0$ is always $1$ when $\beta = 2$, so it does not have to be stored, which saves, in turn, one more bit.

  • The number of normalized floating-point numbers $x$ is
$$\underbrace{2}_{\pm} \times \underbrace{(\beta - 1)}_{d_0 \neq 0} \times \underbrace{\beta^{p-1}}_{d_1 \thicksim d_{p-1}} \times \underbrace{(U - L + 1)}_{E} + \underbrace{1}_{\text{zero}}$$
  • The smallest positive $x$ is $(1.0\ldots0)_{\beta} \, \beta^L = \beta^L$.
  • The largest $x$ is
$$\begin{aligned} [(\beta - 1).(\beta - 1)\ldots(\beta - 1)]_{\beta} \, \beta^U &= (\beta - 1)(1 + \beta^{-1} + \cdots + \beta^{1-p}) \beta^U \\ &= \beta^{U+1}(1 - \beta^{-p}) \end{aligned}$$
  • In general, floating-point numbers are not uniformly distributed. However, they are uniformly spaced within each range $[\beta^E, \beta^{E+1})$ for a fixed exponent $E \in \mathbb{Z}$. In this range, the minimal difference between numbers the floating-point system can represent is $(0.0\ldots1)_{\beta} \, \beta^E = \beta^{1-p} \beta^E = \beta^{E-p+1}$. If the range is changed to $[\beta^{E+1}, \beta^{E+2})$, the minimal difference is multiplied by $\beta$, as checked in the code below.
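
A quick check of this growth, in a minimal Python sketch using NumPy (an assumption of this sketch; `np.spacing` returns the gap from a value to its adjacent float32):

```python
import numpy as np

# The gap between adjacent float32 values grows by a factor of beta = 2
# each time the exponent E increases by one (here beta = 2, p = 24).
for x in [1.0, 2.0, 4.0, 1024.0]:
    print(x, np.spacing(np.float32(x)))
# 1.0     1.1920929e-07   (= 2**-23, E = 0)
# 2.0     2.3841858e-07   (= 2**-22, E = 1)
# 4.0     4.7683716e-07   (= 2**-21, E = 2)
# 1024.0  0.00012207031   (= 2**-13, E = 10)
```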

(Figure: Interval)

Let $\epsilon$ be the minimal difference between numbers the floating-point system can represent in $[\beta^L, \beta^{L+1})$. Then the following shows the entire distribution of floating-point numbers.

(Figure: EntireInterval)

The negative part is symmetric to the positive one. Note that there can be integers which the floating-point system cannot represent once the interval $\epsilon \beta^k$ exceeds $1$.
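
For instance, in the 32-bit IEEE format described in section 6, the spacing above $2^{24}$ is $2^{24-24+1} = 2$, so odd integers there are already unrepresentable. A quick check with NumPy, assuming `np.float32` for that format:

```python
import numpy as np

# Above 2**24 the float32 spacing is 2, so the odd integer
# 2**24 + 1 cannot be represented.
x = np.float32(2**24)        # 16777216.0, exactly representable
y = np.float32(2**24 + 1)    # rounds back to 16777216.0
print(x == y)                # True
```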

3. Subnormal (Denormal) Numbers

Looking at the numbers the floating-point system represents, there is empty space in $[0, \beta^L]$. This range can also be divided by $\epsilon$, the interval used in $[\beta^L, \beta^{L+1})$. A number in this range can then be represented with $d_0 = 0$ and a nonzero mantissa, that is, $\pm (0.d_1 \ldots d_{p-1})_{\beta} \, \beta^L$, if some conditions are satisfied, which will come later.
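
A minimal sketch of this gap-filling in Python, using `np.float32` as a concrete instance (there $\beta^L = 2^{-126}$ and $\epsilon = 2^{-149}$):

```python
import numpy as np

tiny = np.finfo(np.float32).tiny    # smallest positive normalized number, 2**-126
print(tiny)                         # ≈ 1.1754944e-38

# Subnormals fill [0, 2**-126) with the same spacing 2**-149
# used in [2**-126, 2**-125).
print(np.float32(2**-149))          # ≈ 1.4e-45, smallest positive subnormal
print(np.float32(2**-150))          # 0.0: below even the subnormal range
```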

4. Rounding

A number which the floating-point system can exactly represent is called a machine number. However, a number the system cannot represent exactly must be rounded. There are rules for rounding, such as chopping and the round-to-nearest method. Here are some examples of these rules when $p = 2$.

$$\begin{array}{ccc} \text{number} & \text{chop} & \text{round-to-nearest} \\ 1.649 & 1.6 & 1.6 \\ 1.650 & 1.6 & 1.6 \\ 1.651 & 1.6 & 1.7 \\ 1.699 & 1.6 & 1.7 \end{array} \quad \begin{array}{ccc} \text{number} & \text{chop} & \text{round-to-nearest} \\ 1.749 & 1.7 & 1.7 \\ 1.750 & 1.7 & 1.8 \\ 1.751 & 1.7 & 1.8 \\ 1.799 & 1.7 & 1.8 \end{array}$$

Round-to-nearest is also known as round-to-even because, in case of a tie, it rounds to the number whose last digit is even. This rule is the most accurate and unbiased, but also the most expensive. The IEEE standard uses round-to-nearest as the default rule.
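
The table above can be reproduced with Python's `decimal` module, a decimal ($\beta = 10$) stand-in for the same rules: `ROUND_DOWN` plays the role of chopping and `ROUND_HALF_EVEN` is round-to-nearest(-even).

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

for s in ["1.649", "1.650", "1.651", "1.699", "1.749", "1.750", "1.751", "1.799"]:
    x = Decimal(s)
    chop = x.quantize(Decimal("0.1"), rounding=ROUND_DOWN)       # chopping
    near = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)  # round-to-nearest
    print(s, chop, near)
# Ties go to the even last digit: 1.650 -> 1.6 (6 is even), 1.750 -> 1.8 (8 is even).
```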

5. Machine Precision

The floating-point system can be measured by its machine precision, machine epsilon, or unit roundoff, denoted by $\epsilon_{\text{mach}}$. It is the minimal number such that $1 + \epsilon_{\text{mach}} > 1$. Considering that the interval between exactly representable floating-point numbers in $[1, \beta)$ is $\beta^{1-p}$ because $E = 0$,

(Figure: MachineEpsilon)

$\epsilon_{\text{mach}} = \beta^{1-p}$ with rounding by chopping, and $\epsilon_{\text{mach}} = \frac{\beta^{1-p}}{2}$ with round-to-nearest. Now consider a floating-point number $x$ that can be exactly represented. There are many real numbers that are rounded to $x$.

(Figure: RelativeErrors)

Therefore, the relative error can be bounded as follows:
$$\vert\text{relative error}\vert \le \begin{cases} \left\vert \frac{\beta^{E-p+1}}{x} \right\vert = \frac{\beta^{E-p+1}}{(d_0.d_1 \ldots d_{p-1})\beta^E} \le \beta^{1-p} & \text{(chopping)} \\ \left\vert \frac{\frac{1}{2}\beta^{E-p+1}}{x} \right\vert = \frac{\frac{1}{2}\beta^{E-p+1}}{(d_0.d_1 \ldots d_{p-1})\beta^E} \le \frac{1}{2}\beta^{1-p} & \text{(round-to-nearest)} \end{cases}$$

This means that $\vert\text{relative error}\vert \le \epsilon_{\text{mach}}$.
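
A small check of these quantities in Python with NumPy, assuming the IEEE 32-bit format of the next section ($\beta = 2$, $p = 24$):

```python
import numpy as np

# NumPy's eps is the spacing beta**(1-p) = 2**-23 between 1.0 and the next float32.
print(np.finfo(np.float32).eps)                  # ≈ 1.1920929e-07

# With round-to-nearest the unit roundoff is half the spacing, 2**-24:
one = np.float32(1)
print(one + np.float32(2**-24) == one)           # True: a tie, rounded to the even side
print(one + np.float32(2**-24 + 2**-30) > one)   # True: just above 2**-24 is visible
```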

6. IEEE Floating-Point Format

This system has $\beta = 2$, $p = 24$, $L = -126$, and $U = 127$ for $32$-bit floating-point numbers.

(Figure: IEEEFormat)

Note that $d_0$ is always $1$ since $\beta = 2$, so the $23$-bit mantissa field needs to store only $d_1 \ldots d_{23}$ even though $p = 24$. The exponent field is $8$-bit, so it is in $[0, 255]$, but it is biased by $-127$. This yields $L \le E - 127 \le U$, so $1 \le E \le 254$ for normalized numbers. Therefore, the system can represent some special values when $E = 0$ or $E = 255$.
$$\begin{cases} 1 \le E \le 254 \implies \pm (1.d_1 \ldots d_{23})_2 \, 2^{E-127} \quad \color{green}\text{normalized} \\ \\ E = 0 \quad \begin{cases} \text{mantissa} \neq 0 \implies \pm (0.d_1 \ldots d_{23})_2 \, 2^{-126} \quad \color{red}\text{subnormal} \\ \text{mantissa} = 0 \implies \pm 0 \end{cases} \\ \\ E = 255 \quad \begin{cases} \text{mantissa} \neq 0 \implies \text{NaN} \\ \text{mantissa} = 0 \implies \pm \infty \end{cases} \end{cases}$$
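
These three fields can be inspected directly. A minimal sketch with Python's `struct` module; the helper `decode_float32` is only for illustration.

```python
import struct

def decode_float32(x):
    # View the IEEE 754 single-precision bits: 1 sign, 8 exponent, 23 mantissa.
    bits = int.from_bytes(struct.pack(">f", x), "big")
    sign = bits >> 31
    E = (bits >> 23) & 0xFF       # biased exponent field, 0..255
    mantissa = bits & 0x7FFFFF    # fraction d_1 ... d_23
    return sign, E, mantissa

print(decode_float32(1.0))           # (0, 127, 0): +(1.0...0)_2 * 2**(127-127)
print(decode_float32(-2.5))          # (1, 128, 2097152): -(1.01)_2 * 2**1
print(decode_float32(float("inf")))  # (0, 255, 0): E = 255 with zero mantissa
```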

  • The smallest positive number is
$$\begin{cases} (1.0 \ldots 0)_2 \, 2^{-126} \approx 1.2 \times 10^{-38} \quad \color{green}\text{normalized} \\ \\ (0.0 \ldots 1)_2 \, 2^{-126} = 2^{-23} \, 2^{-126} \approx 1.4 \times 10^{-45} \quad \color{red}\text{subnormal} \end{cases}$$
  • The largest number is $(1.1 \ldots 1)_2 \, 2^{127} = (1 - 2^{-24}) \, 2^{128} \approx 3.4 \times 10^{38}$.
  • The machine epsilon $\epsilon_{\text{mach}}$ is
$$\epsilon_{\text{mach}} = \frac{1}{2} \beta^{1-p} = \frac{1}{2} \, 2^{1-24} = 2^{-24} \approx 10^{-7}$$

since the IEEE standard uses round-to-nearest as the default rounding rule. This gives about $7$ decimal digits of precision:
$$\log \epsilon_{\text{mach}} = \log 2^{-24} \approx -24 \times 0.3010 = -8 + \alpha, \quad \alpha \in [0, 1) \implies \epsilon_{\text{mach}} = 2^{-24} = 10^{-8+\alpha} < 10^{-7}$$
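
This roughly $7$-digit precision is easy to observe in Python with NumPy:

```python
import numpy as np

x = np.float32(3.141592653589793)
print(x)  # 3.1415927: only about 7 decimal digits of pi survive in float32

# Perturbations below eps_mach = 2**-24 are invisible at 1.0:
print(np.float32(1) + np.float32(1e-8) == np.float32(1))  # True, since 1e-8 < 2**-24
```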


