MTH4300 Home
# Floating Point Representation

## Binary representation of real numbers

**Theorem** A proper fraction of the form \(\frac{p}{q}\) has finitely many digits in decimal expansion if and only if the only prime factors of \(q\) are \(2\) and \(5\).
**Theorem** Proper fractions of the form \(\frac{p}{q}\) have finite binary expansion if and only if \(q\) is a power of \(2\).
**Problem** 1. Determine the binary expansion of the number \(9.375\).
**Problem** 2. Determine the binary expansion of \(\frac15\).
## Mantissa and exponent

### Normalized mantissa

**Definition** The representation \(x=p\cdot b^q\) of the real number \(x\) in base \(b\) is called *normalized* if the absolute value of the mantissa \(p\) satisfies \[1\leq |p| < b.\]
### Representation of

### Normalization

**Exponent shift.**
**Normal numbers.** If the exponent \(q\) belongs to the range \(2^{e-1} > q > -(2^{e-1}-1)\), then we store the normalized number. Since the mantissa \(p\) satisfies \(1 \leq p < 2\) we know for sure that its digit \(d_0\) is equal to \(1\). We don't need to waste space for its storage. We store only the binary digits \(d_{-1}\), \(d_{-2}\), \(\dots\), \(d_{-m}\). In other words, we store only the digits after the point that separates integer part from fractional part.
**Problem** 3. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(6\) for exponent, and \(9\) for mantissa. Determine the representation of the number \(-23.125\) in this computer.
**Sub-normal numbers.** If the exponent \(q\) from the normalized representation \(x=p\cdot 2^q\) satisfies \(q \leq -(2^{e-1}-1)\), then we will represent the number \(x\) in a non-normalized form
\[x=p'\cdot 2^{-(2^{e-1}-1)}.\] We will store all digits of the mantissa \(p'\) in the allocated \(m\) bits. We will store the value \(0\) in the allocated \(e\) bits for the exponent.
**Problem** 4. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(5\) for exponent, and \(15\) for mantissa. Determine the representations for the numbers \(\frac1{2^{14}}\), \(\frac1{2^{15}}\), and \(\frac1{2^{16}}\) in this computer.
## Difficulties with floating point representation

**Problem** 5. What does the following program print? Explain your answer.
**Problem** 6. What does the following program print on a computer where
**Problem** 7. Create a function
**Problem** 8. The user input consists of a positive number \(M\), a positive integer \(n\), and a sequence \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\) of positive numbers. The person with \(M\) dollars decided to go to store and to spend all of the money. The person has a shopping list of items whose prices are \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\). Create the program that determines whether the person can choose a subset of items from the list whose total price is exactly equal to \(M\).
## Practice

**Problem** 9. Create a program that prints the raw content stored in the memory for each of the following data types:
**Problem** 10.

It is quite rare for numbers to have finitely many digits in their expansions. Only rational numbers can have finitely many digits, and only some of them. The following theorem is easy to prove and is not surprising at all.

The numbers \(\frac{37}{125}\) and \(\frac{71}{50}\) have finitely many digits in decimal expansion. The number \(\frac{37}{3}\) does not. Once we start working with binary expansion, things are even more desperate.

The numbers such as \(\frac1{5}\) and \(\frac1{10}\) have infinite binary expansions. Rational numbers have periodic expansions. Irrational numbers have infinite expansions that are not periodic.

The number \(12345.678\) in base \(10\) can be represented in several different ways:
\begin{eqnarray*}
12345.678&=&12.345678\cdot 10^3\\
&=& 0.012345678\cdot 10^6\\
&=& 1234567.8\cdot 10^{-2}.
\end{eqnarray*}
Each of the representations has two important components: *mantissa* and *exponent*. In the first representation, the mantissa is \(12.345678\) and the exponent is \(3\). In the second representation, the mantissa is \(0.012345678\) and the exponent is \(6\). In the third representation the mantissa is \(1234567.8\) and the exponent is \(-2\).

For every given number, there are infinitely many ways to choose mantissa and exponent. In the previous paragraph we found \(3\) different ways to represent \(12345.678\) in base \(10\). We could easily find infinitely many more ways to do the same thing. The representations using mantissa and exponent are called *floating point* representations because the decimal point can be placed virtually anywhere. The decimal point floats and depends on our choice of the mantissa and exponent. However, one representation is considered special. For our number \(12345.678\), we are going to make an agreement that the following one is the most beautiful and most special of all of the floating point representations: \[12345.678=1.2345678\cdot 10^4.\] The mantissa has exactly one digit to the left of the decimal point. This representation is called *normalized*. The formal definition is

The previous definition is equivalent to saying that there is exactly one digit to the left of the point that separates integer part from the fractional part.

*Note:* Did you notice how we are using the words *point that separates integer part from fractional part*? We will not use the word decimal point anymore. Starting from the next section, the binary representations will be the default one. The attribute *decimal* would be quite wrong and deceiving.

`float`

and `double`

in C++There are multiple data types that are used to store real numbers. They all have in common some basic rules (also known as standard IEEE 754).

The storage space for the real number is divided into three components:

**Component 1: Sign.**The sign occupies 1 bit. If the sign bit is \(0\), the number is considered to be positive. If the sign bit is \(1\), the number is negative.**Component 2: Exponent.**In the remainder of this document we will use the variable \(e\) to denote the length of the exponent. The length of the exponent varies across operating systems and compilers. Currently, most compilers on \(64\)-bit operating systems have \(e=8\) for type`float`

and \(e=11\) for type`double`

.**Component 3: Mantissa.**The length of mantissa will be denoted by \(m\). The standard does not specify how long the mantissa should be. Currently, most compilers have \(m=23\) for type`float`

and \(m=52\) for type`double`

.

For given real number \(x\), we first determine its normalized representation \(x=p\cdot 2^q\). The mantissa \(p\) is normalized, hence it satisfies \(1\leq p<2\).

The exponent \(q\) can be positive or negative, however the number that will be stored is always non-negative. This is achieved with the following rule.

- If the exponent \(q\) satisfies \(2^{e-1} > q > -(2^{e-1}-1)\), then the value \(q+(2^{e-1}-1)\) is stored in the \(e\) bits dedicated to the exponent.
- If the exponent is smaller than or equal to \(-(2^{e-1}-1)\), then the number is \(x\) will be called
*sub-normal*. Such \(x\) is considered too small to be normalized and will be treated differently than normal numbers. The value \(0\) is stored instead of the exponent. - If the exponent \(q\) is greater than or equal to \(2^{e-1}\) then the number is too big to store. We will store \(2^e-1\) in the space dedicated to the exponent. The number \(2^e-1\) will set every single bit of the exponent to \(1\) and this will signify that there is an
*overflow*. The content of the memory will not correspond to a real number. Depending on the value inside the mantissa, the content of the memory will send one of the messages:- The number is \(+\infty\) (if mantissa is \(0\) and the sign is \(0\));
- The number is \(-\infty\) (if mantissa is \(0\) and the sign is \(1\));
- The content of the memory is not a number, commonly known as
`NaN`

(if mantissa is non-zero).

For example, in the case of the most common operating systems and most common compilers, the exponents of numbers of type `float`

are stored with shift-127. The exponents of numbers of type `double`

are stored with shift-1023.

The length of the space for exponent storage is \(e\). Therefore \(2^e\) values can be stored in total. Roughly half of the values will be dedicated for positive, and half for negative exponents.

Special case of a sub-normal number is \(0\). Both the mantissa and exponent are \(0\). The sign bit can be either \(0\) or \(1\). This means that there are two zeroes: \(+0\) and \(-0\). This is intentional. The value \(0\) can be result of a scientific calculation that went too far and obtained sub-normal number that can't be store any more. By analyzing the sign bit, we can get at least an idea whether \(0\) was approached from positive or negative direction.

Earlier we mentioned a bug that happens when working with real numbers. Now it is time to give it a full analysis.

#include<iostream> int main(){ double a=1.0, b=0.1, c=0.0; for(int i=0;i<10;++i){ c = c+b; } if(a==c){ std::cout<<a<<"=="<<c<<"\n"; } else{ std::cout<<a<<"!="<<c<<"\n"; } return 0; }

The following exercise provides more practice with floating point storage.

`int`

and `float`

are stored in memory that occupies \(32\) bits? Provide a justification for your answer.
#include<iostream> int main(){ int x; float d; void* aV=&x; x=2089; for(long i=0;i<19;++i){ x*=2; } float* aD; aD=static_cast<float*>(aV); d=*aD; std::cout<<d<<"\n"; return 0; }

`truncate`

whose input is a positive real number \(d\) of type `double`

that produce the output of type `long`

obtained by keeping only the integer part of \(d\) and discarding the fractional part.**(a)**Solve the problem under the assumption that \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) are of type`long`

.**(b)**Solve the problem under the assumption that the user will provide \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) as real numbers of type`double`

and that each of the numbers from the user input will have decimal expansion with at most two digits after the decimal point.

From the previous paragraphs we can make the following list of rules regarding floating point numbers that should be used to avoid trouble.

- Try to design algorithms that avoid
`float`

and`double`

as much as possible. - If you really heave to use
`float`

and`double`

, don't use the comparison operators`==`

and`!=`

. - If you are making programs that handle money, don't use
`float`

and`double`

.

`int`

, `long`

, `float`

, and `double`

.
The user input consists of two positive integers `e`

and `m`

and a real number `d`

of type `double`

. Create a program that prints the floating point representation of the number `d`

using the storage that has one bit for the sign, `e`

bits for the exponent, and `m`

bits for the mantissa.

Examples:

Input: 6 9 -3.142 Output: 1100000100100100 Input: 11 52 3.141592653589793 Output: 0100000000001001001000011111101101010100010001000010110100011000