
Representation of Integers in Computers

This section deals with the representation of numbers in computers. We will first explore the concept of binary numbers. Then, we will learn how integers are stored in the computer. Finally, we will learn about the floating point representation of real numbers.

Troubles with integers and doubles

We will first look at two examples of code that have bugs. These bugs are examples of unpleasant surprises that can occur if we do not pay attention to how the numbers are stored. Once we learn how the numbers are represented in the computer, we will understand the limitations and tradeoffs that computer designers have to accept. Then we will be able to write code that is free of the unwanted behavior caused by the representations of integers and real numbers.

Example 1: Trouble with integers

#include<iostream>
int main(){
    int x=1 ;
    for(int i=0;i<35;++i){
        std::cout<<i<<"\t"<<x<<"\n";
        x = x*2; 
    }
    return 0;
}

The program prints the powers of \(2\). Everything goes well with the exponents \(0\), \(1\), \(\dots\), \(30\). We get \(2^0=1\), \(2^1=2\), \(2^2=4\), \(2^3=8\), \(2^4=16\), \(2^5=32\), \(\dots\), \(2^{30}=1073741824\). However, this is where it gets surprising and mathematically incorrect: the computer thinks that

\[2^{31}=-2147483648,\mbox{ and } 2^{32}=2^{33}=\cdots = 0.\]

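We will learn that this happens because int is typically stored as a \(32\)-bit number whose first bit is a sign. We will learn even more than that. We will learn how the storage is implemented using a technique called two's complement. Strictly speaking, the C++ standard treats overflow of a signed integer as undefined behavior, so the output above is what typical compilers produce rather than something the language guarantees.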

Example 2: Trouble with real numbers

#include<iostream>
int main(){
    double a=1.0, b=0.1, c=0.0;
    for(int i=0;i<10;++i){ c = c+b; }
    std::cout<<"a="<<a<<"\n"<<"c="<<c<<"\n";
    if(a==c){
        std::cout<<a<<"=="<<c<<"\n";
    }
    else{
        std::cout<<a<<"!="<<c<<"\n";
    }
    return 0;
}

The above code prints the following:

a=1
c=1
1!=1

The two values \(1\) stored in the variables a and c are not the same for our computer. The first is obtained by directly writing \(1\) into the variable a. The second value \(1\) is obtained by adding the number \(0.1\) ten times to the variable c inside the for loop.

Once we learn how the real numbers are stored, we will see that when the number \(0.1\) is written in the binary system, its representation has infinitely many digits. In the decimal system, there is only one digit after the decimal point. In binary, there are infinitely many. Thus, \(0.1\) has to be rounded in the computer. When we add this rounded value ten times, the rounding errors accumulate. The computer sees the result as a number different from \(1\). When the computer displays the result, it rounds it again and prints \(1\) on the output. This is why we get the confusing display 1!=1.

Numbers in base \(b\) and binary numbers

Until the 12th century, the Europeans did not use the decimal system to represent numbers. They wrote I, II, III, IV, and V instead of \(1\), \(2\), \(3\), \(4\), and \(5\). The Chinese did almost the same. The characters for the first five numbers in the Chinese writing system are 一, 二, 三, 四, and 五.

These old traditional number systems need infinitely many symbols to support infinitely many numbers. The decimal system needs only \(10\) symbols: \(0\), \(1\), \(2\), \(\dots\), \(9\). Let us consider a positive integer \(n\) and let us assume that it has \(k\) digits. We will denote the digits from right to left by \(d_0\), \(d_1\), \(\dots\), \(d_{k-1}\). To avoid confusion with multiplication in the case when digits are labeled as variables, we will write a horizontal line over the digits and have an equation of the form \[ n=\overline{d_{k-1}\dots d_1d_0}.\] For example, if \(n=4529\), then we will write \(k=4\), \(d_0=9\), \(d_1=2\), \(d_2=5\), and \(d_3=4\). We can write \(n=\overline{4529}\) as well, but the horizontal line is not necessary because all digits are known and none of them is represented with a variable.

The following equation holds: \[\overline{d_{k-1}\dots d_1d_0}=d_{k-1}\cdot 10^{k-1}+d_{k-2}\cdot 10^{k-2}+\cdots+ d_1\cdot 10^1+d_0\cdot 10^0.\]

We are very used to this representation of numbers. The system is called the decimal system and the number \(10\) is called the base of the decimal system. It is believed that \(10\) was chosen as the base because humans have \(10\) fingers on their hands. Apart from this very old habit, nothing is special about \(10\): we can replace it with any other base and have perfectly well functioning mathematics.

We will now define the representation of integers in an arbitrary base \(b\). Throughout, we assume that \(b\) is an integer bigger than \(1\) and that \(n\) is a positive integer.

Definition The sequence of numbers \(d_0\), \(d_1\), \(\dots\), \(d_{k-1}\) is called a representation of the number \(n\) in base \(b\) if \(d_0\), \(d_1\), \(\dots\), \(d_{k-1}\) are integers from \(\{0, 1, \dots, b-1\}\) that satisfy \[n=d_{k-1}\cdot b^{k-1}+d_{k-2}\cdot b^{k-2}+\cdots + d_1\cdot b^1+d_0\cdot b^0.\] The numbers \(d_0\), \(d_1\), \(\dots\), \(d_{k-1}\) are also called the digits of \(n\) in base \(b\).

Theorem If \(n\) and \(b\) are integers such that \(n\geq 1\) and \(b >1\), then there exists a representation of the integer \(n\) in base \(b\). If we require the first digit \(d_{k-1}\) to be non-zero, then the representation is unique.

This theorem is an easy result from elementary number theory. The uniqueness is proved by using the method of contradiction and the existence is proved using mathematical induction.

Uniqueness. Assume that there are two different representations of the integer \(n\), both with a non-zero first digit. In other words, assume that there are two different sequences \(\left(d_0, d_1, \dots, d_{k-1}\right)\) and \(\left(e_0, e_1, \dots, e_{l-1}\right)\) such that \begin{eqnarray*} n&=&d_{k-1}\cdot b^{k-1}+\cdots + d_1\cdot b^1+d_0\\ &=&e_{l-1}\cdot b^{l-1}+\cdots+ e_1\cdot b^1+e_0. \end{eqnarray*} If the two sequences have different lengths, we extend the shorter one with zeros so that \(d_i\) and \(e_i\) are defined for the same indices. Assume that \(j\) is the biggest index for which \(d_j\neq e_j\). If \(k > l\), then \(j=k-1\). If \(k < l\), then \(j=l-1\). Without loss of generality assume that \(d_j > e_j\). Then we can cancel the terms \(d_{j+1}\cdot b^{j+1}\) and \(e_{j+1}\cdot b^{j+1}\) from both sides of the previous equation because they are equal. We can also cancel the terms \(d_{j+2}\cdot b^{j+2}\) and \(e_{j+2}\cdot b^{j+2}\), and so on. We are left with the equation \begin{eqnarray*} d_j \cdot b^j+ \cdots + d_1\cdot b^1+d_0 &=& e_j\cdot b^j+\cdots+ e_1\cdot b^1+e_0. \end{eqnarray*} Let us denote by \(M\) the number on the left (and the right) side of the previous equation. Since \(d_j\geq e_j+1\) and the digits \(d_{j-1}\), \(\dots\), \(d_1\), \(d_0\) are non-negative, we obtain \begin{eqnarray*} M& \geq & d_j \cdot b^j \geq \left(e_j+1\right)\cdot b^j=e_j\cdot b^j+b^j.\quad\quad\quad\quad(1) \end{eqnarray*} Observe that \[(b-1)+(b-1)\cdot b^1+(b-1)\cdot b^2+\cdots +(b-1)\cdot b^{j-1}=(b-1)\cdot \frac{b^j-1}{b-1}=b^j-1 < b^j.\] Therefore, using \(e_{j-1}\leq b-1\), \(e_{j-2}\leq b-1\), \(\dots\), \(e_1\leq b-1\), and \(e_0\leq b-1\), the inequality (1) becomes \begin{eqnarray*} M& > & e_j\cdot b^j+(b-1)+(b-1)\cdot b^1+(b-1)\cdot b^2+\cdots +(b-1)\cdot b^{j-1}\\ &\geq& e_j\cdot b^j+e_{j-1}\cdot b^{j-1}+\cdots +e_1\cdot b^1+e_0\\ &=& M. \end{eqnarray*} This is a contradiction.

Existence. We use the principle of mathematical induction to prove a stronger statement. If \(n\) is a non-negative integer smaller than \(b^k\), then there exists a sequence \(\left(d_0, d_1, \dots, d_{k-1}\right)\) of length \(k\) whose elements are from the set \(\{0, 1, \dots, b-1\}\) such that \[n=d_{k-1}\cdot b^{k-1}+\cdots + d_1\cdot b^1+d_0\cdot b^0.\]

The statement is obvious for \(k=1\): we take \(d_0=n\). Assume that \(k\geq 1\) and that the statement holds for \(k\). We will now prove that it holds for \(k+1\). Let \(n\) be a non-negative integer smaller than \(b^{k+1}\). When divided by \(b^k\) the number \(n\) gives a quotient and remainder. Let us denote by \(q\) and \(r\) the unique non-negative integers such that \(n=q\cdot b^k+r\) and \(r\in\{0, 1, \dots, b^k-1\}\). Since \(n < b^{k+1}\), we must have \(q\leq b-1\), so \(q\) is a valid digit. According to the induction hypothesis there exists a sequence \(d_0\), \(d_1\), \(\dots\), \(d_{k-1}\) for which \[r=d_{k-1}\cdot b^{k-1}+\cdots + d_1\cdot b+ d_0.\] By defining \(d_k=q\) we obtain \begin{eqnarray*} d_k\cdot b^k+d_{k-1}\cdot b^{k-1}+\cdots + d_1\cdot b+ d_0&=& q\cdot b^k+r=n. \end{eqnarray*} This completes the proof.

Problem 1. If the representation of the number \(n\) in base \(8\) is \(\overline{37105}_8\), what is the decimal representation of \(n\)?

Problem 2. If the representation of the number \(n\) in base \(5\) is \(\overline{2144}_5\), what is the decimal representation of \(n\)?

Problem 3. What is the representation of the number \(352\) in base \(5\)?

Let us denote by \(d_{k-1}\), \(d_{k-2}\), \(\dots\), \(d_2\), \(d_1\), and \(d_0\) the digits of the number \(352\) in base \(5\). The following equation holds \begin{eqnarray*} d_{k-1}\cdot 5^{k-1}+d_{k-2}\cdot 5^{k-2}+\cdots + d_2\cdot 5^2+d_1\cdot 5^1+d_0\cdot 5^0 &= & 352 . \quad\quad\quad\quad\quad(1)\end{eqnarray*} We will find an easy way to obtain the digit \(d_0\) just by looking at the equation (1). Let us analyze the left-hand side of (1). Except for the last term \(d_0\cdot 5^0\), each of the numbers \(d_{k-1}\cdot 5^{k-1}\), \(d_{k-2}\cdot 5^{k-2}\), \(\dots\), \(d_2\cdot 5^2\), and \(d_1\cdot 5^1\) is divisible by \(5\). The right hand side is \(352\). The number \(352\) gives remainder \(2\) when divided by \(5\). Thus, we must have \(d_0=2\). We now place the value \(d_0=2\) in equation (1) and obtain \begin{eqnarray*} d_{k-1}\cdot 5^{k-1}+d_{k-2}\cdot 5^{k-2}+\cdots + d_2\cdot 5^2+d_1\cdot 5^1+2 &= & 352 . \end{eqnarray*} We can cancel 2 from both sides and get \begin{eqnarray*} d_{k-1}\cdot 5^{k-1}+d_{k-2}\cdot 5^{k-2}+\cdots + d_2\cdot 5^2+d_1\cdot 5^1&= & 350 . \end{eqnarray*} Both sides of the previous equation can be divided by \(5\). The equation becomes \begin{eqnarray*} d_{k-1}\cdot 5^{k-2}+d_{k-2}\cdot 5^{k-3}+\cdots + d_2\cdot 5^1+d_1\cdot 5^0 &= & 70 . \quad\quad\quad\quad\quad(2)\end{eqnarray*} We will now use the same logic as before and obtain the digit \(d_1\). Each of the terms \(d_{k-1}\cdot 5^{k-2}\), \(d_{k-2}\cdot 5^{k-3}\), \(\dots\), \(d_2\cdot 5^1\) is divisible by \(5\). The right-hand side is \(70\). The number \(70\) has remainder \(0\) when divided by \(5\). Therefore, \(d_1=0\). 
The equation (2) becomes \begin{eqnarray*} d_{k-1}\cdot 5^{k-2}+d_{k-2}\cdot 5^{k-3}+\cdots + d_2\cdot 5^1&= & 70 .\end{eqnarray*} Dividing both sides by \(5\) gives us \begin{eqnarray*} d_{k-1}\cdot 5^{k-3}+d_{k-2}\cdot 5^{k-4}+\cdots + d_3\cdot 5^1 + d_2\cdot 5^0&= & 14 .\quad\quad\quad\quad\quad(3)\end{eqnarray*} Using the same reasoning as before we conclude that \(d_2=4\). The equation (3) becomes \begin{eqnarray*} d_{k-1}\cdot 5^{k-3}+d_{k-2}\cdot 5^{k-4}+\cdots+d_4\cdot 5^2 + d_3\cdot 5^1 + 4&= & 14 .\end{eqnarray*} We now subtract \(4\) from both sides and divide the remaining quantities by \(5\) to obtain \begin{eqnarray*} d_{k-1}\cdot 5^{k-4}+d_{k-2}\cdot 5^{k-5}+\cdots+d_4\cdot 5^1 + d_3\cdot 5^0&= & 2 .\end{eqnarray*} We immediately conclude that \(d_3=2\) and \(d_4=d_5=\cdots =0\). Therefore, the representation of \(352\) in base \(5\) is \[\overline{352}_{10}=\overline{2402}_5.\]

Definition The representation in base \(2\) is also called a binary representation.

Problem 4. Determine the integer \(n\) whose binary representation is \(\overline{1101011}_2\).

Problem 5. Determine the binary representation of the number \(n=150\).

Let \(d_0\), \(d_1\), \(d_2\), \(\dots\), \(d_{k-1}\) be the binary digits of \(n\). The number \(n\) and its binary digits must satisfy the equation \begin{eqnarray*} 150&=&d_{k-1}\cdot 2^{k-1}+d_{k-2}\cdot 2^{k-2}+\cdots+d_3\cdot 2^3+d_2\cdot 2^2+d_1\cdot 2^1+d_0. \end{eqnarray*} Since \(150\) is divisible by \(2\), we conclude that \(d_0=0\). The last equation becomes \begin{eqnarray*} 150&=&d_{k-1}\cdot 2^{k-1}+d_{k-2}\cdot 2^{k-2}+\cdots+d_3\cdot 2^3+d_2\cdot 2^2+d_1\cdot 2^1. \end{eqnarray*} We can divide both sides by \(2\) and obtain \begin{eqnarray*} 75&=&d_{k-1}\cdot 2^{k-2}+d_{k-2}\cdot 2^{k-3}+\cdots+d_3\cdot 2^2+d_2\cdot 2^1+d_1. \end{eqnarray*} Since \(75\) is odd, we conclude that \(d_1=1\). The last equation becomes \begin{eqnarray*} 75&=&d_{k-1}\cdot 2^{k-2}+d_{k-2}\cdot 2^{k-3}+\cdots+d_3\cdot 2^2+d_2\cdot 2^1+1. \end{eqnarray*} We now subtract \(1\) from both sides and obtain \begin{eqnarray*} 74&=&d_{k-1}\cdot 2^{k-2}+d_{k-2}\cdot 2^{k-3}+\cdots+d_3\cdot 2^2+d_2\cdot 2^1. \end{eqnarray*} We can divide both sides by \(2\) and get: \begin{eqnarray*} 37&=&d_{k-1}\cdot 2^{k-3}+d_{k-2}\cdot 2^{k-4}+\cdots+d_4\cdot 2^2+d_3\cdot 2^1+d_2 . \end{eqnarray*} Since \(37\) is odd, we must have \(d_2=1\). The last equation becomes \begin{eqnarray*} 37&=&d_{k-1}\cdot 2^{k-3}+d_{k-2}\cdot 2^{k-4}+\cdots+d_4\cdot 2^2+ d_3\cdot 2^1+1. \end{eqnarray*} We subtract \(1\) from both sides. \begin{eqnarray*} 36&=&d_{k-1}\cdot 2^{k-3}+d_{k-2}\cdot 2^{k-4}+\cdots+d_4\cdot 2^2+d_3\cdot 2^1. \end{eqnarray*} Dividing both sides by \(2\) gives us \begin{eqnarray*} 18&=&d_{k-1}\cdot 2^{k-4}+d_{k-2}\cdot 2^{k-5}+\cdots+d_5\cdot 2^2+d_4\cdot 2^1+d_3 . \end{eqnarray*} The number \(18\) is divisible by \(2\). Therefore, \(d_3=0\). Dividing both sides by \(2\) gives us \begin{eqnarray*} 9&=&d_{k-1}\cdot 2^{k-5}+d_{k-2}\cdot 2^{k-6}+\cdots+d_6\cdot 2^2+d_5\cdot 2^1+d_4 . \end{eqnarray*} Since \(9\) is not divisible by \(2\), we must have \(d_4=1\). 
We subtract \(1\) from both sides and divide the remaining equation by \(2\) to get \begin{eqnarray*} 4&=&d_{k-1}\cdot 2^{k-6}+d_{k-2}\cdot 2^{k-7}+\cdots+d_7\cdot 2^2+d_6\cdot 2^1+d_5 . \end{eqnarray*} The last equation implies that \(d_5=0\). Dividing both sides by \(2\) gives us \begin{eqnarray*} 2&=&d_{k-1}\cdot 2^{k-7}+d_{k-2}\cdot 2^{k-8}+\cdots+d_8\cdot 2^2+d_7\cdot 2^1+d_6 . \end{eqnarray*} We immediately obtain that \(d_6=0\) and \begin{eqnarray*} 1&=&d_{k-1}\cdot 2^{k-8}+d_{k-2}\cdot 2^{k-9}+\cdots+d_8\cdot 2^1+d_7 . \end{eqnarray*} Finally, we get \(d_7=1\) and \(d_8=d_9=\cdots =0\). Thus, \(k=8\). The binary representation of \(150\) is \[150=\overline{10010110}_2.\]

Problem 6. Create a program that reads a number from the user input and prints its binary representation.

Problem 7. Create a program that reads the string s consisting of digits \(0\) and \(1\) only and determines the integer \(n\) whose binary representation is s.

Unsigned integers

Non-negative integers are the easiest to handle. In computer science they are called unsigned integers because we do not need to worry about the sign. We can assume it is \(+\). The binary digits are the only elements that have to be stored.

Unsigned integers were used a lot in the past, but they are not as popular any more. The computer memory was limited. If your program needed both positive and negative integers, then you would be kind of sad. You would need to devote one entire bit for the sign. If you know that your program does not need negative numbers, you would feel happy and lucky. You would use unsigned integers and keep that extra bit to increase the possible range for the data.

In modern C++ there are two major data types for unsigned integers. The first one is unsigned int and with most current compilers it occupies \(32\) bits. The second type is unsigned long and on many 64-bit systems it takes \(64\) bits. They are rarely used any more.

Signed integers

Let us denote by \(l\) the total number of bits that we are allowed to use for storing our integer. With most current C++ compilers, the type int has \(l=32\). On many 64-bit systems (for example Linux and macOS) the type long has \(l=64\).

We will treat \(l\) as a general number and in our examples we will keep things simple by using \(l\) that is much smaller than \(32\) and \(64\). We will often use \(l=8\).

The maximal positive number that we will be able to store is \(2^{l-1}-1\). The minimal negative number will be \(-2^{l-1}\). Thus, the range of all integers will be \([-2^{l-1},2^{l-1}-1]\). There are exactly \(2^l\) numbers in this range. Instead of storing the number \(x\) itself, the computer memory will contain the two's complement of \(x\). The formal definition is as follows.

Definition For \(x\in[-2^{l-1},2^{l-1}-1]\) the two's complement of \(x\) is the number consisting of the last \(l\) binary digits of \(2^l+x\).

Problem 8. Assume that \(l=8\). Determine the two's complement of the numbers \(0\), \(10\), \(35\), \(-35\), \(-90\), \(127\), \(-128\).

From the previous exercise we can observe that for non-negative \(x\) that satisfies \(x\leq 2^{l-1}-1\), the two's complement is always equal to the binary expansion of \(x\), padded with zeros on the left to \(l\) digits. The number \(2^l+x\) has \(l+1\) binary digits; its left-most digit \(1\) is at the position \(l+1\) and gets discarded. Moreover, the left-most digit of the two's complement is equal to \(0\). However, if \(x\) is a negative number that satisfies \(x\geq -2^{l-1}\), then the number \(2^l+x\) lies between \(2^{l-1}\) and \(2^l-1\), so it has exactly \(l\) binary digits. There won't be any digit to discard. The left-most digit of the two's complement will be \(1\).

Thus, the left-most digit of the two's complement is \(0\) if the number is non-negative, and \(1\) if the number is negative. The left-most digit therefore plays the role of the sign.

Addition, subtraction, and multiplication of signed integers can be performed in the same way as if the integers were unsigned. If there is any carryover that would occupy more than \(l\) bits, that carryover should just be ignored.

We are now ready to analyze the code from Example 1.

Problem 9. What does the following code print?

#include<iostream>
int main(){
    int x=1 ;
    for(int i=0;i<35;++i){
        std::cout<<i<<"\t"<<x<<"\n";
        x = x*2; 
    }
    return 0;
}