Originally published by Robert Beisert at fortcollinsprogram.robert-beisert.com

CS 101 – Floating Point Binary

This section should wrap us up on binary for now.

Floating Point Numbers

Up until this point, we have dealt exclusively with binary representations of integers. Everything we have done has been a whole value, with no fractions.

Floating point numbers let us store those fractions. More accurately, a floating point number can hold a value with a fractional part (such as 0.5 or 58.33) across a very wide range of sizes. The floating point value is made of three components: the sign, the exponent, and the mantissa.

The Sign is a single bit, indicating positive or negative. As before, zero is positive, and one is negative.

The Mantissa is the portion that stores our binary value. If we had the decimal value

5.833 * 10^3

the mantissa would be 5.833. The same holds true in binary arithmetic.

The Exponent is the portion that shifts our mantissa up and down the scale. In the decimal example above, the exponent is 3. Note that we do not have to store the “10^” portion of the equation, as the base is understood. In binary, the understood base is 2.
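To see the same split in binary, here is a small sketch using Python's standard math.frexp. The helper name split_binary is ours, and it assumes a positive, non-zero input. math.frexp returns a fraction between 0.5 and 1, so we double it and subtract one from the power to get the "one digit in front of the point" form used in this post.

```python
import math

def split_binary(x):
    """Split a positive float into (significand, exponent) with 1 <= significand < 2."""
    # math.frexp gives x == fraction * 2**power, with 0.5 <= fraction < 1
    fraction, power = math.frexp(x)
    # Rescale so exactly one binary digit sits in front of the point
    return fraction * 2, power - 1

significand, exponent = split_binary(58.33)
print(significand, exponent)  # 1.8228125 5
```

In other words, 58.33 is 1.8228125 * 2^5, just as 5833 is 5.833 * 10^3.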

Converting to Float

When we convert from decimal to floating point, we use the following steps:

  1. Convert the integer portion to binary (as before)
  2. Convert the fraction portion to binary
  3. Combine the integer portion and the fraction
  4. Shift the point until there is exactly one 1 in front of it
  5. Record the shift (exponent) and the bits after the point.

As our example, we will convert the value 5.833 * 10 ^ 1 into binary. When we expand the decimal, we find the value to be 58.33.

Step 1

Step one is the same as we have done before. This should need no further explanation at this point:

58 -> 111010
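As a quick refresher, the repeated-division method can be sketched like this. The function name int_to_binary is just for illustration, and it assumes a non-negative integer.

```python
def int_to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by two."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # the remainder is the next bit, least significant first
        n //= 2
    return "".join(reversed(bits))  # flip so the most significant bit comes first

print(int_to_binary(58))  # 111010
```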

Step 2

Step two is a bit trickier. In order to convert fractions to binary, we follow these simple steps:

  1. Multiply the fraction by two
  2. Record the value left of the decimal
  3. Repeat steps 1 and 2 on the value to the right of the decimal, until the value is 0 (or you give up)

Using this process, we can do the following:

.33 * 2 = 0.66

.66 * 2 = 1.32

.32 * 2 = 0.64

.64 * 2 = 1.28

.28 * 2 = 0.56

.56 * 2 = 1.12

.12 * 2 = 0.24

.24 * 2 = 0.48

.48 * 2 = 0.96

.96 * 2 = 1.92

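We will stop here. Reading the digit left of the decimal in each result, from top to bottom, gives us the bits after the point. You can see that we have reduced it to approximately

.33 ~= .0101010001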
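The doubling process above can be sketched in a few lines. We use Python's fractions.Fraction here so the arithmetic matches the by-hand decimal work exactly. The function name and the max_bits cutoff (our "or you give up" rule) are assumptions for illustration.

```python
from fractions import Fraction

def fraction_to_binary(frac, max_bits=10):
    """Return the bits after the point for 0 <= frac < 1, by repeated doubling."""
    bits = []
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # the value left of the point: 0 or 1
        bits.append(str(bit))
        frac -= bit            # keep only the part right of the point
    return "".join(bits)

print(fraction_to_binary(Fraction(33, 100)))  # 0101010001
```

Fractions like 0.75 finish on their own (the result is 11), but 0.33 never reaches zero, which is why we have to pick a stopping point.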

Step 3

This part is easy – we simply combine the two portions we have:

58.33 = 111010.0101010001

Step 4

Now we have to shift the point. In this case, because the integer portion left of the point has more than one digit, we will shift the point left until only the leading 1 remains in front of it.

111010.0101010001

<— 5

1.110100101010001

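This gives us an exponent of 5, which we write as 0101: a sign bit of 0, followed by 101 (binary for 5).

IMPORTANT: we must have a sign bit in the exponent, to tell us whether we shifted left or right. In this case, the sign bit is 0, because we shifted left. If we had shifted right, the sign bit would be 1. (This is a simplified scheme for learning. The IEEE formats discussed later store the exponent with a fixed bias added to it instead of a separate sign bit.)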
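Here is one way the shift could be sketched in code, working on the bit strings from steps 1 and 2. The function name normalize is hypothetical. A positive exponent means the point moved left, and a negative one means it moved right.

```python
def normalize(int_bits, frac_bits):
    """Return (exponent, mantissa_bits) so that the value is 1.mantissa * 2**exponent."""
    digits = int_bits + frac_bits
    first_one = digits.find("1")
    if first_one == -1:
        raise ValueError("zero has no leading 1 to shift to")
    # The point starts after len(int_bits) digits and ends just after the first 1
    exponent = len(int_bits) - (first_one + 1)
    mantissa = digits[first_one + 1:]  # everything after the leading 1
    return exponent, mantissa

print(normalize("111010", "0101010001"))  # (5, '110100101010001')
```

For a value below one, such as 0.001 in binary (one eighth), the same function reports an exponent of -3, which is the "shifted right" case.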

Step 5

Finally, we store our final value.

0     0101     110100101010001

And there we have it. In order, we have our sign (0), exponent (0101) and mantissa (110100101010001).

You may have noticed that we’re missing a number. We did not store the 1 to the left of the decimal.

Why? Because we assume it’s there. The computer is designed to put the 1 back when it’s time to work. This saves us one bit, which we use for our sign. Genius, no?
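To convince ourselves that nothing was lost, here is a sketch that decodes the three stored fields back into a number, putting the assumed 1 back first. It follows this post's simplified layout (the first exponent bit is the exponent's sign), not the real IEEE encoding, and the function name decode is ours.

```python
def decode(sign_bit, exponent_bits, mantissa_bits):
    """Rebuild the value from the simplified sign / exponent / mantissa fields."""
    # First exponent bit is its sign: 0 means we shifted left (positive exponent)
    exp_sign = -1 if exponent_bits[0] == "1" else 1
    exponent = exp_sign * int(exponent_bits[1:], 2)
    # Put the assumed leading 1 back in front of the stored mantissa
    significand = int("1" + mantissa_bits, 2)
    value = significand * 2.0 ** (exponent - len(mantissa_bits))
    return -value if sign_bit == "1" else value

print(decode("0", "0101", "110100101010001"))  # 58.3291015625
```

The result is close to 58.33 but not exact, because we stopped after ten fraction bits in step 2.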

Single and Double Precision

C-based programmers may have seen the data types float and double used in example code. On most platforms, these correspond to the IEEE 754 formats for single and double precision floating point numbers, respectively.

Single precision (float) values are 32 bits long, consisting of:

1 sign bit

8 exponent bits

23 mantissa bits
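We can peek at these three fields with Python's standard struct module. This is only a sketch, and the helper name float32_fields is ours. One difference from our hand-worked example: IEEE single precision does not give the exponent its own sign bit. Instead it stores the exponent plus a bias of 127, so we subtract 127 to recover the shift.

```python
import struct

def float32_fields(x):
    """Return (sign, stored_exponent, mantissa) of x as an IEEE 754 single."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = raw & 0x7FFFFF        # 23 bits, leading 1 not stored
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(58.33)
print(sign)                      # 0
print(exponent - 127)            # 5
print(format(mantissa, "023b"))  # starts with 110100101010001
```

The first fifteen mantissa bits match what we worked out by hand. The remaining eight are the extra fraction bits we gave up on.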

For most cases, this is enough precision, as we can store values at 24 significant bits (23 stored, plus the assumed leading 1, equivalent to ~7 significant digits) with…enormous sizes. (Go ahead. Calculate 2^128. We’ll wait.)

When we need more precision, or more range, we employ double precision (double) values. These 64-bit strings are twice the size of single-precision values, consisting of:

1 sign bit

11 exponent bits

52 mantissa bits

As you can see, this vastly increases our precision (equivalent to ~16 significant digits) and gives us an even larger range of sizes.
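A quick way to feel the difference is to squeeze a value through single precision and compare it to the original. Python's own float is a double on typical builds (we are assuming that here), so we round-trip through struct to imitate a C float. The helper name to_float32 is ours.

```python
import struct
import sys

def to_float32(x):
    """Round x to the nearest single-precision value, returned as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

original = 58.33
squeezed = to_float32(original)
print(squeezed == original)          # False: single precision cannot hold every bit
print(abs(squeezed - original))      # the error, which is smaller than 1e-5
print(sys.float_info.mant_dig)       # 53 significant bits in a double
```

Values that need only a few bits, like 0.5, survive the round trip unchanged.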

Number of Bits: Shorthand

This section is too short to get its own post, but needs to be explained. When we talk about the size of registers, RAM, and storage media, we are talking in large quantities of bits. The following is a list of short-hand ways to express how many bits we are talking about.

Byte – 8 bits

Word – some number of bytes (4 bytes for 32-bit machines, 8 bytes for 64-bit machines is reasonably standard)

Kilobyte – One thousand(ish) bytes

Megabyte – One million(ish) bytes

Gigabyte – One billion(ish) bytes

Terabyte – One trillion(ish) bytes

Petabyte – One quadrillion(ish) bytes

Exabyte – One quintillion(ish) bytes

You will notice that I put the “ish” qualifier behind every value. That’s because we have an estimating system when we convert between binary and decimal.

If you have practiced counting by powers of two as you should, you might have noticed that 2^10 = 1024. If you tried counting much higher, you might also have noticed that 2^20 = 1048576.

Do you see the system yet? If not, let me spell it out:

2^10 ~= 10^3

Two to the tenth is approximately equal to ten to the third

That means:

10^6 ~= 2^20

10^9 ~= 2^30

10^(3n) ~= 2^(10n)
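If you want to see how good the estimate is, this short sketch prints both sides of the rule for the kilobyte through exabyte sizes. The function name ratio is just for illustration.

```python
def ratio(n):
    """How many times larger 2**(10n) is than 10**(3n)."""
    return 2 ** (10 * n) / 10 ** (3 * n)

for n in range(1, 7):
    # n = 1 is the kilobyte row, n = 6 is the exabyte row
    print(n, 2 ** (10 * n), 10 ** (3 * n), round(ratio(n), 3))
```

The first row shows a ratio of 1.024. The gap widens as n grows, reaching roughly 15% by the exabyte row, so the "ish" gets a little bigger with every step up.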

Keep that in mind. I guarantee it will come in handy in your CS career.