Floating point notation is essentially the same as scientific notation, only
translated to binary. There are three fields: the sign (which is the sign of
the number), the exponent (some representations have used a separate exponent
sign and exponent magnitude; IEEE format does not), and a significand
(mantissa).
As we discuss the details of the format, you'll find that the motivations
behind some features seem like they should have pushed other features in
directions other than the ones actually chosen. It seems inconsistent to me,
too...
One other thing to mention is that the IEEE floating point format actually
specifies several different formats: a ``single-precision'' format that takes
32 bits (i.e., one word on most machines) to
represent a value, and a ``double-precision'' format that allows for both
greater precision and greater range, but uses 64 bits. We'll be talking about
single-precision here.
The one-bit sign is 0 for positive, or 1 for negative. The representation is
sign-magnitude.
In integers, we use 2's complement for negative numbers because it makes
the arithmetic ``just work;'' we can add two numbers together without regard to
whether they are positive or negative, and get the right answer. This won't
work for floating point numbers because the exponents need to be manipulated;
if we used a 2's complement representation for the entire word we'd have to
reconstruct the exponent any time we wanted to add or subtract, so it wouldn't
gain us anything; in fact, trying to do arithmetic involving a negative number
would involve converting it to positive first.
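One pleasant consequence of sign-magnitude is that negating a floating point
number is nothing more than flipping its top bit. A minimal C sketch (my own
illustration; it reinterprets the float's bits with memcpy):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float x = 2.5f;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bits */
        bits ^= 0x80000000u;              /* flip the sign bit only */
        memcpy(&x, &bits, sizeof x);
        printf("%f\n", x);                /* prints -2.500000 */
        return 0;
    }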
All the same, using the same negative-number representation for integers and
floating point has been done: the CDC 6600, which used 1's complement
arithmetic for integers, also represented floating point numbers by taking the
1's complement of the entire word. The CDC Cyber 205 left the exponent alone,
and represented negatives by taking the 2's complement of the mantissa.
The exponent gives a power of two, rather than a power of ten as in
scientific notation (again, there have been floating point formats using a
power of eight or sixteen; IEEE uses two).
The eight-bit exponent uses excess-127 notation.
What this means is that the exponent is represented in the field by a number
127 greater than its value. Why? Because it lets us use an integer comparison
to tell if one floating point number is larger than another, so long as both
are positive (for two negative numbers the integer comparison comes out
reversed, since the representation is sign-magnitude).
Of course, this is only a benefit if we use the same registers for both
integers and floating point numbers, which has become quite rare today. By the
time you've moved two operands from floating point registers to integer
registers and then performed a comparison, you might as well have just done a
floating point compare. Also, an integer compare will fail to give the right
answer for comparisons involving NaNs, and it treats +0 and -0 as different
values when they ought to compare equal.
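Here's a quick C sketch (my own illustration, not anything from the standard)
of the integer-comparison trick, along with the cases where it breaks down:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Reinterpret a float's bits as an unsigned integer. */
    static uint32_t bits_of(float f) {
        uint32_t b;
        memcpy(&b, &f, sizeof b);
        return b;
    }

    int main(void) {
        /* Works: both positive, so bigger bit pattern = bigger value. */
        printf("%d\n", bits_of(2.5f) < bits_of(4.75f));   /* 1 */

        /* Breaks: two negatives compare in reverse order... */
        printf("%d\n", bits_of(-4.75f) < bits_of(-2.5f)); /* 0, yet -4.75 < -2.5 */

        /* ...and a NaN's bit pattern is just a big integer. */
        float zero = 0.0f;
        float qnan = zero / zero;
        printf("%d\n", bits_of(qnan) > bits_of(1e30f));   /* 1, but NaN is unordered */
        return 0;
    }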
The use of excess-127, instead of excess-128, is also a head-scratcher. Most previous floating point formats using an
excess representation for the exponent used an excess that was a power of two;
this allowed conversion from exponent representation to exponent value (and
vice versa) by simply inverting a bit. I have yet to come across a good
explanation for the use of excess-127.
Using a binary exponent gives us an unexpected benefit. In scientific
notation, we always work with a ``normalized'' number: a number whose mantissa
is at least 1 and less than 10. If a binary floating point number is normalized, it must
have the form 1.f -- the most significant bit must be a 1. Well, if we know
what it is, we don't need to explicitly represent it, right? So we just store
the fraction part in the word, and put in the ``1.'' when we're actually inside
the floating point unit. Sometimes this is called using a ``phantom bit'' or a
``hidden bit.''
Since we're going to fill a 32-bit word, the fraction is 23 bits, but it
represents a 24-bit significand.
A note on mantissas: strictly speaking, a ``mantissa'' is the fractional part
of the logarithm of a number. For instance, if we take log_10 73.2, we get
1.864511; the mantissa is .864511. I've also seen the word used to mean the
fractional part of any decimal number -- in the above example, under that
definition, the mantissa would be .2. The term is also frequently used to mean
the significand of a floating point number; we're going to try to be
consistent and use the term ``significand.''
The value represented by an IEEE floating point number is

    (-1)^s * 1.f * 2^(exp-127)
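To make the formula concrete, here's a small C routine of my own that pulls
the three fields out of a 32-bit pattern and evaluates the formula directly
(normalized numbers only; denormals, infinities, and NaNs come later):

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Decode a normalized single-precision pattern per
       (-1)^s * 1.f * 2^(exp-127).  Ignores denormals, infinities, NaNs. */
    static double decode(uint32_t bits) {
        int s      = (bits >> 31) & 0x1;      /* 1-bit sign */
        int exp    = (bits >> 23) & 0xff;     /* 8-bit excess-127 exponent */
        uint32_t f = bits & 0x7fffff;         /* 23-bit fraction */
        double sig = 1.0 + f / 8388608.0;     /* restore the hidden bit; 2^23 = 8388608 */
        return (s ? -1.0 : 1.0) * sig * pow(2.0, exp - 127);
    }

    int main(void) {
        printf("%g\n", decode(0x40200000));   /* 2.5 */
        printf("%g\n", decode(0x40980000));   /* 4.75 */
        return 0;
    }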
Let's think a minute about just how we do arithmetic operations in
scientific notation:

Addition and subtraction: shift one number's mantissa (adjusting its exponent
as we go) until both exponents match, add or subtract the mantissas, then
renormalize.

Multiplication and division: multiply (or divide) the mantissas, add (or
subtract) the exponents, then renormalize.
Let's add 2.5 + 4.75. First we have to convert 2.5 to its IEEE
representation. The integer part is converted by repeated division by two:

    Old | Old/2 | Bit
     2  |   1   |  0
     1  |   0   |  1

So we get 10_2. The fraction part is converted by repeated doubling:

    Old | Bit | New
    .5  |  1  |  0

So the fraction part is .1_2.

The number we're converting is 10.1_2, which is 1.01 x 2^1. The exponent is
127+1 = 128_10, or 10000000_2, and the fraction is 010...0_2.
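You can check a conversion like this mechanically; here's a quick C sketch
(purely illustrative) that prints the bit pattern of a float in hex:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static void show(float f) {
        uint32_t b;
        memcpy(&b, &f, sizeof b);     /* grab the raw bit pattern */
        printf("%g = %08x\n", f, b);
    }

    int main(void) {
        show(2.5f);    /* 2.5 = 40200000 */
        show(4.75f);   /* 4.75 = 40980000 */
        return 0;
    }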
Converting 4.75 the same way, the two operands break down as:

         | Sign | Exponent | Significand
    2.5  |  0   | 10000000 | 1.01
    4.75 |  0   | 10000001 | 1.0011
Now the addition itself:

1. Compare the exponents, and shift the significand of the number with the
smaller exponent right by the difference. 4.75's exponent is larger by one,
so 2.5's significand 1.01 becomes 0.101, with an exponent of 10000001.

2. Add the significands: 0.101 + 1.0011 = 1.1101. The result is already
normalized, so the exponent stays 10000001.

3. Put the result together: 0 10000001 11010...0_2, or 40e80000_16. One small
point to notice here is that I didn't ever have to figure out what the
exponents meant; I just had to compare them.

Now convert the result back to decimal:

1. Since the exponent field is 10000001, its value is 129-127=2. So the
number's value is 1.1101 x 2^2, or 111.01_2.
2. The integer part, 111_2, is converted back by repeated doubling
(new = 2 x old + bit, working through the bits left to right):

    Old | Bit | New
     0  |  1  |  1
     1  |  1  |  3
     3  |  1  |  7

3. The fraction part, .01_2, is converted by working through the bits from
the right (new = (old + bit) / 2):

    Old | Bit | New
     0  |  1  |  .5
    .5  |  0  |  .25

and we get a final
result of 7.25.
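The whole addition procedure fits in a few lines of C. This is a deliberately
simplified sketch of my own: it assumes both operands are positive and
normalized and does no rounding, which is enough to reproduce the example
above.

    #include <stdio.h>
    #include <stdint.h>

    /* Add two positive, normalized single-precision patterns.
       No rounding, no overflow handling: just the textbook steps. */
    static uint32_t fadd(uint32_t a, uint32_t b) {
        int ea = (a >> 23) & 0xff;               /* excess-127 exponents */
        int eb = (b >> 23) & 0xff;
        uint32_t sa = (a & 0x7fffff) | 0x800000; /* significands with hidden bit */
        uint32_t sb = (b & 0x7fffff) | 0x800000;

        /* Align: shift the significand with the smaller exponent right. */
        if (ea < eb) { sa >>= (eb - ea); ea = eb; }
        else         { sb >>= (ea - eb); }

        uint32_t s = sa + sb;                    /* add significands */
        if (s & 0x1000000) {                     /* carried into bit 24: renormalize */
            s >>= 1;
            ea += 1;
        }
        return ((uint32_t)ea << 23) | (s & 0x7fffff);
    }

    int main(void) {
        printf("%08x\n", fadd(0x40200000, 0x40980000));  /* 40e80000 = 7.25 */
        return 0;
    }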
Let's run an example of multiplication in floating point. We'll use the same
two numbers that we used for addition: 40200000 * 40980000.

First, we find the contents of the sign, exponent, and significand fields.
As before, this gives us:
             | Sign | Exponent | Significand
    40200000 |  0   | 10000000 | 1.01
    40980000 |  0   | 10000001 | 1.0011
So now we apply the standard multiplication algorithm:

1. Determine the exponent by adding the operands' exponents together. The
only catch here is that we've left the exponents in excess-127 notation; if
we just add them, we'll get

    (e1 + 127) + (e2 + 127) = e1 + e2 + 254

so we have to add the exponent fields and subtract 127 (yes, we could have
subtracted 127 from each exponent field, added them, and added 127 back to
the result. But the answer would have been the same, and we would have gone
to some extra work).

    10000000 + 10000001 - 01111111 = 10000010
2. Multiply the significands using the standard multiplication algorithm:

      1.0011
    x   1.01
    --------
     .010011
     .00000
    1.0011
    --------
    1.011111
3. Renormalize. If we'd wound up with two places to the left of the binary
point we would have had to shift one place to the right and add one to the
exponent; here the result is already normalized.

4. Reconstruct the answer as an IEEE floating point number:

    0 10000010 0111110...0 = 413e0000
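Here is the same multiplication procedure as a simplified C sketch of my own
(positive, normalized inputs, truncation instead of proper rounding):

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply two positive, normalized single-precision patterns. */
    static uint32_t fmul(uint32_t a, uint32_t b) {
        int ea = (a >> 23) & 0xff;
        int eb = (b >> 23) & 0xff;
        uint64_t sa = (a & 0x7fffff) | 0x800000;   /* 24-bit significands */
        uint64_t sb = (b & 0x7fffff) | 0x800000;

        int e = ea + eb - 127;                     /* add exponents, remove one excess */
        uint64_t s = (sa * sb) >> 23;              /* scale 48-bit product back down */
        if (s & 0x1000000) {                       /* product in [2,4): shift right */
            s >>= 1;
            e += 1;
        }
        return ((uint32_t)e << 23) | (uint32_t)(s & 0x7fffff);
    }

    int main(void) {
        printf("%08x\n", fmul(0x40200000, 0x40980000));  /* 413e0000 = 11.875 */
        return 0;
    }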
This time let's divide 42340000 / 41100000. We break the numbers up into
fields as before:

             | Sign | Exponent | Significand
    42340000 |  0   | 10000100 | 1.01101
    41100000 |  0   | 10000010 | 1.001
1. Determine the exponent by subtracting the operands' exponents. This time
the excesses cancel out, so we need to add one excess back in; we get

    10000100 - 10000010 + 01111111 = 10000001
2. Perform the standard fractional division operation:

              1.01
            --------
    1.001 ) 1.01101
            1.001
            -------
             .01001
             .00000
            -------
             .01001
             .01001
            -------
             .00000

So our 24-bit significand is 1.0100...0.
3. Renormalize. Our result is already normalized, so we don't need to do
this.

4. Reconstruct the answer as an IEEE floating point number:

    0 10000001 0100...0 = 40a00000
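As with addition and multiplication, the division procedure fits in a few
lines of C. Again a simplified sketch under the same assumptions (positive,
normalized operands, truncating quotient):

    #include <stdio.h>
    #include <stdint.h>

    /* Divide two positive, normalized single-precision patterns. */
    static uint32_t fdiv(uint32_t a, uint32_t b) {
        int ea = (a >> 23) & 0xff;
        int eb = (b >> 23) & 0xff;
        uint64_t sa = (a & 0x7fffff) | 0x800000;
        uint64_t sb = (b & 0x7fffff) | 0x800000;

        int e = ea - eb + 127;             /* subtract exponents, add the excess back */
        uint64_t s = (sa << 23) / sb;      /* quotient scaled to 24 bits */
        if (!(s & 0x800000)) {             /* quotient in [0.5,1): shift left */
            s <<= 1;
            e -= 1;
        }
        return ((uint32_t)e << 23) | (uint32_t)(s & 0x7fffff);
    }

    int main(void) {
        printf("%08x\n", fdiv(0x42340000, 0x41100000));  /* 40a00000 = 5.0 */
        return 0;
    }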
IEEE FP uses a normalized representation where possible, but it also extends
its range, at a cost in the precision of really small numbers, with
``denormals'' (denormalized numbers). These have an exponent field of 0, and
represent 0.f x 2^-126: the hidden bit is taken to be 0 rather than 1, and
the exponent stays fixed at -126.
An exponent field of ff is used for other goodies: if the fraction field is
0, the value is +-infinity; any other fraction means Not a Number (NaN).
So we can express everything possible in the format like this:
    Sign | Exponent | Fraction | Represents         | Notes
     1   |    ff    |  != 0    | NaN                |
     1   |    ff    |   0      | -infinity          |
     1   |  01-fe   | anything | -1.f * 2^(exp-127) |
     1   |    00    |  != 0    | -0.f * 2^-126      | (denormal)
     1   |    00    |   0      | -0                 | (special case of last line)
     0   |    00    |   0      | 0                  | (special case of next line)
     0   |    00    |  != 0    | 0.f * 2^-126       | (denormal)
     0   |  01-fe   | anything | 1.f * 2^(exp-127)  |
     0   |    ff    |   0      | infinity           |
     0   |    ff    |  != 0    | NaN                |
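To make the table concrete, here's a small C classifier of my own that sorts
a bit pattern into these classes (the hex constants are the field masks used
earlier):

    #include <stdio.h>
    #include <stdint.h>

    /* Classify a single-precision bit pattern per the table above. */
    static const char *classify(uint32_t bits) {
        int exp       = (bits >> 23) & 0xff;
        uint32_t frac = bits & 0x7fffff;

        if (exp == 0xff) return frac ? "NaN" : "infinity";
        if (exp == 0x00) return frac ? "denormal" : "zero";
        return "normalized";
    }

    int main(void) {
        printf("%s\n", classify(0x40e80000));  /* normalized (7.25) */
        printf("%s\n", classify(0x00000001));  /* denormal (2^-149, the smallest) */
        printf("%s\n", classify(0x7f800000));  /* infinity */
        printf("%s\n", classify(0x7fc00000));  /* NaN (quiet) */
        printf("%s\n", classify(0x80000000));  /* zero (-0) */
        return 0;
    }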
There are actually two classes of NaNs. If the most significant fraction bit
is 1, it's a ``Quiet NaN'' (QNaN), identifying an indeterminate result. QNaNs
can be used in arithmetic, and propagate freely (so nothing breaks, but when
you're done you get a QNaN telling you the result is meaningless). If the
most significant fraction bit is 0 (and the fraction is nonzero), it's a
``Signaling NaN'' (SNaN), which raises an exception the moment it's used in
an operation.
Operations on the special cases are well defined by the IEEE standard. Any
operation involving a QNaN results in a QNaN; other operations give these
results:
    Operation               | Result
    n / ±Infinity           | 0
    ±Infinity × ±Infinity   | ±Infinity
    ±nonzero / 0            | ±Infinity
    Infinity + Infinity     | Infinity
    ±0 / ±0                 | NaN
    Infinity - Infinity     | NaN
    ±Infinity / ±Infinity   | NaN
    ±Infinity × 0           | NaN
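You can watch these rules in action from C (a quick illustrative program of
my own; the divisions go through variables so the compiler actually performs
them at run time):

    #include <stdio.h>

    int main(void) {
        float zero = 0.0f, one = 1.0f;
        float inf = one / zero;            /* nonzero / 0 = infinity */

        printf("%f\n", one / inf);         /* n / infinity = 0 */
        printf("%f\n", inf * inf);         /* inf * inf = inf */
        printf("%f\n", inf + inf);         /* inf + inf = inf */
        printf("%f\n", zero / zero);       /* 0 / 0 = nan */
        printf("%f\n", inf - inf);         /* inf - inf = nan */
        printf("%f\n", inf / inf);         /* inf / inf = nan */
        printf("%f\n", inf * zero);        /* inf * 0 = nan */
        return 0;
    }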
Double precision works just like single precision, except it's 64 bits: the
exponent is 11 bits (excess-1023), and the fraction is 52 bits.