Floating point notation is essentially the same as scientific notation, only
translated to binary. There are three fields: the sign (which is the sign of
the number), the exponent (some representations have used a separate exponent
sign and exponent magnitude; IEEE format does not), and a significand
(mantissa).
As we discuss the details of the format, you'll find that the motivations
behind some features seem like they should have pushed other features in
directions other than the ones actually chosen. It seems inconsistent to me,
too...
One other thing to mention is that the IEEE floating point format actually
specifies several different formats: a ``single-precision'' format that takes
32 bits (i.e., one word on most machines) to
represent a value, and a ``double-precision'' format that allows for both
greater precision and greater range, but uses 64 bits. We'll be talking about
single-precision here.
The one-bit sign is 0 for positive, or 1 for negative. The representation is
sign-magnitude.
In integers, we use 2's complement for negative numbers because it makes
the arithmetic ``just work;'' we can add two numbers together without regard to
whether they are positive or negative, and get the right answer. This won't
work for floating point numbers because the exponents need to be manipulated;
if we used a 2's complement representation for the entire word we'd have to
reconstruct the exponent any time we wanted to add or subtract, so it wouldn't
gain us anything; in fact, trying to do arithmetic involving a negative number
would involve converting it to positive first.
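One pleasant consequence of sign-magnitude is that negating a floating point
number is nothing more than flipping its top bit. A minimal C sketch (my own
illustration; it reinterprets the float's bits with memcpy):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float x = 2.5f;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bits */
        bits ^= 0x80000000u;              /* flip the sign bit only */
        memcpy(&x, &bits, sizeof x);
        printf("%f\n", x);                /* prints -2.500000 */
        return 0;
    }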
All the same, using the same negative-number representation for integers and
floating point has been done: the CDC 6600, which used 1's complement
arithmetic for integers, also represented floating point numbers by taking the
1's complement of the entire word. The CDC Cyber 205 left the exponent alone,
and represented negatives by taking the 2's complement of the mantissa.
The exponent gives a power of two, rather than a power of ten as in
scientific notation (again, there have been floating point formats using a
power of eight or sixteen; IEEE uses two).
The eight-bit exponent uses excess-127 notation.
What this means is that the exponent is represented in the field by a number
127 greater than its value. Why? Because it lets us use an integer comparison
to tell if one floating point number is larger than another, so long as both
are positive (for two negative numbers the integer comparison comes out
reversed, since the representation is sign-magnitude).
Of course, this is only a benefit if we use the same registers for both
integers and floating point numbers, which has become quite rare today. By the
time you've moved two operands from floating point registers to integer
registers and then performed a comparison, you might as well have just done a
floating point compare. Also, an integer compare will fail to give the right
answer for comparisons involving NaNs, and it treats +0 and -0 as different
values when they ought to compare equal.
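Here's a quick C sketch (my own illustration, not anything from the standard)
of the integer-comparison trick, along with the cases where it breaks down:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Reinterpret a float's bits as an unsigned integer. */
    static uint32_t bits_of(float f) {
        uint32_t b;
        memcpy(&b, &f, sizeof b);
        return b;
    }

    int main(void) {
        /* Works: both positive, so bigger bit pattern = bigger value. */
        printf("%d\n", bits_of(2.5f) < bits_of(4.75f));   /* 1 */

        /* Breaks: two negatives compare in reverse order... */
        printf("%d\n", bits_of(-4.75f) < bits_of(-2.5f)); /* 0, yet -4.75 < -2.5 */

        /* ...and a NaN's bit pattern is just a big integer. */
        float zero = 0.0f;
        float qnan = zero / zero;
        printf("%d\n", bits_of(qnan) > bits_of(1e30f));   /* 1, but NaN is unordered */
        return 0;
    }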
The use of excess-127, instead of excess-128, is also a head-scratcher. Most previous floating point formats using an
excess representation for the exponent used an excess that was a power of two;
this allowed conversion from exponent representation to exponent value (and
vice versa) by simply inverting a bit. I have yet to come across a good
explanation for the use of excess-127.
Using a binary exponent gives us an unexpected benefit. In scientific
notation, we always work with a ``normalized'' number: a number whose mantissa
is at least 1 and less than 10. If a binary floating point number is normalized, it must
have the form 1.f -- the most significant bit must be a 1. Well, if we know
what it is, we don't need to explicitly represent it, right? So we just store
the fraction part in the word, and put in the ``1.'' when we're actually inside
the floating point unit. Sometimes this is called using a ``phantom bit'' or a
``hidden bit.''
Since we're going to fill a 32-bit word, the fraction is 23 bits, but it
represents a 24-bit significand.
A note on mantissas: strictly speaking, a ``mantissa'' is the fractional part
of the logarithm of a number. For instance, if we take log_10 73.2, we get
1.864511; the mantissa is .864511. I've also seen the word used to mean the
fractional part of any decimal number -- in the above example, under that
definition, the mantissa would be .2. The term is also frequently used to mean
the significand of a floating point number; we're going to try to be
consistent and use the term ``significand.''
The value represented by an IEEE floating point number is

    (-1)^s * 1.f * 2^(exp-127)
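To make the formula concrete, here's a small C routine of my own that pulls
the three fields out of a 32-bit pattern and evaluates the formula directly
(normalized numbers only; denormals, infinities, and NaNs come later):

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Decode a normalized single-precision pattern per
       (-1)^s * 1.f * 2^(exp-127).  Ignores denormals, infinities, NaNs. */
    static double decode(uint32_t bits) {
        int s      = (bits >> 31) & 0x1;      /* 1-bit sign */
        int exp    = (bits >> 23) & 0xff;     /* 8-bit excess-127 exponent */
        uint32_t f = bits & 0x7fffff;         /* 23-bit fraction */
        double sig = 1.0 + f / 8388608.0;     /* restore the hidden bit; 2^23 = 8388608 */
        return (s ? -1.0 : 1.0) * sig * pow(2.0, exp - 127);
    }

    int main(void) {
        printf("%g\n", decode(0x40200000));   /* 2.5 */
        printf("%g\n", decode(0x40980000));   /* 4.75 */
        return 0;
    }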
Let's think a minute about just how we do arithmetic operations in
scientific notation:

Addition and subtraction: shift one number's mantissa (adjusting its exponent
as we go) until both exponents match, add or subtract the mantissas, then
renormalize.

Multiplication and division: multiply (or divide) the mantissas, add (or
subtract) the exponents, then renormalize.
Let's add 2.5 + 4.75. First we have to convert 2.5 to its IEEE
representation. The integer part is converted by repeated division by two:

    Old | Old/2 | Bit
     2  |   1   |  0
     1  |   0   |  1

So we get 10_2. The fraction part is converted by repeated doubling:

    Old | Bit | New
    .5  |  1  |  0

So the fraction part is .1_2.

The number we're converting is 10.1_2, which is 1.01 x 2^1. The exponent is
127+1 = 128_10, or 10000000_2, and the fraction is 010...0_2.
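You can check a conversion like this mechanically; here's a quick C sketch
(purely illustrative) that prints the bit pattern of a float in hex:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static void show(float f) {
        uint32_t b;
        memcpy(&b, &f, sizeof b);     /* grab the raw bit pattern */
        printf("%g = %08x\n", f, b);
    }

    int main(void) {
        show(2.5f);    /* 2.5 = 40200000 */
        show(4.75f);   /* 4.75 = 40980000 */
        return 0;
    }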
Converting 4.75 the same way, the two operands break down as:

         | Sign | Exponent | Significand
    2.5  |  0   | 10000000 | 1.01
    4.75 |  0   | 10000001 | 1.0011
Now the addition itself:

1. Compare the exponents, and shift the significand of the number with the
smaller exponent right by the difference. 4.75's exponent is larger by one,
so 2.5's significand 1.01 becomes 0.101, with an exponent of 10000001.

2. Add the significands: 0.101 + 1.0011 = 1.1101. The result is already
normalized, so the exponent stays 10000001.

3. Put the result together: 0 10000001 11010...0_2, or 40e80000_16. One small
point to notice here is that I didn't ever have to figure out what the
exponents meant; I just had to compare them.

Now convert the result back to decimal:

1. Since the exponent field is 10000001, its value is 129-127=2. So the
number's value is 1.1101 x 2^2, or 111.01_2.
2. The integer part, 111_2, is converted back by repeated doubling
(new = 2 x old + bit, working through the bits left to right):

    Old | Bit | New
     0  |  1  |  1
     1  |  1  |  3
     3  |  1  |  7

3. The fraction part, .01_2, is converted by working through the bits from
the right (new = (old + bit) / 2):

    Old | Bit | New
     0  |  1  |  .5
    .5  |  0  |  .25

and we get a final
result of 7.25.
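The whole addition procedure fits in a few lines of C. This is a deliberately
simplified sketch of my own: it assumes both operands are positive and
normalized and does no rounding, which is enough to reproduce the example
above.

    #include <stdio.h>
    #include <stdint.h>

    /* Add two positive, normalized single-precision patterns.
       No rounding, no overflow handling: just the textbook steps. */
    static uint32_t fadd(uint32_t a, uint32_t b) {
        int ea = (a >> 23) & 0xff;               /* excess-127 exponents */
        int eb = (b >> 23) & 0xff;
        uint32_t sa = (a & 0x7fffff) | 0x800000; /* significands with hidden bit */
        uint32_t sb = (b & 0x7fffff) | 0x800000;

        /* Align: shift the significand with the smaller exponent right. */
        if (ea < eb) { sa >>= (eb - ea); ea = eb; }
        else         { sb >>= (ea - eb); }

        uint32_t s = sa + sb;                    /* add significands */
        if (s & 0x1000000) {                     /* carried into bit 24: renormalize */
            s >>= 1;
            ea += 1;
        }
        return ((uint32_t)ea << 23) | (s & 0x7fffff);
    }

    int main(void) {
        printf("%08x\n", fadd(0x40200000, 0x40980000));  /* 40e80000 = 7.25 */
        return 0;
    }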
Let's run an example of multiplication in floating point. We'll use the same
two numbers that we used for addition: 40200000 * 40980000.

First, we find the contents of the sign, exponent, and significand fields.
As before, this gives us:
             | Sign | Exponent | Significand
    40200000 |  0   | 10000000 | 1.01
    40980000 |  0   | 10000001 | 1.0011
So now we apply the standard multiplication algorithm:

1. Determine the exponent by adding the operands' exponents together. The
only catch here is that we've left the exponents in excess-127 notation; if
we just add them, we'll get

    (e1 + 127) + (e2 + 127) = e1 + e2 + 254

so we have to add the exponent fields and subtract 127 (yes, we could have
subtracted 127 from each exponent field, added them, and added 127 back to
the result. But the answer would have been the same, and we would have gone
to some extra work).

    10000000 + 10000001 - 01111111 = 10000010
2. Multiply the significands using the standard multiplication algorithm:

      1.0011
    x   1.01
    --------
     .010011
     .00000
    1.0011
    --------
    1.011111
3. Renormalize. If we'd wound up with two places to the left of the binary
point we would have had to shift one place to the right and add one to the
exponent; here the result is already normalized.

4. Reconstruct the answer as an IEEE floating point number:

    0 10000010 0111110...0 = 413e0000
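Here is the same multiplication procedure as a simplified C sketch of my own
(positive, normalized inputs, truncation instead of proper rounding):

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply two positive, normalized single-precision patterns. */
    static uint32_t fmul(uint32_t a, uint32_t b) {
        int ea = (a >> 23) & 0xff;
        int eb = (b >> 23) & 0xff;
        uint64_t sa = (a & 0x7fffff) | 0x800000;   /* 24-bit significands */
        uint64_t sb = (b & 0x7fffff) | 0x800000;

        int e = ea + eb - 127;                     /* add exponents, remove one excess */
        uint64_t s = (sa * sb) >> 23;              /* scale 48-bit product back down */
        if (s & 0x1000000) {                       /* product in [2,4): shift right */
            s >>= 1;
            e += 1;
        }
        return ((uint32_t)e << 23) | (uint32_t)(s & 0x7fffff);
    }

    int main(void) {
        printf("%08x\n", fmul(0x40200000, 0x40980000));  /* 413e0000 = 11.875 */
        return 0;
    }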
This time let's divide 42340000 / 41100000. We break the numbers up into
fields as before:

             | Sign | Exponent | Significand
    42340000 |  0   | 10000100 | 1.01101
    41100000 |  0   | 10000010 | 1.001
1. Determine the exponent by subtracting the operands' exponents. This time
the excesses cancel out, so we need to add one excess back in; we get

    10000100 - 10000010 + 01111111 = 10000001
2. Perform the standard fractional division operation:

              1.01
            --------
    1.001 ) 1.01101
            1.001
            -------
             .01001
             .00000
            -------
             .01001
             .01001
            -------
             .00000

So our 24-bit significand is 1.0100...0.
3. Renormalize. Our result is already normalized, so we don't need to do
this.

4. Reconstruct the answer as an IEEE floating point number:

    0 10000001 0100...0 = 40a00000
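As with addition and multiplication, the division procedure fits in a few
lines of C. Again a simplified sketch under the same assumptions (positive,
normalized operands, truncating quotient):

    #include <stdio.h>
    #include <stdint.h>

    /* Divide two positive, normalized single-precision patterns. */
    static uint32_t fdiv(uint32_t a, uint32_t b) {
        int ea = (a >> 23) & 0xff;
        int eb = (b >> 23) & 0xff;
        uint64_t sa = (a & 0x7fffff) | 0x800000;
        uint64_t sb = (b & 0x7fffff) | 0x800000;

        int e = ea - eb + 127;             /* subtract exponents, add the excess back */
        uint64_t s = (sa << 23) / sb;      /* quotient scaled to 24 bits */
        if (!(s & 0x800000)) {             /* quotient in [0.5,1): shift left */
            s <<= 1;
            e -= 1;
        }
        return ((uint32_t)e << 23) | (uint32_t)(s & 0x7fffff);
    }

    int main(void) {
        printf("%08x\n", fdiv(0x42340000, 0x41100000));  /* 40a00000 = 5.0 */
        return 0;
    }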
IEEE FP uses a normalized representation where possible, but it also extends
its range, at a cost in the precision of really small numbers, with
``denormals'' (denormalized numbers). These have an exponent field of 0, and
represent 0.f x 2^-126: the hidden bit is taken to be 0 rather than 1, and
the exponent stays fixed at -126.
An exponent field of ff is used for other goodies: if the fraction field is
0, the value is +-infinity; any other fraction means Not a Number (NaN).
So we can express everything possible in the format like this:
    Sign | Exponent | Fraction | Represents         | Notes
     1   |    ff    |  != 0    | NaN                |
     1   |    ff    |   0      | -infinity          |
     1   |  01-fe   | anything | -1.f * 2^(exp-127) |
     1   |    00    |  != 0    | -0.f * 2^-126      | (denormal)
     1   |    00    |   0      | -0                 | (special case of last line)
     0   |    00    |   0      | 0                  | (special case of next line)
     0   |    00    |  != 0    | 0.f * 2^-126       | (denormal)
     0   |  01-fe   | anything | 1.f * 2^(exp-127)  |
     0   |    ff    |   0      | infinity           |
     0   |    ff    |  != 0    | NaN                |
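To make the table concrete, here's a small C classifier of my own that sorts
a bit pattern into these classes (the hex constants are the field masks used
earlier):

    #include <stdio.h>
    #include <stdint.h>

    /* Classify a single-precision bit pattern per the table above. */
    static const char *classify(uint32_t bits) {
        int exp       = (bits >> 23) & 0xff;
        uint32_t frac = bits & 0x7fffff;

        if (exp == 0xff) return frac ? "NaN" : "infinity";
        if (exp == 0x00) return frac ? "denormal" : "zero";
        return "normalized";
    }

    int main(void) {
        printf("%s\n", classify(0x40e80000));  /* normalized (7.25) */
        printf("%s\n", classify(0x00000001));  /* denormal (2^-149, the smallest) */
        printf("%s\n", classify(0x7f800000));  /* infinity */
        printf("%s\n", classify(0x7fc00000));  /* NaN (quiet) */
        printf("%s\n", classify(0x80000000));  /* zero (-0) */
        return 0;
    }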
There are actually two classes of NaNs. If the most significant fraction bit
is 1, it's a ``Quiet NaN'' (QNaN), identifying an indeterminate result. QNaNs
can be used in arithmetic, and propagate freely (so nothing breaks, but when
you're done you get a QNaN telling you the result is meaningless). If the
most significant fraction bit is 0 (and the fraction is nonzero), it's a
``Signaling NaN'' (SNaN), which raises an exception the moment it's used in
an operation.
Operations on the special cases are well defined by the IEEE standard. Any
operation involving a QNaN results in a QNaN; other operations give these
results:
    Operation               | Result
    n / ±Infinity           | 0
    ±Infinity × ±Infinity   | ±Infinity
    ±nonzero / 0            | ±Infinity
    Infinity + Infinity     | Infinity
    ±0 / ±0                 | NaN
    Infinity - Infinity     | NaN
    ±Infinity / ±Infinity   | NaN
    ±Infinity × 0           | NaN
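You can watch these rules in action from C (a quick illustrative program of
my own; the divisions go through variables so the compiler actually performs
them at run time):

    #include <stdio.h>

    int main(void) {
        float zero = 0.0f, one = 1.0f;
        float inf = one / zero;            /* nonzero / 0 = infinity */

        printf("%f\n", one / inf);         /* n / infinity = 0 */
        printf("%f\n", inf * inf);         /* inf * inf = inf */
        printf("%f\n", inf + inf);         /* inf + inf = inf */
        printf("%f\n", zero / zero);       /* 0 / 0 = nan */
        printf("%f\n", inf - inf);         /* inf - inf = nan */
        printf("%f\n", inf / inf);         /* inf / inf = nan */
        printf("%f\n", inf * zero);        /* inf * 0 = nan */
        return 0;
    }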
Double precision works just like single precision, except it's 64 bits: the
exponent is 11 bits (excess-1023), and the fraction is 52 bits.