How to subtract IEEE 754 numbers?

Question

It is really no different from how you do it with pencil and paper. Okay, a little different.

123400 - 5432 = 1.234*10^5 - 5.432*10^3

the bigger number dominates, shift the smaller number’s mantissa off into the bit bucket until the exponents match

1.234*10^5 - 0.05432*10^5

then perform the subtraction with the mantissas

1.234 - 0.05432 = 1.17968
1.17968 * 10^5

Then normalize (which in this case it is)

That was with base 10 numbers.

In IEEE float, single precision

123400 = 0x1E208 = 0b11110001000001000
11110001000001000.000...

To normalize that we have to shift the binary point 16 places to the left, so

1.1110001000001000 * 2^16

The exponent is biased, so we add 127 to 16 and get 143 = 0x8F. It is a positive number, so the sign bit is a 0. Now we start to build the IEEE floating point number. The leading 1 before the decimal point is implied and not stored in single precision, so we get rid of it and keep the fraction.

sign bit, exponent, mantissa

0 10001111 1110001000001000...
0100011111110001000001000...
0100 0111 1111 0001 0000 0100 0...
0x47F10400

And if you write a program to see what a computer thinks 123400 is, you get the same thing:

0x47F10400 123400.000000
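
The integer-to-float steps above can be sketched in code. This is a minimal sketch, not a full converter: the name encode_small_uint is made up for illustration, and it assumes the integer is below 2^24 so that it fits in the 24-bit mantissa and no rounding is needed.

```c
#include <stdint.h>

/* Hypothetical helper: encode an integer n, 0 <= n < 2^24, as
   single-precision bits by hand. Larger values would need rounding,
   which this sketch does not attempt. */
uint32_t encode_small_uint(uint32_t n)
{
    int msb = 23;
    uint32_t exponent, fraction;

    if (n == 0) return 0;                  /* +0.0 is all zero bits */

    /* find the position of the most significant 1 */
    while (((n >> msb) & 1u) == 0) msb--;

    exponent = (uint32_t)msb + 127u;       /* add the bias */
    /* left-justify into 24 bits, then drop the implied leading 1 */
    fraction = (n << (23 - msb)) & 0x7FFFFFu;

    return (exponent << 23) | fraction;    /* sign bit stays 0 */
}
```

For 123400 the most significant 1 is bit 16, so the exponent field becomes 143 = 0x8F, the same value worked out by hand above.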

So we know the exponent and mantissa for the first operand.

Now the second operand

5432 = 0x1538 = 0b0001010100111000

Normalize, shift decimal 12 bits left

1010100111000.000
1.010100111000000 * 2^12

The exponent is biased add 127 and get 139 = 0x8B = 0b10001011

Put it all together

0 10001011 010100111000000
010001011010100111000000
0100 0101 1010 1001 1100 0000...
0x45A9C000

And a computer program/compiler gives the same

0x45A9C000 5432.000000

Now to answer your question. Using the component parts of the floating point numbers, I have restored the implied 1 here because we need it

0 10001111 111100010000010000000000 -  0 10001011 101010011100000000000000
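
Pulling the parts out and restoring the implied 1 might look like this in code. A minimal sketch: the struct and function names are hypothetical, and it assumes a normal number (exponent field neither 0 nor 255), because only normal numbers have the implied 1.

```c
#include <stdint.h>

struct float_parts {          /* hypothetical name */
    uint32_t sign;            /* 0 or 1 */
    uint32_t exponent;        /* biased, 8 bits */
    uint32_t mantissa;        /* 24 bits, implied 1 restored */
};

/* Split a single-precision bit pattern into its component parts.
   Assumes a normal number, so the implied leading 1 is valid. */
struct float_parts unpack(uint32_t bits)
{
    struct float_parts p;
    p.sign     = bits >> 31;
    p.exponent = (bits >> 23) & 0xFFu;
    p.mantissa = (bits & 0x7FFFFFu) | 0x800000u;
    return p;
}
```

Feeding it 0x47F10400 and 0x45A9C000 gives the exponents 0x8F and 0x8B and the 24-bit mantissas 0xF10400 and 0xA9C000 shown above.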

We have to line up our decimal places just like in grade school before we can subtract. In this context that means shifting the number with the smaller exponent right, tossing mantissa bits off the end and adding one to its exponent for each shift, until the exponents match.

0 10001111 111100010000010000000000 -  0 10001011 101010011100000000000000
0 10001111 111100010000010000000000 -  0 10001100 010101001110000000000000
0 10001111 111100010000010000000000 -  0 10001101 001010100111000000000000
0 10001111 111100010000010000000000 -  0 10001110 000101010011100000000000
0 10001111 111100010000010000000000 -  0 10001111 000010101001110000000000
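
The four one-bit shifts above collapse into a single shift by the exponent difference. A minimal sketch with a made-up function name, assuming the target exponent is the larger of the two; bits that fall off the end are simply lost, as in the walkthrough.

```c
#include <stdint.h>

/* Shift the mantissa that belongs to the smaller exponent right until
   it lines up with target_exponent. Assumes target_exponent >= exponent.
   Bits shifted off the end go into the bit bucket. */
uint32_t align_mantissa(uint32_t mantissa, uint32_t exponent,
                        uint32_t target_exponent)
{
    uint32_t shift = target_exponent - exponent;
    /* shifting a 32-bit value by 32 or more is undefined in C */
    return (shift < 32u) ? (mantissa >> shift) : 0u;
}
```

With mantissa 0xA9C000, exponent 0x8B and target 0x8F the shift is 4, giving 0x0A9C00, which matches the last line above.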

Now we can subtract the mantissas. If the sign bits match then we actually subtract; if they don't match then we add. They match, so this will be a subtraction.

Computers perform a subtraction by using addition logic, inverting the second operand on the way into the adder and asserting the carry in bit, like this:

                         1
  111100010000010000000000
+ 111101010110001111111111
==========================

And now just like with paper and pencil lets perform the add

 1111000100000111111111111
  111100010000010000000000
+ 111101010110001111111111
==========================
  111001100110100000000000

or do it with hex on your calculator

111100010000010000000000 = 1111 0001 0000 0100 0000 0000 = 0xF10400
111101010110001111111111 = 1111 0101 0110 0011 1111 1111 = 0xF563FF
0xF10400 + 0xF563FF + 1 = 0x1E66800
1111001100110100000000000 = 1 1110 0110 0110 1000 0000 0000 = 0x1E66800

A little bit about how the hardware works: since this was really a subtract using the adder, some computers invert the carry out bit to make a borrow flag and others leave it as is. Here the raw carry out of the adder is a 1, which is a good thing: it means there was no borrow, so we basically discard it. Had it been a carry out of a zero, the result would have been negative and we would have needed more work. With that carry bit discarded our answer is really 0xE66800.
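
Here is the invert-and-add-one trick as code, limited to 24 bits like the mantissas above. A minimal sketch: sub_via_adder is a made-up name, both inputs are assumed to fit in 24 bits, and the carry out is reported raw, without the inversion some hardware applies.

```c
#include <stdint.h>

/* 24-bit subtract the way an adder does it: a + (~b) + 1.
   Assumes a and b both fit in 24 bits.
   *carry_out receives bit 24 of the sum: 1 means no borrow (a >= b),
   0 means the true result was negative. */
uint32_t sub_via_adder(uint32_t a, uint32_t b, uint32_t *carry_out)
{
    uint32_t sum = a + (~b & 0xFFFFFFu) + 1u;   /* invert b, carry in = 1 */
    *carry_out = (sum >> 24) & 1u;
    return sum & 0xFFFFFFu;                     /* keep the low 24 bits */
}
```

Called with 0xF10400 and 0x0A9C00 it returns 0xE66800 with a carry out of 1. Swapping the operands gives a carry out of 0 and the two's complement of the magnitude, which is the "more work" case.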

Very quickly lets see that another way, instead of inverting and adding one lets just use a calculator

111100010000010000000000 -  000010101001110000000000 = 
0xF10400 - 0x0A9C00 = 
0xE66800

By trying to visualize it I perhaps made it worse. The result of the mantissa subtraction is 111001100110100000000000 (0xE66800). There was no movement in the most significant bit: we end up with a 24 bit number whose msbit is a 1, so no normalization is needed in this case. In general, to normalize you shift the mantissa left or right until the most significant 1 sits in the leftmost of the 24 bits, adjusting the exponent for each bit shifted.
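
Normalization as a loop, for the cases where the result does move. A minimal sketch with a made-up function name; it does not check whether the exponent runs out of range while shifting.

```c
#include <stdint.h>

/* Shift a result until its most significant 1 sits in bit 23,
   adjusting the biased exponent to compensate. Each right shift
   makes the mantissa smaller, so the exponent goes up; each left
   shift makes it bigger, so the exponent goes down. */
uint32_t normalize(uint32_t mantissa, int *exponent)
{
    if (mantissa == 0) return 0;               /* nothing to normalize */

    while (mantissa > 0xFFFFFFu) {             /* 25 bits or more */
        mantissa >>= 1;
        (*exponent)++;
    }
    while ((mantissa & 0x800000u) == 0) {      /* leading 1 too low */
        mantissa <<= 1;
        (*exponent)--;
    }
    return mantissa;
}
```

For 0xE66800 with exponent 0x8F neither loop runs, which is the "no normalization" case above.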

Now stripping the 1. bit off the answer we put the parts together

0 10001111 11001100110100000000000
01000111111001100110100000000000
0100 0111 1110 0110 0110 1000 0000 0000
0x47E66800
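
Putting the parts back together is a few shifts and masks. A minimal sketch with a made-up function name, assuming the mantissa is already normalized to 24 bits.

```c
#include <stdint.h>

/* Assemble sign, biased exponent and 24-bit normalized mantissa into a
   single-precision bit pattern. Masking with 0x7FFFFF strips the
   leading 1, which is implied and not stored. */
uint32_t pack(uint32_t sign, uint32_t exponent, uint32_t mantissa)
{
    return (sign << 31) | ((exponent & 0xFFu) << 23) | (mantissa & 0x7FFFFFu);
}
```

pack(0, 0x8F, 0xE66800) gives 0x47E66800, the value built by hand above.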

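If you have been following along by writing a program to do this, I did as well. This program uses a union to reinterpret the bits of a float as an integer. C99 and later permit reading a union member other than the one last written, but C++ does not, and the program also assumes that float and unsigned int are both 32 bits. I got away with it with my compiler on my computer; don't expect it to work all the time.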

#include <stdio.h>

union
{
    float f;
    unsigned int u;
} myun;


int main ( void )
{
    float a,b,c;

    a=123400;
    b=  5432;

    c=a-b;

    myun.f=a; printf("0x%08X %f\n",myun.u,myun.f);
    myun.f=b; printf("0x%08X %f\n",myun.u,myun.f);
    myun.f=c; printf("0x%08X %f\n",myun.u,myun.f);

    return(0);
}

And our result matches the output of the above program, we got a 0x47E66800 doing it by hand

0x47F10400 123400.000000
0x45A9C000 5432.000000
0x47E66800 117968.000000
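
If the union bothers you, copying the bytes with memcpy is a common alternative for getting at the bits. A minimal sketch: float_to_bits is a made-up name, and it still assumes float is a 32-bit IEEE 754 single on your platform.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bytes of a float into a 32-bit integer.
   Assumes sizeof(float) == 4 and IEEE 754 single precision. */
uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Printing float_to_bits(a) with "0x%08X" in place of myun.u should produce the same three lines on such a platform.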

If you are writing a program to synthesize the floating point math, your program can simply perform the subtract; you don't have to do the invert and add one thing, which overcomplicates it as we saw above. If you get a negative result, though, you need to flip the sign bit and negate the result (take its two's complement) to get the magnitude back, then normalize.

So:

1) extract the parts, sign, exponent, mantissa.

2) Align your decimal places by sacrificing mantissa bits from the number with the smallest exponent, shift that mantissa to the right until the exponents match

3) being a subtract operation if the sign bits are the same then you perform a subtract, if the sign bits are different you perform an add of the mantissas.

4) if the result is a zero then your answer is a zero, encode the IEEE value for zero as the result, otherwise:

5) normalize the number: shift the answer to the right or left until you have a 24 bit number with the most significant one left justified. (The answer can be 25 bits from a 24 bit add/subtract, and the shift needed to normalize can be dramatic: at most one bit to the right, but possibly many bits to the left.) 24 bits is for single precision float. The more correct way to define normalizing is to shift left or right until the number resembles 1.something. If you had 0.001 you would shift left 3; if you had 11.10 you would shift right 1. Shifting the mantissa left decreases your exponent, shifting it right increases it. No different than when we converted from integer to float above.

6) for single precision remove the leading 1. from the mantissa. If the exponent has overflowed then, under the default rounding mode, the result is infinity (exponent bits all ones, fraction zero). If the sign bits were different and you performed an add, then you have to deal with figuring out the result sign bit. If, as above, everything is fine, you just place the sign bit, exponent and mantissa in the result.
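
The six steps can be strung together into one function. This is a minimal sketch, not a conforming IEEE 754 implementation: soft_sub is a made-up name, it assumes both inputs are normal finite numbers, it truncates instead of rounding, and it flushes results that would be denormal to zero.

```c
#include <stdint.h>

/* Sketch of a - b on single-precision bit patterns.
   Assumptions: normal finite inputs, truncation instead of rounding,
   no NaN, infinity or denormal inputs. */
uint32_t soft_sub(uint32_t a, uint32_t b)
{
    /* 1) extract the parts, restoring the implied 1 */
    uint32_t sa = a >> 31, sb = b >> 31;
    int ea = (int)((a >> 23) & 0xFFu), eb = (int)((b >> 23) & 0xFFu);
    uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;
    uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;
    uint32_t sign, m;
    int e, shift;

    /* 2) align: shift the mantissa with the smaller exponent right */
    if (ea >= eb) {
        shift = ea - eb;
        mb = (shift < 32) ? (mb >> shift) : 0u;
        e = ea;
    } else {
        shift = eb - ea;
        ma = (shift < 32) ? (ma >> shift) : 0u;
        e = eb;
    }

    /* 3) same signs: subtract the mantissas; different signs: add */
    if (sa != sb) {
        m = ma + mb;
        sign = sa;
    } else if (ma >= mb) {
        m = ma - mb;
        sign = sa;
    } else {
        m = mb - ma;          /* would have gone negative: swap, flip sign */
        sign = sa ^ 1u;
    }

    /* 4) a zero result is encoded as zero */
    if (m == 0) return 0;

    /* 5) normalize so the leading 1 sits in bit 23 */
    while (m > 0xFFFFFFu)        { m >>= 1; e++; }
    while ((m & 0x800000u) == 0) { m <<= 1; e--; }

    /* 6) handle exponent range, then strip the leading 1 and pack */
    if (e >= 255) return (sign << 31) | 0x7F800000u;  /* infinity */
    if (e <= 0)   return sign << 31;                  /* flushed to zero */
    return (sign << 31) | ((uint32_t)e << 23) | (m & 0x7FFFFFu);
}
```

soft_sub(0x47F10400, 0x45A9C000) gives 0x47E66800, the same as the hand calculation, and swapping the operands gives 0xC7E66800, which is -117968.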

Multiply and divide are different; you asked about subtract, so that is all I covered.
