How to subtract IEEE 754 numbers?

It's really no different from how you do it with pencil and paper. Okay, a little different: 123400 – 5432 = 1.234*10^5 – 5.432*10^3. The bigger number dominates: shift the smaller number’s mantissa off into the bit bucket until the exponents match, giving 1.234*10^5 – 0.05432*10^5, then perform the subtraction with the mantissas: 1.234 – 0.05432 = …
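The align-then-subtract step can be sketched with a toy decimal format. This is only an illustration, not how any particular FPU does it: the `toy_dec` type and `toy_sub` function are hypothetical names, the mantissa is a plain integer, operands are assumed non-negative, and digits shifted out of the smaller operand are simply dropped (no guard digits, no rounding, no renormalization).

```c
#include <assert.h>
#include <stdint.h>

/* Toy decimal float (hypothetical): value = mant * 10^exp. */
typedef struct {
    int32_t mant;
    int exp;
} toy_dec;

/* Compute a - b for non-negative operands.
   The operand with the smaller exponent has its mantissa shifted right
   (divided by 10) until the exponents match; the digits shifted out are lost. */
static toy_dec toy_sub(toy_dec a, toy_dec b)
{
    assert(a.mant >= 0 && b.mant >= 0);

    while (b.exp < a.exp) { b.mant /= 10; b.exp++; }
    while (a.exp < b.exp) { a.mant /= 10; a.exp++; }

    /* Exponents now agree, so the mantissas can be subtracted directly. */
    toy_dec r = { a.mant - b.mant, a.exp };
    return r;
}
```

With 123400 stored as {1234, 2} and 5432 stored as {5432, 0}, toy_sub returns {1180, 2}, i.e. 118000 rather than the exact 117968, because the digits 3 and 2 of the smaller mantissa went into the bit bucket. Real hardware keeps a few extra guard bits during the shift and rounds at the end, so it loses less than this sketch does.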

Converting Int to Float or Float to Int using Bitwise operations (software floating point)

First, a paper you should consider reading if you want to understand floating point foibles better: “What Every Computer Scientist Should Know About Floating-Point Arithmetic” (http://www.validlab.com/goldberg/paper.pdf). And now to some meat. The following code is bare-bones, and attempts to produce an IEEE-754 single-precision float from an unsigned int in the range 0 …
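Since the answer's own code is cut off in this excerpt, here is a separate minimal sketch of the same idea, not the original author's code. It assumes a 32-bit unsigned input, truncates low bits that do not fit in the 24-bit significand instead of rounding, and assumes the host `float` is IEEE-754 binary32 when the bits are reinterpreted. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the IEEE-754 single-precision bit pattern for an unsigned 32-bit
   integer. Bits below the 24-bit significand are truncated, not rounded. */
static uint32_t u32_to_float_bits(uint32_t x)
{
    if (x == 0)
        return 0;                          /* +0.0 is all zero bits */

    int msb = 31;                          /* index of the highest set bit */
    while (!(x >> msb))
        msb--;

    /* Shift so the leading 1 lands on bit 23 (the implicit bit). */
    uint32_t sig = (msb > 23) ? (x >> (msb - 23)) : (x << (23 - msb));

    uint32_t biased_exp = (uint32_t)(msb + 127);
    return (biased_exp << 23) | (sig & 0x007FFFFFu);   /* drop implicit 1 */
}

/* Reinterpret a bit pattern as a float (assumes float is 32 bits wide). */
static float bits_to_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, u32_to_float_bits(1) yields 0x3F800000, the pattern for 1.0f. An input of 0xFFFFFFFF yields 0x4F7FFFFF because of the truncation, whereas a compiler cast using round-to-nearest would produce 0x4F800000.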

Uses for negative zero floating point value?

From Wikipedia: It is claimed that the inclusion of signed zero in IEEE 754 makes it much easier to achieve numerical accuracy in some critical problems[1], in particular when computing with complex elementary functions[2]. The first reference is “Branch Cuts for Complex Elementary Functions or Much Ado About Nothing’s Sign Bit” by W. Kahan, that …
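To see why the sign of zero can carry information at all, here is a small sketch. It assumes IEEE-754 semantics for `double` (in particular that dividing by a zero yields an infinity rather than trapping), and the helper names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* True only for a zero whose sign bit is set. */
static int is_negative_zero(double x)
{
    return x == 0.0 && signbit(x);
}

/* Assumes IEEE-754 semantics: 1/+0 is +infinity and 1/-0 is -infinity. */
static double reciprocal(double x)
{
    return 1.0 / x;
}

static void signed_zero_demo(void)
{
    double pz = 0.0, nz = -0.0;

    assert(pz == nz);                       /* the two zeros compare equal */
    assert(!is_negative_zero(pz));
    assert(is_negative_zero(nz));           /* but the sign bit differs */
    assert(reciprocal(pz) == INFINITY);     /* and it survives division */
    assert(reciprocal(nz) == -INFINITY);
}
```

The two zeros compare equal, yet they remember which side a value underflowed or approached from. That remembered side is the kind of information a function with a branch cut can use to pick the right result.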

Is it safe to assume floating point is represented using IEEE754 floats in C?

Essentially all architectures in current non-punch-card use, including embedded architectures and exotic signal processing architectures, offer one of two floating point systems: (1) IEEE-754, or (2) IEEE-754 except for blah. That is, they mostly implement 754, but cheap out on some of the more expensive and/or fiddly bits. The most common cheap-outs: Flushing denormals to zero. This invalidates …
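If you would rather check than assume, a rough probe along these lines is possible. It is a sketch under stated assumptions: the `<float.h>` parameters only show that the formats look like binary32/binary64, they do not prove full conformance, and the function names are invented for illustration. The subnormal probe reports what the current floating point mode does at run time.

```c
#include <assert.h>
#include <float.h>

/* 1 if float and double have the binary32/binary64 format parameters. */
static int looks_like_ieee754(void)
{
    return FLT_RADIX == 2
        && sizeof(float) == 4 && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128
        && sizeof(double) == 8 && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024;
}

/* 1 if halving the smallest normal float gives a nonzero (subnormal) value,
   0 if the result was flushed to zero. volatile keeps the division from
   being folded away at compile time. */
static int keeps_subnormals(void)
{
    volatile float tiny = FLT_MIN;
    tiny = tiny / 2.0f;
    return tiny > 0.0f;
}

/* Fail fast at startup if the format assumption does not hold. */
static void require_ieee754_formats(void)
{
    assert(looks_like_ieee754());
}
```

A C implementation may also define the macro __STDC_IEC_559__ to claim Annex F (IEC 60559) support, but some compilers are conservative about defining it, so its absence does not by itself mean the formats are non-IEEE.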