Mathieu Garaud - 4 months ago
C++ Question

C++11 numeric_limits<float> and arithmetic

When I compiled and ran this C++ code, I did not expect to see the output below:

#include <iostream>
#include <iomanip>
#include <limits>

int main() {
    const long double ldMinFloat = std::numeric_limits<float>::lowest();
    std::cout << std::left << std::setw(20) << "ldMinFloat" << "= " << std::fixed << ldMinFloat << std::endl;
    std::cout << std::left << std::setw(20) << "(ldMinFloat - 10)" << "= " << std::fixed << (ldMinFloat - 10) << std::endl;
    return 0;
}


Here is the output

ldMinFloat          = -340282346638528859811704183484516925440.000000
(ldMinFloat - 10)   = -340282346638528859811704183484516925440.000000


Can someone be kind enough to explain why the result of the subtraction is not -340282346638528859811704183484516925450.000000?

According to this link, the maximum long double value is +/- 1.7976931348623157 * 10^308, so I don't really understand why the mantissa would explain this behaviour in what looks like basic integer arithmetic. Or is it the implicit conversion from float to long double? Or the operator<< of std::cout?

Any idea to help me feel less stupid before going to sleep?

Answer

A long double cannot represent most values exactly. Here you are dealing with values of very large magnitude (around std::numeric_limits<float>::max()), where the gaps between consecutive values that a long double can represent exactly are large.

Check the epsilon for long double, which is the difference between 1.0 and the smallest value greater than 1.0 that a long double can represent.

If you want to estimate the distance between ldMinFloat and the largest value below it that a long double can store, you can use the following approximation:

std::abs(ldMinFloat) * std::numeric_limits<long double>::epsilon()

On my computer this is 36893485948395847680, so a long double cannot distinguish 340282346638528859811704183484516925440 from values within approximately +/- 36893485948395847680 of it, even though it can store values far below this.
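The approximation above can be compared with the exact gap using std::nextafter. This is only a sketch: the helper names estimated_gap, gap_below and subtraction_is_absorbed are made up for illustration, and the numbers you get depend on how your platform implements long double.

```cpp
#include <cmath>
#include <limits>

// Estimated gap around v: |v| * epsilon (the approximation given above).
long double estimated_gap(long double v) {
    return std::abs(v) * std::numeric_limits<long double>::epsilon();
}

// Exact distance between v and the next representable long double
// in the direction of -infinity.
long double gap_below(long double v) {
    const long double inf = std::numeric_limits<long double>::infinity();
    return v - std::nextafter(v, -inf);
}

// True when subtracting delta from v rounds back to v itself.
bool subtraction_is_absorbed(long double v, long double delta) {
    return (v - delta) == v;
}
```

Calling subtraction_is_absorbed(std::numeric_limits<float>::lowest(), 10) should return true on common platforms, which is the behaviour seen in the question. The estimate is an upper bound that can be up to about twice the exact gap.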


A more precise computation of the next representable value:

Assuming a 32-bit float and a 64-bit double (I do not have a 96-bit long double to test with), both using the IEEE 754 representation:

The lowest float (-340282346638528859811704183484516925440) has the following binary representation:

1 11111110 11111111111111111111111

Converted to a double:

1 10001111110 1111111111111111111111100000000000000000000000000000
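The bit patterns above can be reproduced with a short sketch like the one below. It assumes a 32-bit float and a 64-bit double in IEEE 754 format; float_bits and double_bits are hypothetical helper names, not standard functions.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

// Returns "sign exponent mantissa" for a 32-bit IEEE 754 float.
std::string float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t raw;
    std::memcpy(&raw, &f, sizeof raw);  // copy the raw bits without aliasing issues
    const std::string s = std::bitset<32>(raw).to_string();
    return s.substr(0, 1) + " " + s.substr(1, 8) + " " + s.substr(9);
}

// Returns "sign exponent mantissa" for a 64-bit IEEE 754 double.
std::string double_bits(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t raw;
    std::memcpy(&raw, &d, sizeof raw);
    const std::string s = std::bitset<64>(raw).to_string();
    return s.substr(0, 1) + " " + s.substr(1, 11) + " " + s.substr(12);
}
```

Passing std::numeric_limits<float>::lowest() to float_bits, and the same value converted to double to double_bits, should print the two patterns shown above.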

The first representable number for a double below this is (just add 1 to the mantissa, and luckily it is easy for this number):

1 10001111110 1111111111111111111111100000000000000000000000000001

Which is exactly -340282346638528897590636046441678635008. The difference between both values (computed in the code) is:

37778931862957161709568 // About half the value of the approximation (using double)
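One way to obtain this neighbour and the difference without touching the bits is std::nextafter. The sketch below assumes a 64-bit IEEE 754 double; the function names are illustrative only.

```cpp
#include <cmath>
#include <limits>

// The lowest float widened to double (exact: every float value is also a double).
double lowest_float_as_double() {
    return static_cast<double>(std::numeric_limits<float>::lowest());
}

// Next representable double below v, that is, toward -infinity.
double next_below(double v) {
    return std::nextafter(v, -std::numeric_limits<double>::infinity());
}

// Distance between v and its lower neighbour.
double double_gap_below(double v) {
    return v - next_below(v);
}
```

Printing double_gap_below(lowest_float_as_double()) with std::fixed should give the difference quoted above, since that difference is a power of two and therefore exactly representable.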

How to compute this difference from ldMinFloat?

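You can compute this difference from the binary representation. With IEEE 754, the value (ignoring the sign) is given by the formula below, where E is the stored exponent, shift is the exponent bias, and M is the mantissa including its implicit leading 1: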

V = 2 ^ (E - shift) * M

Here, the exponent E is the same for both values, so (V1 is ldMinFloat and V2 is the next representable value; I am assuming positive values, since the sign does not matter here):

V2 - V1 = 2 ^ (E - shift) * M2 - 2 ^ (E - shift) * M1
        = 2 ^ (E - shift) * (M2 - M1)

E is 1150 in the above (10001111110) and the shift for a 64-bit double is 1023, so E - shift = 127:

V2 - V1 = 2 ^ 127 * (M2 - M1)

Here we are "lucky" because the last bit in M1 (mantissa of ldMinFloat) is 0, so the difference between M1 and M2 is:

M2 - M1 = 0.000...001b
//          <-------> 52 bits (51 zeros)

So the difference is:

V2 - V1 = (2 ^ 127) * 0.000...001b
        = (2 ^ 127) >> 52
        = 37778931862957161709568
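The same derivation can be sketched in code by reading the exponent field from the raw bits and scaling by 2^-52. This assumes a 64-bit IEEE 754 double and a normal (not subnormal, infinite or NaN) input; biased_exponent and ulp_from_exponent are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Stored (biased) exponent field E, 11 bits wide, of a 64-bit IEEE 754 double.
int biased_exponent(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t raw;
    std::memcpy(&raw, &d, sizeof raw);
    return static_cast<int>((raw >> 52) & 0x7FF);
}

// Spacing between neighbouring doubles that share d's exponent:
// 2^(E - shift) * 2^-52. Only meaningful for normal values of d.
double ulp_from_exponent(double d) {
    const int shift = 1023;         // exponent bias of a 64-bit double
    const int mantissa_bits = 52;   // explicit mantissa bits
    return std::ldexp(1.0, biased_exponent(d) - shift - mantissa_bits);
}
```

For the lowest float converted to double, ulp_from_exponent should return 2^75, which matches the value computed by hand.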

Note: This all works smoothly because the last bit of the mantissa of ldMinFloat is 0. If it were 1, adding 1 to the mantissa would propagate a carry and could even change the exponent, so the computation would be harder.
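When working on the raw bits, the carry case is less painful than it sounds: incrementing the whole 64-bit pattern as an unsigned integer lets a carry out of the mantissa flow into the exponent field. The sketch below relies on that, assuming a 64-bit IEEE 754 double and a finite, non-NaN input; next_away_from_zero is a made-up helper, and std::nextafter remains the portable way to do this.

```cpp
#include <cstdint>
#include <cstring>

// Next representable double with a larger magnitude than d.
// Assumption: d is finite and not NaN; the sign bit is left untouched.
double next_away_from_zero(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t raw;
    std::memcpy(&raw, &d, sizeof raw);
    ++raw;  // a carry out of the mantissa bits increments the exponent field
    std::memcpy(&d, &raw, sizeof d);
    return d;
}
```

For example, starting from the largest double below 2.0 (mantissa bits all 1), the increment carries into the exponent and the result is exactly 2.0.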