Mathieu Garaud - 8 months ago 31

C++ Question

When I compiled this C++ code I didn't expect to see this output

```
#include <iostream>
#include <iomanip>
#include <limits>

int main() {
    const long double ldMinFloat = std::numeric_limits<float>::lowest();
    std::cout << std::left << std::setw(20) << "ldMinFloat" << "= " << std::fixed << ldMinFloat << std::endl;
    std::cout << std::left << std::setw(20) << "(ldMinFloat - 10)" << "= " << std::fixed << (ldMinFloat - 10) << std::endl;
    return 0;
}
```

Here is the output

```
ldMinFloat          = -340282346638528859811704183484516925440.000000
(ldMinFloat - 10)   = -340282346638528859811704183484516925440.000000
```

Can someone be kind enough to explain why the subtraction is not -3402823466385288598117041834845169254

Based on this link, the long double max value is +/- 1.797,693,134,862,315,7*10^308, and I don't really understand why the mantissa would explain this behaviour in basic integer arithmetic. Or is it the implicit conversion from float to long double? Or is it the operator << of std::cout?

Any idea to help me feel less stupid before going to sleep?

Answer

A `long double` cannot represent every value exactly. Here you are dealing with a very large magnitude (that of `std::numeric_limits<float>::max()`), where there are big gaps between the values that are **exactly** representable by a `long double`.

Check the `epsilon` for `long double`, which is the difference between `1.0` and the smallest value greater than `1.0` that a `long double` can represent.

If you want to find the difference between `ldMinFloat` and the largest value lower than `ldMinFloat` that a `long double` can store, you can use the below **approximation**:

```
std::abs(ldMinFloat) * std::numeric_limits<long double>::epsilon()
```

This is (on my computer) `36893485948395847680`, so a `long double` cannot differentiate values between `340282346638528859811704183484516925440` and `340282346638528859811704183484516925440 +/- 36893485948395847680` (approximately...), even though it can store values well below this. Subtracting `10` is far smaller than that gap, so the result rounds back to `ldMinFloat`.

**A more precise computation of the next representable value:**

Assuming a 32-bit `float` and a 64-bit `double` (I do not have a 96-bit `long double` to test...), both using the IEEE 754 representation:

The lowest float (`-340282346638528859811704183484516925440`) has the following binary representation:

```
1 11111110 11111111111111111111111
```

Converted to a `double`:

```
1 10001111110 1111111111111111111111100000000000000000000000000000
```

The first representable number for a double below this is (just add 1 to the mantissa, and luckily it is easy for this number):

```
1 10001111110 1111111111111111111111100000000000000000000000000001
```

Which is exactly `-340282346638528897590636046441678635008`. The difference between both values (computed in the code) is:

```
37778931862957161709568 // About half the value of the approximation (using double)
```

**How to compute this difference from ldMinFloat?**

You can compute this difference using the binary representation. You know that with IEEE754 the "conversion" is (without sign):

```
V = 2 ^ (E - shift) * M
```

Here, the exponent `E` is the same for both values, so (`V1` is `ldMinFloat` and `V2` is the next representable value; I am assuming positive values here, since the sign does not matter):

```
V2 - V1 = 2 ^ (E - shift) * M2 - 2 ^ (E - shift) * M1
        = 2 ^ (E - shift) * (M2 - M1)
```

`E` is `1150` in the above (`10001111110`) and the shift for a 64-bit `double` is `1023`, so `E - shift = 127`:

```
V2 - V1 = 2 ^ 127 * (M2 - M1)
```

Here we are "lucky" because the last bit in `M1` (mantissa of `ldMinFloat`) is `0`, so the difference between `M1` and `M2` is:

```
M2 - M1 = 0.000...001b
//          <-------> 52 bits (51 zeros)
```

So the difference is:

```
V2 - V1 = (2 ^ 127) * 0.000...001b
        = (2 ^ 127) >> 52
        = 37778931862957161709568
```

*Note:* This all works smoothly because the last bit of the mantissa in `ldMinFloat` was `0`. If that were not the case, adding `1` to this mantissa could propagate a carry and even change the exponent, so the computation would be harder.