Ratman - 8 months ago

C++ Question

using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I have tried different typecasting of `scaledvalue2` (`double`, `int`).

I know double precision (0.7 is actually stored as 0.6999999999999999555910790149937383830547332763671875) is an issue, but I don't understand why one way is OK and the other is not.

I would expect both to fail if precision is a problem.

I DON'T NEED a solution to fix it, just a WHY.

(the problem IS fixed)

```
#include <cstdio>
#include <sstream>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;

    double doubleScaled = (double)scaleFactor * value;
    int scaledvalue1 = doubleScaled;                            // = 700
    int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
    int scaledvalue3 = (double)(1000.0 * 0.7);                  // = 700

    std::ostringstream oss;
    oss << scaledvalue2;

    printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
           value, scaleFactor, doubleScaled, scaledvalue1, scaledvalue2, scaledvalue3, oss.str().c_str());
}
```

or in short:

```
double value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
```

I can't figure out what is going wrong here.

Output :

`convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]`

vendor_id : GenuineIntel

cpu family : 6

model : 54

model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz

Answer

This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.

The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.

```
double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3
```

On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations on the x87 floating-point unit are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though `double` and `float` are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.

But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable `d`. Then the value of `dd1`, calculated in line //2, must be computed using the value that was stored into `d`, i.e., a 64-bit value, while the value of `dd2`, calculated in line //3, can be calculated using `f() + g()`, i.e., a full 80-bit value. Those extra bits can make a difference, and the value of `dd1` might be different from the value of `dd2`.

And often the compiler will hang on to the 80-bit value of `f() + g()` and use that instead of the value stored in `d` when it calculates the value of `dd1`. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>

For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.

Source (Stackoverflow)