using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
I have tried different typecasting of
double value = 0.7;
int scaleFactor = 1000;
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
oss << scaledvalue2;
printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
// scaledvalue_a = 699
// scaledvalue_b = 700
convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700
vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz
This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.
The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.
double f(); double g(); double d = f() + g(); // 1 double dd1 = 1.6 * d; // 2 double dd2 = 1.6 * (f() + g()); // 3
On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though
float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.
But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable
d. Then the value of
dd1, calculated in line //2, must be computed using the value that was stored into
d, i.e., a 64-bit value, while the value of
dd2, calculated in line //3, can be calculated using
f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of
dd1 might be different from the value of
And often the compiler will hang on to the 80-bit value of
f() + g() and use that instead of the value stored in
d when it calculates the value of
dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.