Potatoswatter - 7 months ago
C Question

Round floating-point value to e.g. single precision

C and C++ provide floating-point data types of several widths, but they leave precision unspecified. The compiler is free to use idealized arithmetic to simplify expressions, to use double precision in computing an expression over float values, or to use a double-precision register to keep the value of a float variable or common subexpression.

Correct me if I'm wrong, but it's even legal to hoist a float in memory into a double-precision register, so storing a value and then loading it back doesn't necessarily truncate bits.

What is the safest, most portable way to convert a number to a lower precision? Ideally, it should be efficient too, compiling to cvtsd2ss on SSE2. (So, while volatile may be an answer, I'd prefer something better.)

Edit: Summarizing some of the comments and findings…

  • Wider precision for intermediate results is always fair game.

  • Expression simplification is allowed in C++, and in C given

  • Using double precision for a single-precision float
    is not allowed (in C or C++).

However, some compilers (particularly GCC on x86-32) skip some of these precision conversions, in violation of the standard.
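To see how much extra precision a given implementation claims to use for intermediate results, C99's FLT_EVAL_METHOD macro from <float.h> can be inspected. The following is only a minimal sketch; the helper name eval_method_description is hypothetical, and the strings paraphrase the meanings C99 assigns to the values 0, 1 and 2.

```c
#include <float.h>

/* Hypothetical helper: describe how this implementation evaluates
   floating-point expressions, based on C99's FLT_EVAL_METHOD. */
static const char *eval_method_description(void)
{
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
    case 0:
        return "operations are evaluated in the range and precision of their type";
    case 1:
        return "float and double operations are evaluated as double";
    case 2:
        return "all operations are evaluated as long double";
    default:
        /* -1 means indeterminable; other negative values are
           implementation-defined. */
        return "indeterminable or implementation-defined";
    }
#else
    /* Assumption: a pre-C99 <float.h> that lacks the macro. */
    return "FLT_EVAL_METHOD is not defined by this <float.h>";
#endif
}
```

A value other than 0 means intermediate results may carry more precision than their type suggests, which is the situation the question is about.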


The C99 standard explicitly says that

assignment and cast [..] remove all extra range and precision

So, if you want to limit the range and precision to that of a float, just cast to float, or assign to a float variable.

You can even write something like (double)((float)d) (with extra parentheses to make sure humans read it correctly), which limits a variable d to float precision and range and then converts it back to double. (A standard-conforming C compiler is NOT allowed to optimize that away even if d is a double; it must limit the precision and range to that of a float.)
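As a small illustration of the cast described above, here is a sketch of a helper that rounds a double to float precision and widens it again. The name round_to_float is made up for this example, and the comments assume IEEE 754 float and double with the default round-to-nearest mode.

```c
/* Hypothetical helper: limit d to the range and precision of float,
   then convert the rounded value back to double. */
static double round_to_float(double d)
{
    /* Per C99, the cast to float must discard any extra range and
       precision, so this cannot be folded away as a no-op. */
    return (double)((float)d);
}
```

Under those assumptions, a value such as 0.5 passes through unchanged because it is exactly representable as a float, while 0.1 comes back slightly different because its float and double approximations are not the same number.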

I've used this in practical implementations of e.g. the Kahan summation algorithm, where it lets the C compiler optimize very aggressively without the risk of the compensation step being invalidated.
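The original code is not shown here, so the following is only a sketch of how such casts might be placed in a single-precision Kahan summation. The function name kahan_sum and its signature are assumptions, and it presumes the compiler runs in a standard-conforming mode (for instance, without options like GCC's -ffast-math that permit reassociation).

```c
#include <stddef.h>

/* Hypothetical example: Kahan (compensated) summation of n floats.
   Each cast forces the intermediate result down to float precision,
   so the compensation term stays meaningful even if the compiler
   evaluates the arithmetic in a wider format. */
static float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;                 /* running compensation */

    for (size_t i = 0; i < n; i++) {
        float y = (float)(x[i] - c);        /* corrected next term */
        float t = (float)(sum + y);         /* low bits of y may be lost here */
        c = (float)((float)(t - sum) - y);  /* recover what was lost */
        sum = t;
    }
    return sum;
}
```

For example, with IEEE 754 floats, summing 16777216.0f, 1.0f and 1.0f this way gives 16777218.0f, whereas a plain float accumulator would stay at 16777216.0f.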