John Am - 2 months ago
C Question

Computation without floats to multiply a long integer (32-bit) by 0.0000000004656f

I'm trying to eliminate all floating point computations in an embedded application and I need to scale/multiply a signed long 32-bit integer by 0.0000000004656f (1/2147483648).

The context is

( pulse[i] * ( triosc[i] * 0.0000000004656f ) )

pulse[i] and triosc[i] are signed long 32-bit integers.

So I need the scaled value to be constrained between -1 and +1 without using floating arithmetic.


saw_x2[i] = (long)( pulse[i] * (triosc[i] * 0.0000000004656f) );
sine_osc[i] = (long)( ((triangle2[i] * (saw_x2[i] * 0.0000000004656f))) *
(pulse[i] * 0.0000000004656f) ) << 2;
return (sine_osc[i]);


The fixed point values in pulse[i] and triosc[i] are signed quantities expressed in units of 2^-31. The mathematical values are pulse[i] / 2^31 and triosc[i] / 2^31. You can add these values directly as long as the sum does not overflow, but multiplying them requires an adjustment: the raw product is in units of 2^-62, so it must be divided by 2^31 to get back to units of 2^-31.

This is approximately what pulse[i] * (triosc[i] * 0.0000000004656f) does, but the constant is not precise enough, since 2^-31 is about 4.6566129e-10. It would be more precise to write pulse[i] * (triosc[i] / 2147483648.F), but the result would still lose precision, because a float has only 23 bits of mantissa (24 significant bits counting the implicit leading one) and cannot represent every 32-bit integer exactly.
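
To see the float limitation concretely, here is a minimal sketch. The helper name round_trip_through_float is made up for illustration, and it assumes IEEE 754 single precision, which is what most toolchains provide but is not guaranteed by C itself.

```c
#include <stdint.h>

/* Hypothetical helper: convert a 32-bit integer to float and back.
   Assumes IEEE 754 single precision (24 significant bits).
   Only call it with magnitudes well below 2^31, otherwise the
   conversion back to int32_t can overflow. */
int32_t round_trip_through_float(int32_t x)
{
    /* volatile forces the value to be stored as a real float,
       so no extra precision survives in a wider register. */
    volatile float f = (float)x;
    return (int32_t)f;
}
```

Under that assumption, 16777216 (2^24) survives the round trip, but 16777217 comes back as 16777216, and 1073741825 (2^30 + 1) comes back as 1073741824. Any 32-bit operand with more than 24 significant bits is rounded like this before the multiplication even starts.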

Performing the multiplication in integer arithmetic with a 64 bit intermediary step is actually more precise.

It can be done this way:

((int64_t)pulse[i] * triosc[i]) >> 31

or equivalently:

((long long)pulse[i] * triosc[i]) >> 31
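
Wrapped in a function, the expression might look like the sketch below. The name q31_mul_shift is hypothetical, not something from the question.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two values held in units of 2^-31.
   The cast widens a to 64 bits before the multiplication, so the
   full product is kept; the shift then removes the extra factor 2^31.
   Caveat: right shifting a negative value is implementation-defined. */
int32_t q31_mul_shift(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;   /* 64-bit intermediate */
    return (int32_t)(product >> 31);    /* back to units of 2^-31 */
}
```

For example, q31_mul_shift(1073741824, 1073741824) multiplies 0.5 by 0.5 and returns 536870912, which is 0.25 in units of 2^-31. No bits of either operand are lost before the final shift.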


You really should use types from <stdint.h> to avoid making assumptions about the size of long. It is 32 bits on your current system, but it may be 64 on the next hardware. Here is how you can rewrite the expressions:

int32_t saw_x2[SIZE];
int32_t pulse[SIZE];
int32_t triosc[SIZE];
int32_t triangle2[SIZE];
int32_t sine_osc[SIZE];


saw_x2[i] = (int32_t)(((int64_t)pulse[i] * triosc[i]) >> 31);
int64_t temp = ((int64_t)triangle2[i] * saw_x2[i]) >> 31;
sine_osc[i] = (int32_t)(((temp * pulse[i]) >> 31) << 2);  
return sine_osc[i];

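Note however that if any of these values becomes negative, right shifting is not guaranteed to produce the correct result: the C standard leaves the result of right shifting a negative signed value implementation-defined. Dividing by 2147483648 is the portable method, but it may produce less efficient code, and it rounds toward zero where an arithmetic shift rounds toward negative infinity: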

saw_x2[i] = (int32_t)((int64_t)pulse[i] * triosc[i] / 2147483648);
int64_t temp = (int64_t)triangle2[i] * saw_x2[i] / 2147483648;
/* multiply by 4 instead of << 2: left shifting a negative value is undefined in C */
sine_osc[i] = (int32_t)((temp * pulse[i] / 2147483648) * 4);
return sine_osc[i];
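
As a rough illustration of the division variant on its own, here is a small sketch. The name q31_mul_div is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: scale the 64-bit product with a division,
   which is fully defined for negative operands and truncates
   toward zero.
   Note: a == b == INT32_MIN gives 2^31, which does not fit in
   int32_t, so that one input pair has to be avoided or clamped. */
int32_t q31_mul_div(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;
    return (int32_t)(product / 2147483648LL);
}
```

q31_mul_div(-3, 1073741824) returns -1, that is -1.5 truncated toward zero. On a compiler that implements right shift as an arithmetic shift, shifting the same product by 31 would give -2 instead, so the two methods can differ by one unit for negative results.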

Also, since you multiply by 4 in the last step, you would get 2 more bits of precision by dividing by 2^29 (536870912) instead:

sine_osc[i] = (int32_t)(temp * pulse[i] / 536870912);
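
The gain can be checked with a small sketch comparing the two forms of the last step. Both function names are hypothetical, and temp is assumed to fit in 32 bits (as it does when it comes from the previous scaling step), so the 64-bit product cannot overflow.

```c
#include <stdint.h>

/* Divide by 2^31, then multiply by 4: the two low bits of the
   result are always zero. */
int32_t last_step_div31_times4(int64_t temp, int32_t pulse)
{
    return (int32_t)((temp * pulse / 2147483648LL) * 4);
}

/* Divide by 2^29 directly: same scale, but the two low-order
   bits of the result carry real information. */
int32_t last_step_div29(int64_t temp, int32_t pulse)
{
    return (int32_t)(temp * pulse / 536870912LL);
}
```

With temp = 268435456 (2^28) and pulse = 1073741827 (2^30 + 3), the first version returns 536870912 while the second returns 536870913.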