Cory Nelson - 1 year ago 51

C Question

I have hot spots in my code where I'm doing

`pow()`

My input to

`pow(x,y)`

`pow()`

- I have two constant exponents: 2.4 and 1/2.4.
- When the exponent is 2.4,
*x*will be in the range (0.090473935, 1.0]. - When the exponent is 1/2.4,
*x*will be in the range (0.0031308, 1.0]. - I'm using SSE/AVX vectors. If platform specifics can be taken advantage of, right on!
`float`

A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for

`float`

I'm already using a fast

`pow()`

Answer Source

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).

Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of `log2`

. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.

To complete the `pow`

computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are `cvtdq2ps`

and `cvtps2dq`

.

It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

```
x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
```

`exp2( 127 / p - 127 )`

is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

```
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
```

A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

The key is to match the overestimate with an underestimate, and take the average.

- Compute x^0.8: four instructions, error ~ +3%.
- Compute x^-0.4: one
`rsqrtps`

. (This is quite accurate enough, but does sacrifice the ability to work with zero.) - Compute x^0.4: one
`mulps`

. - Compute x^-0.2: one
`rsqrtps`

. - Compute x^2: one
`mulps`

. - Compute x^3: one
`mulps`

. - x^2.4 = x^2 * x^0.4: one
`mulps`

. This is the overestimate. - x^2.4 = x^3 * x^-0.4 * x^-0.2: two
`mulps`

. This is the underestimate. - Average the above: one
`addps`

, one`mulps`

.

Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.

The final error is about .015%, or 2 orders of magnitude better than the initial `fastpow`

result. The runtime is about a dozen cycles for a busy loop with `volatile`

source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

```
int main() {
__m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
std::printf( "Input: %,vg\n", x0 );
// Approx 5% accuracy from one call. Always an overestimate.
__m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
std::printf( "Direct x^2.4: %,vg\n", x1 );
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
std::printf( "1.38 x^0.8: %,vg\n", xf );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( x0, x0 );
__m128 x3 = _mm_mul_ps( x2, x0 );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
std::printf( "x^2 * x^0.4: %,vg\n", x2 );
std::printf( "x^3 / x^0.6: %,vg\n", x3 );
x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
// Final accuracy about 0.015%, 200x better than x^0.8 calculation.
std::printf( "average = %,vg\n", x2 );
}
```

Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .

I implemented a little test harness and two x^{(5⁄12)} cases corresponding to the above.

```
#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
__m128 pow125_4( __m128 arg ) {
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 x3 = _mm_mul_ps( x2, arg );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}
__m128 pow512_2( __m128 arg ) {
// 5/12 is too small, so compute the sqrt of 10/12 instead.
__m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}
__m128 pow512_4( __m128 arg ) {
// 5/12 is too small, so compute the 4th root of 20/12 instead.
// 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
// weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
__m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
__m128 xover = _mm_mul_ps( arg, xf );
__m128 xfm1 = _mm_rsqrt_ps( xf );
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 xunder = _mm_mul_ps( x2, xfm1 );
// sqrt2 * over + 2 * sqrt2 * under
__m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
_mm_add_ps( xover, xunder ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
return xavg;
}
__m128 mm_succ_ps( __m128 arg ) {
return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}
void test_pow( double p, __m128 (*f)( __m128 ) ) {
__m128 arg;
for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
! isfinite( _mm_cvtss_f32( f( arg ) ) );
arg = mm_succ_ps( arg ) ) ;
for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
arg = mm_succ_ps( arg ) ) ;
std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );
int n;
int const bucket_size = 1 << 25;
do {
float max_error = 0;
double total_error = 0, cum_error = 0;
for ( n = 0; n != bucket_size; ++ n ) {
float result = _mm_cvtss_f32( f( arg ) );
if ( ! isfinite( result ) ) break;
float actual = ::powf( _mm_cvtss_f32( arg ), p );
float error = ( result - actual ) / actual;
cum_error += error;
error = std::abs( error );
max_error = std::max( max_error, error );
total_error += error;
arg = mm_succ_ps( arg );
}
std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
} while ( n == bucket_size );
}
int main() {
std::printf( "4 insn x^12/5:\n" );
test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
std::printf( "14 insn x^12/5:\n" );
test_pow( 12./5, & pow125_4 );
std::printf( "6 insn x^5/12:\n" );
test_pow( 5./12, & pow512_2 );
std::printf( "14 insn x^5/12:\n" );
test_pow( 5./12, & pow512_4 );
}
```

Output:

```
4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
```

I suspect accuracy of the more accurate 5/12 is being limited by the `rsqrt`

operation.