Cory Nelson - 1 month ago 3
C Question

# Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing

`pow()`
taking up around 10-20% of my execution time.

My input to
`pow(x,y)`
is very specific, so I'm wondering if there's a way to roll two
`pow()`
approximations (one for each exponent) with higher performance:

• I have two constant exponents: 2.4 and 1/2.4.

• When the exponent is 2.4, x will be in the range (0.090473935, 1.0].

• When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].

• I'm using SSE/AVX
`float`
vectors. If platform specifics can be taken advantage of, right on!

A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for
`float`
) algorithms as well.

`pow()`
approximation, but it doesn't take these constraints into account. Is it possible to do better?

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).

Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of `log2`. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.

To complete the `pow` computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are `cvtdq2ps` and `cvtps2dq`.

It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

``````x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
``````

`exp2( 127 / p - 127 )` is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

``````template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
//      std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//      std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//      std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "result = %,vg\n", ret );
return ret;
}
``````

A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

The key is to match the overestimate with an underestimate, and take the average.

• Compute x^0.8: four instructions, error ~ +3%.
• Compute x^-0.4: one `rsqrtps`. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
• Compute x^0.4: one `mulps`.
• Compute x^-0.2: one `rsqrtps`.
• Compute x^2: one `mulps`.
• Compute x^3: one `mulps`.
• x^2.4 = x^2 * x^0.4: one `mulps`. This is the overestimate.
• x^2.4 = x^3 * x^-0.4 * x^-0.2: two `mulps`. This is the underestimate.
• Average the above: one `addps`, one `mulps`.

Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.

The final error is about .015%, or 2 orders of magnitude better than the initial `fastpow` result. The runtime is about a dozen cycles for a busy loop with `volatile` source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

``````int main() {
__m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
std::printf( "Input: %,vg\n", x0 );

// Approx 5% accuracy from one call. Always an overestimate.
__m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
std::printf( "Direct x^2.4: %,vg\n", x1 );

// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
std::printf( "1.38 x^0.8: %,vg\n", xf );

// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );

// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( x0, x0 );
__m128 x3 = _mm_mul_ps( x2, x0 );

// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );

// Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );

std::printf( "x^2 * x^0.4: %,vg\n", x2 );
std::printf( "x^3 / x^0.6: %,vg\n", x3 );
x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
// Final accuracy about 0.015%, 200x better than x^0.8 calculation.
std::printf( "average = %,vg\n", x2 );
}
``````

Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .

## Update with stats

I implemented a little test harness and two x(512) cases corresponding to the above.

``````#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
//  std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//  std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//  std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//  std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//  std::printf( "result = %,vg\n", ret );
return ret;
}

__m128 pow125_4( __m128 arg ) {
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );

// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );

// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 x3 = _mm_mul_ps( x2, arg );

// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );

// Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );

return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}

__m128 pow512_2( __m128 arg ) {
// 5/12 is too small, so compute the sqrt of 10/12 instead.
__m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}

__m128 pow512_4( __m128 arg ) {
// 5/12 is too small, so compute the 4th root of 20/12 instead.
// 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
// weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
__m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
__m128 xover = _mm_mul_ps( arg, xf );

__m128 xfm1 = _mm_rsqrt_ps( xf );
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 xunder = _mm_mul_ps( x2, xfm1 );

// sqrt2 * over + 2 * sqrt2 * under
__m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),

xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
return xavg;
}

__m128 mm_succ_ps( __m128 arg ) {
return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}

void test_pow( double p, __m128 (*f)( __m128 ) ) {
__m128 arg;

for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
! isfinite( _mm_cvtss_f32( f( arg ) ) );
arg = mm_succ_ps( arg ) ) ;

for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
arg = mm_succ_ps( arg ) ) ;

std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );

int n;
int const bucket_size = 1 << 25;
do {
float max_error = 0;
double total_error = 0, cum_error = 0;
for ( n = 0; n != bucket_size; ++ n ) {
float result = _mm_cvtss_f32( f( arg ) );

if ( ! isfinite( result ) ) break;

float actual = ::powf( _mm_cvtss_f32( arg ), p );

float error = ( result - actual ) / actual;
cum_error += error;
error = std::abs( error );
max_error = std::max( max_error, error );
total_error += error;

arg = mm_succ_ps( arg );
}

std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
} while ( n == bucket_size );
}

int main() {
std::printf( "4 insn x^12/5:\n" );
test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
std::printf( "14 insn x^12/5:\n" );
test_pow( 12./5, & pow125_4 );
std::printf( "6 insn x^5/12:\n" );
test_pow( 5./12, & pow512_2 );
std::printf( "14 insn x^5/12:\n" );
test_pow( 5./12, & pow512_4 );
}
``````

Output:

``````4 insn x^12/5:
Domain from 1.36909e-23
error max =      inf    avg =      inf  |avg| =      inf    to 8.97249e-19
error max =  2267.14    avg =  139.175  |avg| =  139.193    to 5.88021e-14
error max = 0.123606    avg = -0.000102963  |avg| = 0.0371122   to 3.85365e-09
error max = 0.123607    avg = -0.000108978  |avg| = 0.0368548   to 0.000252553
error max =  0.12361    avg = 7.28909e-05   |avg| = 0.037507    to  16.5513
error max = 0.123612    avg = -0.000258619  |avg| = 0.0365618   to 1.08471e+06
error max = 0.123611    avg = 8.70966e-05   |avg| = 0.0374369   to 7.10874e+10
error max =  0.12361    avg = -0.000103047  |avg| = 0.0371122   to 4.65878e+15
error max = 0.123609    avg =      nan  |avg| =      nan    to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max =      inf    avg =      nan  |avg| =      nan    to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05    |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05   |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06    |avg| = 0.000129923 to  2.63411
error max = 0.000787589 avg = 1.20745e-05   |avg| = 0.000129347 to   172629
error max = 0.000786553 avg = 1.62351e-05   |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06   |avg| = 0.00013037  to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339   avg = 0.000441158   |avg| = 0.00967327  to 6.46235e-27
error max = 0.0284342   avg = -5.79938e-06  |avg| = 0.00897913  to 4.23516e-22
error max = 0.0284341   avg = -0.000140706  |avg| = 0.00897084  to 2.77556e-17
error max = 0.028434    avg = 0.000440504   |avg| = 0.00967325  to 1.81899e-12
error max = 0.0284339   avg = -6.11153e-06  |avg| = 0.00897915  to 1.19209e-07
error max = 0.0284298   avg = -0.000140597  |avg| = 0.00897084  to 0.0078125
error max = 0.0284371   avg = 0.000439748   |avg| = 0.00967319  to      512
error max = 0.028437    avg = -7.74294e-06  |avg| = 0.00897924  to 3.35544e+07
error max = 0.0284369   avg = -0.000142036  |avg| = 0.00897089  to 2.19902e+12
error max = 0.0284368   avg = 0.000439183   |avg| = 0.0096732   to 1.44115e+17
error max = 0.0284367   avg = -7.41244e-06  |avg| = 0.00897923  to 9.44473e+21
error max = 0.0284366   avg = -0.000141706  |avg| = 0.00897088  to 6.1897e+26
error max = 0.485129    avg = -0.0401671    |avg| = 0.048422    to 4.05648e+31
error max = 0.994932    avg = -0.891494 |avg| = 0.891494    to 2.65846e+36
error max = 0.999329    avg =      nan  |avg| =      nan    to       -0
14 insn x^5/12:
Domain from 2.64698e-23
error max =  0.13556    avg = 0.00125936    |avg| = 0.00354677  to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06   |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06  |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06    |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06   |avg| = 0.000113713 to       32
error max = 0.000565453 avg = -1.61276e-06  |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06   |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06   |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06  |avg| = 0.000112415 to 1.84467e+19
``````

I suspect accuracy of the more accurate 5/12 is being limited by the `rsqrt` operation.