Chris_F - 1 year ago 131

C++ Question

Does anyone know why GCC/Clang will not optimize function *test1* in the code sample below to use just the RCPPS instruction when the fast-math option is enabled? Is there another compiler flag that would generate this code?

```cpp
typedef float float4 __attribute__((vector_size(16)));

float4 test1(float4 v)
{
    return 1.0f / v;
}
```

You can see the compiled output here: https://goo.gl/jXsqat


Answer Source

Because the precision of `RCPPS` is a *lot* lower than `float` division.

An option to enable that optimization would not be appropriate as part of `-ffast-math`.
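To get a feel for the precision gap, here is a minimal sketch that compares the bare `RCPPS` estimate (through the `_mm_rcp_ps` intrinsic) with a true division. Intel documents the estimate's relative error as at most 1.5 × 2^-12, so it carries roughly 12 good bits against the 24-bit significand of `float`. The helper names `max_rel_error` and `rcpps_error` are made up for this illustration, and the code assumes an x86 target with SSE.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_rcp_ps, _mm_div_ps
#include <cassert>
#include <cmath>

// Hypothetical helper: largest per-lane relative error of approx vs. exact.
static float max_rel_error(__m128 approx, __m128 exact) {
    float a[4], e[4];
    _mm_storeu_ps(a, approx);
    _mm_storeu_ps(e, exact);
    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float err = std::fabs(a[i] - e[i]) / std::fabs(e[i]);
        if (err > worst) worst = err;
    }
    return worst;
}

// Hypothetical helper: relative error of the raw RCPPS estimate for four
// nonzero inputs, measured against a real DIVPS.
float rcpps_error(float a, float b, float c, float d) {
    assert(a != 0.0f && b != 0.0f && c != 0.0f && d != 0.0f);
    __m128 v = _mm_setr_ps(a, b, c, d);
    __m128 estimate = _mm_rcp_ps(v);                  // RCPPS
    __m128 exact = _mm_div_ps(_mm_set1_ps(1.0f), v);  // DIVPS
    return max_rel_error(estimate, exact);
}
```

A call such as `rcpps_error(3.0f, 7.0f, 1.3f, 123.456f)` can return anything up to about 3.7e-4 on real hardware, whereas a correctly rounded `float` division has a worst-case relative error of roughly 6e-8.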

The x86 target options section of the GCC manual says there is in fact an option that (combined with `-ffast-math`) does get gcc to use these instructions (with a Newton-Raphson iteration):

`-mrecip`

This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -finite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).

Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

`-mrecip=opt`

This option controls which reciprocal estimate instructions may be used. opt is a comma-separated list of options, which may be preceded by a ‘!’ to invert the option:

- `all`: Enable all estimate instructions.
- `default`: Enable the default instructions, equivalent to -mrecip.
- `none`: Disable all estimate instructions, equivalent to -mno-recip.
- `div`: Enable the approximation for scalar division.
- `vec-div`: Enable the approximation for vectorized division.
- `sqrt`: Enable the approximation for scalar square root.
- `vec-sqrt`: Enable the approximation for vectorized square root.

So, for example, -mrecip=all,!sqrt enables all of the reciprocal approximations, except for square root.
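For readers who want to see what "an additional Newton-Raphson step" amounts to, here is a rough hand-written sketch using SSE intrinsics. It refines the estimate `x0` with the textbook update `x1 = x0 * (2 - v * x0)`, which roughly squares the relative error. This is only an illustration under that assumption: it is not necessarily the exact instruction sequence GCC emits, and `recip_nr` and `recip_nr_scalar` are hypothetical names.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>

// Hypothetical helper: approximate 1/v per lane using RCPPS plus one
// Newton-Raphson refinement, x1 = x0 * (2 - v * x0).
__m128 recip_nr(__m128 v) {
    const __m128 two = _mm_set1_ps(2.0f);
    __m128 x0 = _mm_rcp_ps(v);                     // ~12-bit estimate
    __m128 correction = _mm_sub_ps(two, _mm_mul_ps(v, x0));
    return _mm_mul_ps(x0, correction);             // refined estimate
}

// Hypothetical convenience wrapper for checking a single nonzero value.
float recip_nr_scalar(float x) {
    assert(x != 0.0f);
    return _mm_cvtss_f32(recip_nr(_mm_set1_ps(x)));
}
```

The result is usually within a couple of ulp of the true quotient, which matches the manual's warning that precision "can be decreased by up to 2 ulp" compared with a real division.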

Note that Intel's new Skylake design further improves FP division performance, to 8-11c latency and one per 3c throughput for 128b vectors (or one per 5c throughput for 256b vectors, with the same latency for `vdivps`). They widened the dividers, so AVX `vdivps ymm` now has the same latency as the 128b version.

(SnB to Haswell did 256b div and sqrt with about twice the latency / recip-throughput, so they clearly only had 128b-wide dividers.) Skylake also pipelines both operations more, so about 4 div operations can be in flight. sqrt is faster, too.

So in several years, once Skylake is widespread, it'll only be worth doing `rcpps` if you need to divide by the same thing multiple times. `rcpps` and a couple of `fma` instructions might possibly have slightly higher throughput but worse latency. Also, `vdivps` is only a single uop, so more execution resources will be available for other things to happen at the same time as the division.
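The "divide by the same thing multiple times" case can be sketched in plain C++: pay for one reciprocal, then replace every later division with a multiply. The function name `scale_by_reciprocal` is hypothetical. Results may differ from a true per-element division by about 1 ulp, which is also why compilers only make this transformation themselves under fast-math style options.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: divide every element by the same divisor using a
// single division followed by multiplies.
std::vector<float> scale_by_reciprocal(const std::vector<float>& data,
                                       float divisor) {
    assert(divisor != 0.0f);
    const float inv = 1.0f / divisor;  // the only division, hoisted out of the loop
    std::vector<float> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        out[i] = data[i] * inv;        // multiply instead of divide
    }
    return out;
}
```

The same idea applies to the vector case: one `rcpps` (plus a refinement step if needed) can be amortized over many multiplies, which is where the estimate instructions keep paying off even with fast hardware dividers.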

It remains to be seen what the initial implementation of AVX512 will be like. Presumably `rcpps` and a couple of FMAs for Newton-Raphson iterations will be a win if FP division performance is a bottleneck. If uop throughput is a bottleneck and there's plenty of other work to do while the divisions are in flight, `vdivps zmm` is probably still good (unless the same divisor is used repeatedly, of course).