matovitch matovitch - 13 days ago 4x
C Question

C intrinsics, SSE2 dot product and gcc -O3 generated assembly

I need to write a dot product using SSE2 (no _mm_dp_ps nor _mm_hadd_ps) :

#include <xmmintrin.h>

inline __m128 sse_dot4(__m128 a, __m128 b)
const __m128 mult = _mm_mul_ps(a, b);
const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
const __m128 shuf2 = _mm_shuffle_ps(mult,mult, _MM_SHUFFLE(1, 0, 3, 2));
const __m128 shuf3 = _mm_shuffle_ps(mult,mult, _MM_SHUFFLE(2, 1, 0, 3));

return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);

but I looked at the generated assembler with gcc 4.9 (experimental) -O3, and I get :

mulps %xmm1, %xmm0
movaps %xmm0, %xmm3 //These lines
movaps %xmm0, %xmm2 //have no use
movaps %xmm0, %xmm1 //isn't it ?
shufps $57, %xmm0, %xmm3
shufps $78, %xmm0, %xmm2
shufps $147, %xmm0, %xmm1
addss %xmm3, %xmm0
addss %xmm2, %xmm0
addss %xmm1, %xmm0

I am wondering why gcc copy xmm0 in xmm1, 2 and 3...
Here is the code I get using the flag : -march=native (looks better)

vmulps %xmm1, %xmm0, %xmm1
vshufps $78, %xmm1, %xmm1, %xmm2
vshufps $57, %xmm1, %xmm1, %xmm3
vshufps $147, %xmm1, %xmm1, %xmm0
vaddss %xmm3, %xmm1, %xmm1
vaddss %xmm2, %xmm1, %xmm1
vaddss %xmm0, %xmm1, %xmm0


(In fact, and despite all the up-votes, the answers that were given at the time this question was posted did not fulfill the expectations I had. Here is the answer I was waiting for.)

The SSE instruction

shufps $IMM, xmmA, xmmB

does not work as

xmmB = f($IMM, xmmA) 
//set xmmB with xmmA's words shuffled according to $IMM

but as

xmmB = f($IMM, xmmA, xmmB) 
//set xmmB with 2 words of xmmA and 2 words of xmmB according to $IMM

this is why the copy of the mulps result from xmm0 to xmm1..3 is needed.