sacheie - 8 months ago 62

C Question

Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:

`__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);`

__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);

__m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);

/* Step 1: Multiply vec1 and vec2 */

__m256d vec3 = _mm256_mul_pd(vec1, vec2);

/* Step 2: Switch the real and imaginary elements of vec2 */

vec2 = _mm256_permute_pd(vec2, 0x5);

/* Step 3: Negate the imaginary elements of vec2 */

vec2 = _mm256_mul_pd(vec2, neg);

/* Step 4: Multiply vec1 and the modified vec2 */

__m256d vec4 = _mm256_mul_pd(vec1, vec2);

/* Horizontally subtract the elements in vec3 and vec4 */

vec1 = _mm256_hsub_pd(vec3, vec4);

/* Display the elements of the result vector */

double* res = (double*)&vec1;

printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);

My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:

`struct cmplx a;`

struct cmplx b;

a.r = 2.5341;

a.i = 1.843;

b.r = 1.3941;

b.i = 0.93;

__m256d zzs = squareZ(a, b);

double* res = (double*) &zzs;

printf("\nA: %f + %f, B: %f + %f\n", res[0], res[1], res[2], res[3]);

Using Haskell's complex arithmetic, I have verified the results are correct

`A: 3.025014 + 9.340693, B: 0.000000 + 2.593026`

So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?

Answer

Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.

You can convert between interleaved `double complex`

style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.

It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA.

gcc auto-vectorizes simple complex multiply scalar loop to pretty good code, as long as you use `-ffast-math`

. It deinterleaves like I suggested.

```
#include <complex.h>
// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.
// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{ // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
for (int i=0; i<4 ; i++) {
dst[i] = A[i] * B[i];
}
}
```

See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMODDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder within lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.

**Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code**. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (`_mm256_unpacklo_pd`

)).

I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.

```
// multiples 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{
// low element first, little-endian style
__m256d A0 = _mm256_loadu_pd((double*)A); // [A0r A0i A1r A1i ] // [a b c d ]
__m256d A2 = _mm256_loadu_pd((double*)(A+2)); // [e f g h ]
__m256d realA = _mm256_unpacklo_pd(A0, A2); // [A0r A2r A1r A3r ] // [a e c g ]
__m256d imagA = _mm256_unpackhi_pd(A0, A2); // [A0i A2i A1i A3i ] // [b f d h ]
// the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
__m256d B0 = _mm256_loadu_pd((double*)B); // [m n o p]
__m256d B2 = _mm256_loadu_pd((double*)(B+2)); // [q r s t]
__m256d realB = _mm256_unpacklo_pd(B0, B2); // [m q o s]
__m256d imagB = _mm256_unpackhi_pd(B0, B2); // [n r p t]
// desired: real=rArB - iAiB, imag=rAiB + rBiA
__m256d realprod = _mm256_mul_pd(realA, realB);
__m256d imagprod = _mm256_mul_pd(imagA, imagB);
__m256d rAiB = _mm256_mul_pd(realA, imagB);
__m256d rBiA = _mm256_mul_pd(realB, imagA);
// gcc and clang will contract these nto FMA. (clang needs -ffp-contract=fast)
// Doing it manually would remove the option to compile for non-FMA targets
__m256d real = _mm256_sub_pd(realprod, imagprod); // [D0r D2r | D1r D3r]
__m256d imag = _mm256_add_pd(rAiB, rBiA); // [D0i D2i | D1i D3i]
// interleave the separate real and imaginary vectors back into packed format
__m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000); // [D0r D0i | D1r D1i]
__m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111); // [D2r D2i | D3r D3i]
_mm256_storeu_pd((double*)dst, dst0);
_mm256_storeu_pd((double*)(dst+2), dst2);
}
Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, YMMWORD PTR [rsi+32]
vmovupd ymm3, YMMWORD PTR [rsi]
vmovupd ymm1, YMMWORD PTR [rdx]
vunpcklpd ymm5, ymm3, ymm0
vunpckhpd ymm3, ymm3, ymm0
vmovupd ymm0, YMMWORD PTR [rdx+32]
vunpcklpd ymm4, ymm1, ymm0
vunpckhpd ymm1, ymm1, ymm0
vmulpd ymm2, ymm1, ymm3
vmulpd ymm0, ymm4, ymm3
vfmsub231pd ymm2, ymm4, ymm5 # separate mul/sub contracted into FMA
vfmadd231pd ymm0, ymm1, ymm5
vunpcklpd ymm1, ymm2, ymm0
vunpckhpd ymm0, ymm2, ymm0
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd YMMWORD PTR [rdi+32], ymm0
vzeroupper
ret
```

For 4 complex multiplies (of 2 pairs of input vectors), my code uses:

- 4 loads (32B each)
- 2 stores (32B each)
- 6 in-lane shuffles (one for each input vector, one for each output)
- 2 VMULPD
- 2 VFMA...something
- (only 4 shuffles if we can use the results in separated real and imag vectors, or 0 shuffles if the inputs are already in this format, too)
- latency on Intel Skylake (not counting loads/stores): 14 cycles = 4c for 4 shuffles until the second VMULPD can start + 4 cycles (second VMULPD) + 4c (second vfmadd231pd) + 1c (shuffle first result vector ready 1c earlier) + 1c (shuffle second result vector)

Matt Scarpino's code with my tweaks (repeated to do 4 complex multiplies) uses:

- the same 6 loads/stores
- 6 in-lane shuffles (HSUBPD is 2 shuffles and a subtract on current Intel and AMD CPUs)
- 4 multiplies
- 2 subtracts (which can't combine with the muls into FMAs)
- An extra instruction (+ a constant) to flip the sign of the imaginary elements. Matt chose to multiply by 1.0 or -1.0, but the efficient choice is to XOR the sign bit (i.e. XORPD with
`-0.0`

). - latency on Intel Skylake for the first result vector: 11 cycles. 1c(vpermilpd and vxorpd in the same cycle) + 4c(second vmulpd) + 6c(vhsubpd). The first vmulpd overlaps with other ops, starting in the same cycle as the shuffle and vxorpd. Computation of a second result vector should interleave pretty nicely.

The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.

My tweaked version of Matt's code is nearly as good (as the 4-at-a-time version) on CPUs without FMA (just costing an extra VXORPD), but significantly worse when it stops us from taking advantage of FMA. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.