Deadloop - 6 months ago 36
C++ Question

# Euclidean distance using intrinsic instruction

For a research project, I need to compute a lot of Euclidean distances, where certain dimensions must be selected and others discarded. In the current state of the program, the array of selected dimensions has 100-ish elements, and I compute around 2-3 million distances. My current piece of code is as follow :

``````float compute_distance(const float* p1, const float* p2) const
{
__m256 euclidean = _mm256_setzero_ps();

const uint16_t n = nbr_dimensions;
const uint16_t aligend_n = n - n % 16;
const float* local_selected = selected_dimensions;

for (uint16_t i = 0; i < aligend_n; i += 16)
{
}
float distance = hsum256_ps_avx(euclidean);

for (uint16_t i = aligend_n; i < n; ++i)
{
const float num = p1[i] - p2[i];
distance += num * num * local_selected[i];
}

return distance;
}
``````

The selected dimensions are pre-determined. I could thus pre-compute an array of
`__m256`
to pass to a
`_mm256_blendv_ps`
instead of multiplying by 0 or 1 in the line
`euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r1, r1), _mm256_load_ps(&local_selected[i]), euclidean);`
. But I'm rather a novice in intrinsic instructions, and I have not yet found a working solution.

I was wondering if you guys could have some advice, or even code suggestions, to improve the running speed of this function. As a side note, I do not have access to AVX-512 instructions.

Update:
Using the first above-mentioned solution, it comes:

``````float compute_distance(const float* p1, const float* p2) const
{
const size_t n = nbr_dimensions;
const size_t aligend_n = n - n % 16;
const unsigned int* local_selected = selected_dimensions;

__m256 euc1 = _mm256_setzero_ps(), euc2 = _mm256_setzero_ps(),
euc3 = _mm256_setzero_ps(), euc4 = _mm256_setzero_ps();

const size_t n_max = aligend_n/8;
for (size_t i = 0; i < n_max; i += 4)
{
const __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 0]), _mm256_load_ps(&p2[i * 8 + 0]));
const __m256 r1_1 = _mm256_and_ps(r1, local_masks[i + 0]);

const __m256 r2 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 8]), _mm256_load_ps(&p2[i * 8 + 8]));
const __m256 r2_1 = _mm256_and_ps(r2, local_masks[i + 1]);

const __m256 r3 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 16]), _mm256_load_ps(&p2[i * 8 + 16]));
const __m256 r3_1 = _mm256_and_ps(r3, local_masks[i + 2]);

const __m256 r4 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 24]), _mm256_load_ps(&p2[i * 8 + 24]));
const __m256 r4_1 = _mm256_and_ps(r4, local_masks[i + 3]);
}

for (size_t i = aligend_n; i < n; ++i)
{
const float num = p1[i] - p2[i];
distance += num * num * local_selected[i];
}

return distance;
}
``````

Don't use `uint16_t` for your loop counter unless you really want to force the compiler to truncate to 16 bits every time. Use at least `unsigned`, or you sometimes get better asm from using `uintptr_t` (or more conventionally, `size_t`). Zero-extension from 32-bit to pointer width happens for free on x86-64 just from using 32-bit operand-size asm instructions, but sometimes compilers still don't do great.

Use five or more separate accumulators instead of one `euclidean`, so multiple sub / FMA instructions can be in flight without bottlenecking on the latency of a loop-carried dependency chain that does FMAs into one accumulator.

FMA has a latency of 5 cycles but a throughput of one per 0.5 cycles on Intel Haswell. See also latency vs throughput in intel intrinsics, and also my answer on Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more-advanced version.

Avoid passing args through global variables. Apparently your `n` is a compile-time constant (which is good), but `selected_dimensions` isn't, is it? If it is, then you're only using one set of masks in your whole program, so nevermind the stuff below about compressing masks.

Using global variables can defeat compiler optimization when it inlines your function into a caller which sets a global before calling it. (Usually only if there's a non-inline function call between setting the global and using it, but that's not uncommon.)

update: your arrays are small, only ~100 elements, so unrolling by only 2 is maybe good, to reduce startup / cleanup overhead. Out-of-order execution can hide the FMA latency over this short number of iterations, especially if the final result of this function call isn't needed to decide the input parameters to the next call.

Total function-call overhead is important, not just how efficient the vectorization is for large arrays.

As discussed in comments, peeling the first iteration of the loop lets you avoid the first FMA by initializing `euc1 = stuff(p1[0], p2[0]);` instead of `_mm256_setzero_ps()`.

Padding your arrays out to a full vector (or even a full unrolled loop-body of 2 vectors) with zeros lets you avoid the scalar cleanup loop entirely, and make the whole function very compact.

If you couldn't just pad, you could still avoid a scalar cleanup by loading an unaligned vector that goes right up to the end of the inputs, and mask that to avoid double-counting. (See this Q&A for a way to generate a mask based on a misalignment count). In other kinds of problems where you're writing an output array, it's fine to redo overlapping elements.

You don't show your `hsum256_ps_avx` code, but that's a decent fraction of the total latency and maybe throughput of your function. Make sure you optimize it for throughput: e.g. avoid `haddps` / `_mm_hadd_ps`. See my answer on Fastest way to do horizontal float vector sum on x86.

I could thus pre-compute an array of __m256 to pass to a `_mm256_blendv_ps` instead of multiplying by 0 or 1 in the FMA.

Yes, that would be better, especially if it lets you fold something else into an FMAdd / FMSub. But even better than that, use a boolean `_mm256_and_ps` with all-zero or all-ones. That leaves a value unchanged (`1 & x == x`) or zeroed (`0 & x == 0`, and the binary representation of float `0.0` is all-zeros.)

If your masks aren't missing in cache, then store them fully unpacked so they can just be loaded.

If you're using different masks with the same `p1` and `p2`, you could pre-compute `p1-p2` squared, and then just do a masked `add_ps` reduction over that. (But note that FMA has better throughput than ADD on Intel pre-Skylake. Haswell/Broadwell have 2 FMA units, but run ADDPS on a dedicated unit with lower latency (3c vs. 5c). There's only one vector-FP add unit. Skylake just runs everything on the FMA units with 4 cycle latency.) Anyway, this means it can actually be a throughput win to use FMA as `1.0 * x + y`. But you're probably fine, because you still need to load the mask and the `square(p1-p2)` separately, so that's 2 loads per FP add, so one per cycle keeps up with load throughput. Unless you (or the compiler) peel a few iterations at the front and keep the float data for those iterations in registers across multiple different `local_selected` masks.

update: I wrote this assuming that the array size was 2-3 million, rather than ~100. Profile for L1D cache misses to decide if spending more CPU instructions to reduce cache footprint is worth it. If you always use the same mask for all 3 million calls, it's probably not worth it to compress.

You could compact your masks down to 8 bits per element and load them with `pmovsx` (`_mm256_cvtepi8_epi32`) (sign-extending an all-ones value produces a wider all-ones, because that's how 2's complement `-1` works). Unfortunately using it as a load is annoying; compilers sometimes fail to optimize `_mm256_cvtepi8_epi32(_mm_cvtsi64x_si128(foo))` into `vpmovsxbd ymm0, [mem]`, and instead actually use a separate `vmovq` instruction.

``````const uint64_t *local_selected = something;  // packed to 1B per element

__m256 euc1 = _mm256_setzero_ps(), euc2 = _mm256_setzero_ps(),
euc3 =  _mm256_setzero_ps(), euc4 =  _mm256_setzero_ps();

for (i = 0 ; i < n ; i += 8*4) {  // 8 floats * an unroll of 4

__m256 mask = _mm256_castsi256_ps( _mm256_cvtepi8_epi32(_mm_cvtsi64x_si128(local_selected[i*1 + 0])) );

r1 = _mm256_and_ps(r1, mask);             // zero r1 or leave it untouched.
euc1 = _mm256_fmadd_ps(r1, r1, euc1);    // euc1 += r1*r1
// ... same for r2 with local_selected[i + 1]
// and p1/p2[i*8 + 8]
// euc2 += (r2*r2) & mask2

// and again for euc3 (local_selected[i + 2], p1/p2[i*8 + 16]
// and again for euc3 (local_selected[i + 3], p1/p2[i*8 + 24]
}
euclidean = hsum (euc1+euc2+euc3+euc4);
``````

I guess you bottleneck slightly on load throughput without the `pmovsx`, since you have three loads for three vector ALU operations. (And with micro-fusion, it's only 4 total fused-domain uops on an Intel CPU, so it's not bottlenecked on the front-end). And the three ALU uops can run on different ports (`vandps` is 1 uop for port 5 on Intel pre-Skylake. On SKL it can run on any port).

Adding in a shuffle (the `pmovsx`) potentially bottlenecks on port5 (on Haswell/Broadwell). You might want to do use `vpand` for the masking so it can run on any port, if you're tuning for HSW/BDW, even if they have extra bypass latency between an integer AND and FP math instructions. With enough accumulators, you're not latency-bound. (Skylake has extra bypass latency for `VANDPS` depending on which port it happens to run on).

blendv is slower than AND: always at least 2 uops.

### Compressing the mask even more for large arrays

If your arrays are larger than L2 cache, and your mask array has as many elements as your float arrays, you will most likely bottleneck on load bandwidth (at least once you unroll with multiple vector accumulators). This means that spending more instructions unpacking the mask data is worth it to reduce that part of the bandwidth requirement.

I think the ideal format for your mask data is with 32 vectors of masks interleaved, which makes it very cheap to "unpack" on the fly. Use a shift to bring the right mask into the high bit of each 32-bit element, and use it with `vblendvps` to conditionally zero elements by blending with zero. (Or with an arithmetic right shift + boolean AND)

``````__m256i masks = _mm256_load_si256(...);

// this actually needs a cast to __m256, omitted for readability
...

...

...

// fully unrolling is overkill; you can set up for a loop back to r0 with
``````

You could also do `masks = _mm256_slli_epi32(masks, 1);` at every step, which might be better because it uses 1 register less. But it might be more sensitive to resource conflicts causing latency on the mask dep chain, since every mask depends on the previous one.

Intel Haswell runs both `vblendvps` uops on port5 only, so you could consider using `_mm256_srai_epi32` + `_mm256_and_ps`. But Skylake can run the 2 uops on any of p015, so blend is good there (although it does tie up a vector register holding the all-zero vector).

Generate masks in that interleaved format with a packed-compare, then `_mm256_srli_epi32(cmp_result, 31)` and OR that into the vector you're building up. Then left-shift it by 1. Repeat 32 times.

You can still use this format if you have fewer than 32 whole vectors of data in your arrays. The lower bits will just go unused. Or you could have the masks for 2 or more `selected_dimensions` per vector. e.g. the top 16 bits of each element are for one `selected_dimensions`, and the bottom 16 bits are for another. You could do something like

``````__m256i masks =  _mm256_load_si256(dimensions[selector/2]);

// or maybe
if (selector % 2) {
}
``````

## AVX512:

AVX512 can use a bitmap mask directly, so it's somewhat more efficient. Just use `const __mmask16 *local_selected = whatever;` to declare an array of 16-bit masks (for use with 512b vectors of 16 floats), and use `r0 = _mm512_maskz_sub_ps(p1,p2, local_selected[i]);` to zero-mask the subtraction.

If you actually bottleneck on load-port uop throughput (2 loads per clock), you might try loading 64 bits of mask data at once, and using a mask-shift to get a different low-16 of it. This probably won't be a problem unless your data is hot in L1D cache, though.

It's very easy to generate the mask data in the first place with a compare-into-mask, with no interleaving required.

Ideally you could cache-block the code that calls this so you could reuse data while it was hot in cache. e.g. get all the combinations you want from the first 64kiB of p1 and p2, then move on to later elements and do them while they're hot in cache.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download