Deadloop Deadloop - 1 month ago 11
C++ Question

Euclidean distance using intrinsic instruction

For a research project, I need to compute a lot of Euclidean distances, where certain dimensions must be selected and others discarded. In the current state of the program, the array of selected dimensions has 100-ish elements, and I compute around 2-3 million distances. My current piece of code is as follow :

float compute_distance(const float* p1, const float* p2) const
{
__m256 euclidean = _mm256_setzero_ps();

const uint16_t n = nbr_dimensions;
const uint16_t aligend_n = n - n % 16;
const float* local_selected = selected_dimensions;

for (uint16_t i = 0; i < aligend_n; i += 16)
{
const __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i]), _mm256_load_ps(&p2[i]));
euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r1, r1), _mm256_load_ps(&local_selected[i]), euclidean);
const __m256 r2 = _mm256_sub_ps(_mm256_load_ps(&p1[i + 8]), _mm256_load_ps(&p2[i + 8]));
euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r2, r2), _mm256_load_ps(&local_selected[i + 8]), euclidean);
}
float distance = hsum256_ps_avx(euclidean);

for (uint16_t i = aligend_n; i < n; ++i)
{
const float num = p1[i] - p2[i];
distance += num * num * local_selected[i];
}

return distance;
}


The selected dimensions are pre-determined. I could thus pre-compute an array of
__m256
to pass to a
_mm256_blendv_ps
instead of multiplying by 0 or 1 in the line
euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r1, r1), _mm256_load_ps(&local_selected[i]), euclidean);
. But I'm rather a novice in intrinsic instructions, and I have not yet found a working solution.

I was wondering if you guys could have some advice, or even code suggestions, to improve the running speed of this function. As a side note, I do not have access to AVX-512 instructions.




Update:
Using the first above-mentioned solution, it comes:

float compute_distance(const float* p1, const float* p2) const
{
const size_t n = nbr_dimensions;
const size_t aligend_n = n - n % 16;
const unsigned int* local_selected = selected_dimensions;
const __m256* local_masks = masks;

__m256 euc1 = _mm256_setzero_ps(), euc2 = _mm256_setzero_ps(),
euc3 = _mm256_setzero_ps(), euc4 = _mm256_setzero_ps();

const size_t n_max = aligend_n/8;
for (size_t i = 0; i < n_max; i += 4)
{
const __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 0]), _mm256_load_ps(&p2[i * 8 + 0]));
const __m256 r1_1 = _mm256_and_ps(r1, local_masks[i + 0]);
euc1 = _mm256_fmadd_ps(r1_1, r1_1, euc1);

const __m256 r2 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 8]), _mm256_load_ps(&p2[i * 8 + 8]));
const __m256 r2_1 = _mm256_and_ps(r2, local_masks[i + 1]);
euc2 = _mm256_fmadd_ps(r2_1, r2_1, euc2);

const __m256 r3 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 16]), _mm256_load_ps(&p2[i * 8 + 16]));
const __m256 r3_1 = _mm256_and_ps(r3, local_masks[i + 2]);
euc3 = _mm256_fmadd_ps(r3_1, r3_1, euc3);

const __m256 r4 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 24]), _mm256_load_ps(&p2[i * 8 + 24]));
const __m256 r4_1 = _mm256_and_ps(r4, local_masks[i + 3]);
euc4 = _mm256_fmadd_ps(r4_1, r4_1, euc4);
}

float distance = hsum256_ps_avx(_mm256_add_ps(_mm256_add_ps(euc1, euc2), _mm256_add_ps(euc3, euc4)));

for (size_t i = aligend_n; i < n; ++i)
{
const float num = p1[i] - p2[i];
distance += num * num * local_selected[i];
}

return distance;
}

Answer Source

Basic advice:

Don't use uint16_t for your loop counter unless you really want to force the compiler to truncate to 16 bits every time. Use at least unsigned, or you sometimes get better asm from using uintptr_t (or more conventionally, size_t). Zero-extension from 32-bit to pointer width happens for free on x86-64 just from using 32-bit operand-size asm instructions, but sometimes compilers still don't do great.

Use five or more separate accumulators instead of one euclidean, so multiple sub / FMA instructions can be in flight without bottlenecking on the latency of a loop-carried dependency chain that does FMAs into one accumulator.

FMA has a latency of 5 cycles but a throughput of one per 0.5 cycles on Intel Haswell. See also latency vs throughput in intel intrinsics, and also my answer on Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more-advanced version.


Avoid passing args through global variables. Apparently your n is a compile-time constant (which is good), but selected_dimensions isn't, is it? If it is, then you're only using one set of masks in your whole program, so nevermind the stuff below about compressing masks.

Using global variables can defeat compiler optimization when it inlines your function into a caller which sets a global before calling it. (Usually only if there's a non-inline function call between setting the global and using it, but that's not uncommon.)


update: your arrays are small, only ~100 elements, so unrolling by only 2 is maybe good, to reduce startup / cleanup overhead. Out-of-order execution can hide the FMA latency over this short number of iterations, especially if the final result of this function call isn't needed to decide the input parameters to the next call.

Total function-call overhead is important, not just how efficient the vectorization is for large arrays.

As discussed in comments, peeling the first iteration of the loop lets you avoid the first FMA by initializing euc1 = stuff(p1[0], p2[0]); instead of _mm256_setzero_ps().

Padding your arrays out to a full vector (or even a full unrolled loop-body of 2 vectors) with zeros lets you avoid the scalar cleanup loop entirely, and make the whole function very compact.

If you couldn't just pad, you could still avoid a scalar cleanup by loading an unaligned vector that goes right up to the end of the inputs, and mask that to avoid double-counting. (See this Q&A for a way to generate a mask based on a misalignment count). In other kinds of problems where you're writing an output array, it's fine to redo overlapping elements.

You don't show your hsum256_ps_avx code, but that's a decent fraction of the total latency and maybe throughput of your function. Make sure you optimize it for throughput: e.g. avoid haddps / _mm_hadd_ps. See my answer on Fastest way to do horizontal float vector sum on x86.


Your specific case:

I could thus pre-compute an array of __m256 to pass to a _mm256_blendv_ps instead of multiplying by 0 or 1 in the FMA.

Yes, that would be better, especially if it lets you fold something else into an FMAdd / FMSub. But even better than that, use a boolean _mm256_and_ps with all-zero or all-ones. That leaves a value unchanged (1 & x == x) or zeroed (0 & x == 0, and the binary representation of float 0.0 is all-zeros.)

If your masks aren't missing in cache, then store them fully unpacked so they can just be loaded.

If you're using different masks with the same p1 and p2, you could pre-compute p1-p2 squared, and then just do a masked add_ps reduction over that. (But note that FMA has better throughput than ADD on Intel pre-Skylake. Haswell/Broadwell have 2 FMA units, but run ADDPS on a dedicated unit with lower latency (3c vs. 5c). There's only one vector-FP add unit. Skylake just runs everything on the FMA units with 4 cycle latency.) Anyway, this means it can actually be a throughput win to use FMA as 1.0 * x + y. But you're probably fine, because you still need to load the mask and the square(p1-p2) separately, so that's 2 loads per FP add, so one per cycle keeps up with load throughput. Unless you (or the compiler) peel a few iterations at the front and keep the float data for those iterations in registers across multiple different local_selected masks.


update: I wrote this assuming that the array size was 2-3 million, rather than ~100. Profile for L1D cache misses to decide if spending more CPU instructions to reduce cache footprint is worth it. If you always use the same mask for all 3 million calls, it's probably not worth it to compress.

You could compact your masks down to 8 bits per element and load them with pmovsx (_mm256_cvtepi8_epi32) (sign-extending an all-ones value produces a wider all-ones, because that's how 2's complement -1 works). Unfortunately using it as a load is annoying; compilers sometimes fail to optimize _mm256_cvtepi8_epi32(_mm_cvtsi64x_si128(foo)) into vpmovsxbd ymm0, [mem], and instead actually use a separate vmovq instruction.

const uint64_t *local_selected = something;  // packed to 1B per element

__m256 euc1 = _mm256_setzero_ps(), euc2 = _mm256_setzero_ps(),
euc3 =  _mm256_setzero_ps(), euc4 =  _mm256_setzero_ps();

for (i = 0 ; i < n ; i += 8*4) {  // 8 floats * an unroll of 4

    __m256 mask = _mm256_castsi256_ps( _mm256_cvtepi8_epi32(_mm_cvtsi64x_si128(local_selected[i*1 + 0])) );
    // __m256 mask = _mm256_load_ps(local_selected[i*8 + 0]); //  without packing

    const __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i*8 + 0]), _mm256_load_ps(&p2[i*8 + 0]));
    r1 = _mm256_and_ps(r1, mask);             // zero r1 or leave it untouched.
    euc1 = _mm256_fmadd_ps(r1, r1, euc1);    // euc1 += r1*r1
    // ... same for r2 with local_selected[i + 1]
    // and p1/p2[i*8 + 8]
    // euc2 += (r2*r2) & mask2

    // and again for euc3 (local_selected[i + 2], p1/p2[i*8 + 16]
    // and again for euc3 (local_selected[i + 3], p1/p2[i*8 + 24]
}
euclidean = hsum (euc1+euc2+euc3+euc4);

I guess you bottleneck slightly on load throughput without the pmovsx, since you have three loads for three vector ALU operations. (And with micro-fusion, it's only 4 total fused-domain uops on an Intel CPU, so it's not bottlenecked on the front-end). And the three ALU uops can run on different ports (vandps is 1 uop for port 5 on Intel pre-Skylake. On SKL it can run on any port).

Adding in a shuffle (the pmovsx) potentially bottlenecks on port5 (on Haswell/Broadwell). You might want to do use vpand for the masking so it can run on any port, if you're tuning for HSW/BDW, even if they have extra bypass latency between an integer AND and FP math instructions. With enough accumulators, you're not latency-bound. (Skylake has extra bypass latency for VANDPS depending on which port it happens to run on).

blendv is slower than AND: always at least 2 uops.


Compressing the mask even more for large arrays

If your arrays are larger than L2 cache, and your mask array has as many elements as your float arrays, you will most likely bottleneck on load bandwidth (at least once you unroll with multiple vector accumulators). This means that spending more instructions unpacking the mask data is worth it to reduce that part of the bandwidth requirement.

I think the ideal format for your mask data is with 32 vectors of masks interleaved, which makes it very cheap to "unpack" on the fly. Use a shift to bring the right mask into the high bit of each 32-bit element, and use it with vblendvps to conditionally zero elements by blending with zero. (Or with an arithmetic right shift + boolean AND)

__m256i masks = _mm256_load_si256(...);

                          // this actually needs a cast to __m256, omitted for readability
r0 = _mm256_blendv_ps(_mm256_setzero_ps(), r0, masks);
...

__m256i mask1 = _mm256_slli_epi32(masks, 1);
r1 = _mm256_blendv_ps(_mm256_setzero_ps(), r1, mask1);
...

__m256i mask2 = _mm256_slli_epi32(masks, 2);
r2 = _mm256_blendv_ps(_mm256_setzero_ps(), r2, mask2);
...

// fully unrolling is overkill; you can set up for a loop back to r0 with
masks = _mm256_slli_epi32(masks, 4);

You could also do masks = _mm256_slli_epi32(masks, 1); at every step, which might be better because it uses 1 register less. But it might be more sensitive to resource conflicts causing latency on the mask dep chain, since every mask depends on the previous one.

Intel Haswell runs both vblendvps uops on port5 only, so you could consider using _mm256_srai_epi32 + _mm256_and_ps. But Skylake can run the 2 uops on any of p015, so blend is good there (although it does tie up a vector register holding the all-zero vector).


Generate masks in that interleaved format with a packed-compare, then _mm256_srli_epi32(cmp_result, 31) and OR that into the vector you're building up. Then left-shift it by 1. Repeat 32 times.


You can still use this format if you have fewer than 32 whole vectors of data in your arrays. The lower bits will just go unused. Or you could have the masks for 2 or more selected_dimensions per vector. e.g. the top 16 bits of each element are for one selected_dimensions, and the bottom 16 bits are for another. You could do something like

__m256i masks =  _mm256_load_si256(dimensions[selector/2]);
masks = _mm256_sll_epi32(masks, 16 * (selector % 2));

// or maybe
if (selector % 2) {
    masks = _mm256_slli_epi32(masks, 16);
}

AVX512:

AVX512 can use a bitmap mask directly, so it's somewhat more efficient. Just use const __mmask16 *local_selected = whatever; to declare an array of 16-bit masks (for use with 512b vectors of 16 floats), and use r0 = _mm512_maskz_sub_ps(p1,p2, local_selected[i]); to zero-mask the subtraction.

If you actually bottleneck on load-port uop throughput (2 loads per clock), you might try loading 64 bits of mask data at once, and using a mask-shift to get a different low-16 of it. This probably won't be a problem unless your data is hot in L1D cache, though.

It's very easy to generate the mask data in the first place with a compare-into-mask, with no interleaving required.


Ideally you could cache-block the code that calls this so you could reuse data while it was hot in cache. e.g. get all the combinations you want from the first 64kiB of p1 and p2, then move on to later elements and do them while they're hot in cache.