sacheie sacheie - 14 days ago 7x
C Question

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).

I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.

I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:

greys = aligned_alloc(16, xres * sizeof(int8_t*));

for (uint32_t x = 0; x < xres; x++)
    greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));

(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?

My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.

Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?

I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.

Nevertheless, I'm grateful for any clarification available.

In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).


Least obvious question first:

once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i

Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a). If you want other parts of the vector, your options are:

  • Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
  • Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the similar SSE4.1 intrinsics for other element sizes (_mm_extract_epi8 / _mm_extract_epi32 / _mm_extract_epi64). PEXTRB/W/D/Q are not the fastest instructions, but if you only need one high element, they're better than a separate shuffle and MOVD.
  • Store to a temporary array, or use union { __m128i v; int64_t i64[2]; } or something similar. Union-based type punning is legal in C99, but only as an extension in C++. An alternative to the union that would also work in C++ is memcpy(&my_int64_local, 8 + (char*)&my_vector, 8); to extract the high half. Hopefully that won't compile to a store plus an actual call to the memcpy library function. Compilers are usually pretty good about optimizing away small fixed-size memcpy because of use-cases like this, but you will probably get a store/reload instead of a PEXTRQ, even if you compile with -msse4.1 to let the compiler use it if it wants. If the result can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might use MOVHPS to store the high half of a __m128i.
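The options above can be sketched with SSE2 only. This is a rough illustration, not a tuned implementation; the helper names (vec_to_bytes, vec_low32, vec_high64, vec_byte5) are made up for this example, and the byte order in the comments assumes little-endian x86.

```c
#include <assert.h>     /* static_assert (C11) */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>     /* memcpy */

static_assert(sizeof(__m128i) == 16, "a __m128i holds 16 bytes");

/* Copy all 16 bytes of a vector into an ordinary array.
 * The unaligned store means `out` needs no special alignment. */
static inline void vec_to_bytes(__m128i v, uint8_t out[16])
{
    _mm_storeu_si128((__m128i *)out, v);
}

/* Low 32 bits (bytes 0..3) with the cvt intrinsic (MOVD in asm). */
static inline uint32_t vec_low32(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}

/* High 64 bits (bytes 8..15) via a small fixed-size memcpy.
 * Note the & : we copy from the address of the vector object. */
static inline uint64_t vec_high64(__m128i v)
{
    uint64_t hi;
    memcpy(&hi, 8 + (const char *)&v, sizeof hi);
    return hi;
}

/* A single byte at a fixed position using SSE2 PEXTRW:
 * 16-bit element 2 holds bytes 4 and 5, and byte 5 is its upper half. */
static inline uint8_t vec_byte5(__m128i v)
{
    return (uint8_t)(_mm_extract_epi16(v, 2) >> 8);
}
```

If you want every byte back (the usual case for image data), the plain store in vec_to_bytes is probably the simplest choice: store straight into your output image and skip the scalar extraction entirely.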

Does this turn into a linear, unbroken block in memory?

No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.

Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).

And yes, load data from it with _mm_load_si128, or _mm_loadu_si128 if you need to load from an address that isn't 16-byte aligned.
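A minimal sketch of the single-allocation layout, assuming the total pixel count is a multiple of 16 (true for the power-of-two sizes in the question). The names alloc_image, pixel_index and image_max are hypothetical, and taking the maximum is just a stand-in for whatever per-pixel work you actually do. Note that the loop advances by 16 bytes (128 bits) per step, not by 128.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>     /* aligned_alloc, free */

/* One contiguous, 16-byte-aligned block for the whole image.
 * The size is rounded up because aligned_alloc expects a multiple
 * of the alignment. Release the block with free(). */
static inline uint8_t *alloc_image(size_t xres, size_t yres)
{
    size_t bytes = (xres * yres + 15) & ~(size_t)15;
    return aligned_alloc(16, bytes);
}

/* Manual 2D indexing: pixel (x, y) lives at x*yres + y. */
static inline size_t pixel_index(size_t x, size_t y, size_t yres)
{
    return x * yres + y;
}

/* Largest pixel value, walking the block 16 bytes at a time. */
static inline uint8_t image_max(const uint8_t *img, size_t xres, size_t yres)
{
    assert(((uintptr_t)img & 15) == 0);   /* aligned loads need this */
    assert((xres * yres) % 16 == 0);      /* assumption of this sketch */

    __m128i best = _mm_setzero_si128();
    for (size_t i = 0; i < xres * yres; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(img + i));
        best = _mm_max_epu8(best, v);     /* unsigned byte-wise max */
    }

    /* Reduce the 16 lanes to a single value in scalar code. */
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, best);
    uint8_t m = 0;
    for (int k = 0; k < 16; k++)
        if (lanes[k] > m) m = lanes[k];
    return m;
}
```

If a row length is not a multiple of 16, the start of each row is no longer aligned; in that case either pad each row to a multiple of 16 or switch to _mm_loadu_si128.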

assumed _mm_load_si128 only loads signed bytes

Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.

Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
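For the absolute-difference use case in the question, one common SSE2 approach (a sketch, not necessarily the fastest option) is to pick intrinsics that treat the bytes as unsigned: saturating subtraction in both directions, then OR the results. No subtracting 128 is needed. The function names absdiff_epu8 and absdiff_bytes are made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* |a - b| for 16 unsigned bytes at once.
 * Unsigned saturating subtraction clamps negative results to 0, so in
 * each lane one of the two differences is 0 and the other is |a - b|. */
static inline __m128i absdiff_epu8(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}

/* dst[i] = |a[i] - b[i]| for n bytes.
 * Assumes n is a multiple of 16; the pointers may be unaligned. */
static inline void absdiff_bytes(uint8_t *dst, const uint8_t *a,
                                 const uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), absdiff_epu8(va, vb));
    }
}
```

The loads and stores are identical to what you would use for signed data; only the choice of _mm_subs_epu8 (rather than, say, _mm_subs_epi8) makes the bytes behave as unsigned.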

The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
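As a small illustration of that cast, here is a sketch that loads four 32-bit integers through a __m128i pointer, adds a constant, and stores them back. The function name add_const4 is hypothetical; the unaligned load/store variants are used so the array needs no particular alignment.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Add c to each of the four 32-bit integers at p, in place. */
static inline void add_const4(int32_t *p, int32_t c)
{
    assert(p != NULL);
    /* The cast is the documented way to hand an integer pointer to
     * the load/store intrinsics. */
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    v = _mm_add_epi32(v, _mm_set1_epi32(c));
    _mm_storeu_si128((__m128i *)p, v);
}
```

The same pattern works for uint8_t, int16_t, or int64_t data; only the arithmetic intrinsic in the middle changes with the element size.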

See also "Loading data for GCC's vector extensions", which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.

To learn more about SIMD with SSE, there are some links in the tag wiki, including some intro / tutorial links.

The tag wiki has some good x86 asm / performance links.