fuz - 3 years ago 131
C Question

# How to load a pixel struct into an SSE register?

I have a struct of 8-bit pixel data:

``````
struct __attribute__((aligned(4))) pixels {
    char r;
    char g;
    char b;
    char a;
};
``````

I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers?

### Unpacking unsigned pixels with SSE2

Using the SSE2 integer intrinsics from `<emmintrin.h>`, first load the pixel into the lower 32 bits of the register:

``````
__m128i xmm0 = _mm_cvtsi32_si128(*(const int*)&pixel);
``````

Then unpack those 8-bit values into 16-bit values in the lower 64 bits of the register, interleaving them with 0s:

``````
xmm0 = _mm_unpacklo_epi8(xmm0, _mm_setzero_si128());
``````

And again unpack those 16-bit values into 32-bit values:

``````
xmm0 = _mm_unpacklo_epi16(xmm0, _mm_setzero_si128());
``````

You should now have each component of the pixel as a 32-bit integer in the corresponding one of the 4 lanes of the SSE register.
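The three steps above can be combined into one small function. This is only a sketch: the `pixel_u8` struct and the `load_pixel_u32` name are made up for illustration, the components are assumed to be `unsigned char`, and `memcpy` is used instead of the pointer cast to sidestep aliasing concerns (compilers typically turn it into a single 32-bit load).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Assumed layout: the question's struct, but with unsigned components. */
struct pixel_u8 {
    unsigned char r, g, b, a;
};

/* Hypothetical helper: widen one RGBA pixel to four 32-bit lanes (r, g, b, a). */
static __m128i load_pixel_u32(const struct pixel_u8 *p)
{
    int32_t bits;
    memcpy(&bits, p, sizeof bits);            /* copy the 4 pixel bytes */
    __m128i v = _mm_cvtsi32_si128(bits);      /* pixel in the low 32 bits */
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);           /* 8-bit  -> 16-bit, zero-extended */
    v = _mm_unpacklo_epi16(v, zero);          /* 16-bit -> 32-bit, zero-extended */
    return v;
}
```

Storing the result with `_mm_storeu_si128` into a `uint32_t[4]` should give the components in the order r, g, b, a.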

### Unpacking signed pixels with SSE2

I just read that you want to get those values as 32-bit signed integers, though I wonder what sense a signed pixel in [-128,127] makes. But if your pixel values can indeed be negative, the interleaving with zeros won't work, since it turns a negative 8-bit number into a positive 16-bit number (and thus interprets your numbers as unsigned pixel values). A negative number has to be extended with `1`s instead of `0`s, but unfortunately that has to be decided dynamically on a component-by-component basis, which SSE is not that good at.

What you could do is compare the values for negativity and use the resulting mask (which fortunately uses `1...1` for true and `0...0` for false) as interleavand, instead of the zero register:

``````
xmm0 = _mm_unpacklo_epi8(xmm0, _mm_cmplt_epi8(xmm0, _mm_setzero_si128()));
xmm0 = _mm_unpacklo_epi16(xmm0, _mm_cmplt_epi16(xmm0, _mm_setzero_si128()));
``````

This will properly extend negative numbers with `1`s and positive ones with `0`s. But of course this additional overhead (probably 2-4 additional SSE instructions) is only necessary if your initial 8-bit pixel values can ever be negative, which I still doubt. If that really is the case, you should use `signed char` rather than `char`, as the latter has implementation-defined signedness (in the same way, you should use `unsigned char` if those are the common unsigned [0,255] pixel values).
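As a sketch of the comparison-based sign extension, assuming the four components are available as `signed char` (the function name `widen_s8_cmp` is hypothetical):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sign-extend four signed 8-bit values to 32-bit lanes. */
static __m128i widen_s8_cmp(const signed char c[4])
{
    int32_t bits;
    memcpy(&bits, c, sizeof bits);
    __m128i v = _mm_cvtsi32_si128(bits);
    const __m128i zero = _mm_setzero_si128();
    /* The mask byte is 0xFF for negative inputs and 0x00 otherwise, which is
       exactly the upper byte of the sign-extended 16-bit value. */
    v = _mm_unpacklo_epi8(v, _mm_cmplt_epi8(v, zero));
    /* Same idea one level up: the mask word is 0xFFFF for negative values. */
    v = _mm_unpacklo_epi16(v, _mm_cmplt_epi16(v, zero));
    return v;
}
```

Note that each comparison has to run on the value before it is unpacked, so the mask matches the element width of that step.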

### Alternative SSE2 unpacking using shifts

As clarified, you don't need a signed 8-bit to 32-bit conversion, but for the sake of completeness: harold had another very good idea for SSE2-based sign extension that avoids the comparison-based version above. We first unpack the 8-bit values into the upper byte of each 32-bit value instead of the lower byte. Since we don't care about the lower parts, we just use the 8-bit values again, which frees us from the need for an extra zero register and an additional move:

``````
xmm0 = _mm_unpacklo_epi8(xmm0, xmm0);
xmm0 = _mm_unpacklo_epi16(xmm0, xmm0);
``````

Now we just need to perform an arithmetic right shift of the upper byte into the lower byte, which does the proper sign extension for negative values:

``````
xmm0 = _mm_srai_epi32(xmm0, 24);
``````

This should need fewer instructions and fewer registers than my comparison-based SSE2 version above.

Compared to the zero-interleaving version, it takes the same number of instructions for a single pixel (though 1 more when amortized over many pixels, where the zero register is set up only once) and needs no extra zero register. So if registers are scarce, it might even be used for the unsigned 8-bit to 32-bit conversion, but then with a logical shift (`_mm_srli_epi32`) instead of an arithmetic shift.
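A minimal sketch of the shift-based variant for both the signed and the unsigned case. The helper names are invented for illustration, and the input is assumed to be four bytes in memory:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: replicate each of the 4 input bytes across a 32-bit
   lane, so that the original byte ends up (among others) in the top byte. */
static __m128i spread_bytes(const void *four_bytes)
{
    int32_t bits;
    memcpy(&bits, four_bytes, sizeof bits);
    __m128i v = _mm_cvtsi32_si128(bits);
    v = _mm_unpacklo_epi8(v, v);    /* each 16-bit lane holds the byte twice */
    v = _mm_unpacklo_epi16(v, v);   /* each 32-bit lane holds the byte 4 times */
    return v;
}

/* Sign extension: the arithmetic shift copies the sign bit downwards. */
static __m128i widen_s8_shift(const signed char c[4])
{
    return _mm_srai_epi32(spread_bytes(c), 24);
}

/* Zero extension: the logical shift fills the upper bits with zeros. */
static __m128i widen_u8_shift(const unsigned char c[4])
{
    return _mm_srli_epi32(spread_bytes(c), 24);
}
```

The only difference between the two conversions is the final shift; no zero register is needed in either case.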

### Improved unpacking with SSE4

Thanks to harold's comment, there is an even better option for the initial 8-to-32 transformation. If you have SSE4 support (SSE4.1 to be precise), there are instructions that do the complete conversion from 4 packed 8-bit values in the lower 32 bits of the register into 4 32-bit values in the whole register, both for signed and unsigned 8-bit values. The intrinsics are declared in `<smmintrin.h>`:

``````
xmm0 = _mm_cvtepu8_epi32(xmm0);   //or _mm_cvtepi8_epi32 for signed 8-bit values
``````
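Here is how the SSE4.1 conversion might be wrapped, as a sketch only. It assumes GCC or Clang, where the `target` function attribute allows SSE4.1 intrinsics in a single function without compiling the whole file with `-msse4.1`; the `USE_SSE41` macro and the function names are made up for this example. The CPU still has to support SSE4.1 at run time.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>
#include <string.h>

/* Assumption: GCC or Clang. With other compilers the macro expands to nothing
   and SSE4.1 has to be enabled through the compiler's own options. */
#if defined(__GNUC__)
#define USE_SSE41 __attribute__((target("sse4.1")))
#else
#define USE_SSE41
#endif

/* Zero-extend four unsigned 8-bit values to 32-bit lanes in one instruction. */
USE_SSE41 static __m128i widen_u8_sse41(const unsigned char c[4])
{
    int32_t bits;
    memcpy(&bits, c, sizeof bits);
    return _mm_cvtepu8_epi32(_mm_cvtsi32_si128(bits));
}

/* Sign-extend four signed 8-bit values to 32-bit lanes in one instruction. */
USE_SSE41 static __m128i widen_s8_sse41(const signed char c[4])
{
    int32_t bits;
    memcpy(&bits, c, sizeof bits);
    return _mm_cvtepi8_epi32(_mm_cvtsi32_si128(bits));
}
```

On GCC and Clang, `__builtin_cpu_supports("sse4.1")` can be used to check for support at run time and fall back to one of the SSE2 versions otherwise.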

### Packing pixels with SSE2

As for the follow-up question of reversing this transformation, first we pack the signed 32-bit integers into signed 16-bit integers using saturation:

``````
xmm0 = _mm_packs_epi32(xmm0, xmm0);
``````

Then we pack those 16-bit values into unsigned 8-bit values using saturation:

``````
xmm0 = _mm_packus_epi16(xmm0, xmm0);
``````

We can then finally take our pixel from the lower 32 bits of the register:

``````
*(int*)&pixel = _mm_cvtsi128_si32(xmm0);
``````

Due to the saturation, this whole process will automatically map any negative values to `0` and any values greater than `255` to `255`, which is usually what you want when working with color pixels.
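The packing steps can be sketched as a single function. The name `store_pixel_saturated` and the plain byte array used as output are assumptions for illustration; `memcpy` again stands in for the pointer cast.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack four signed 32-bit lanes into 4 bytes,
   clamping every lane to the range [0, 255]. */
static void store_pixel_saturated(__m128i v, unsigned char out[4])
{
    v = _mm_packs_epi32(v, v);    /* 32-bit -> 16-bit, signed saturation */
    v = _mm_packus_epi16(v, v);   /* 16-bit -> 8-bit, unsigned saturation */
    int32_t bits = _mm_cvtsi128_si32(v);
    memcpy(out, &bits, sizeof bits);
}
```

For example, the lanes (-5, 300, 128, 255) should come out as the bytes 0, 255, 128, 255.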

If you actually need truncation instead of saturation when packing the 32-bit values back into `unsigned char`s, then you will need to do this yourself, since SSE only provides saturating packing instructions. But this can be achieved by doing a simple:

``````
xmm0 = _mm_and_si128(xmm0, _mm_set1_epi32(0xFF));
``````

right before the above packing procedure. This should amount to just 2 additional SSE instructions, or only 1 additional instruction when amortized over many pixels.
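A sketch of the truncating variant, with the mask applied before the two saturating packs (the function name `store_pixel_truncated` is hypothetical). After masking, every lane is already in [0, 255], so the saturation never changes a value:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack four 32-bit lanes into 4 bytes,
   keeping only the low 8 bits of every lane. */
static void store_pixel_truncated(__m128i v, unsigned char out[4])
{
    v = _mm_and_si128(v, _mm_set1_epi32(0xFF));  /* keep the low byte only */
    v = _mm_packs_epi32(v, v);                   /* no lane saturates now */
    v = _mm_packus_epi16(v, v);
    int32_t bits = _mm_cvtsi128_si32(v);
    memcpy(out, &bits, sizeof bits);
}
```

Here the lanes (-5, 300, 256, 255) should come out as 251, 44, 0, 255, i.e. each value modulo 256.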
