C++ SSE2 or AVX2 intrinsics for grayscale to ARGB conversion

I was wondering if there is an SSE2/AVX2 integer instruction, or a sequence of instructions (or intrinsics), that achieves the following result:

Given a row of eight 8-bit grayscale pixels of the form:

A = {a, b, c, d, e, f, g, h}

Is there any way to load these pixels into a YMM register holding eight 32-bit ARGB pixels, such that each grayscale value gets broadcast to the other 2 color bytes of its corresponding 32-bit pixel? The result should be something like this (the 0 is the alpha value):

B = {0aaa, 0bbb, 0ccc, 0ddd, 0eee, 0fff, 0ggg, 0hhh}
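To pin down the mapping, here is a minimal scalar sketch of what each output pixel should contain. The function name gray8_to_argb8_scalar is made up for illustration, and the sketch assumes alpha is the most significant byte of each 32-bit pixel.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper: expands n 8-bit gray pixels into
// n 32-bit pixels of the form 0x00GGGGGG (alpha = 0, R = G = B = gray).
void gray8_to_argb8_scalar(const std::uint8_t* src, std::uint32_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t g = src[i];
        // Copy the gray byte into the three low bytes; the top byte stays 0.
        dst[i] = g | (g << 8) | (g << 16);
    }
}
```

Any vectorized version should produce the same eight 32-bit values for the same eight input bytes.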

I'm a complete beginner in vector extensions so I'm not even sure how to approach this, or if it's at all possible.

Any help would be appreciated. Thanks!


Thanks for your answers. I still have a problem though:

I put this small example together and compiled with VS2015 on x64.

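#include <immintrin.h>
#include <malloc.h>   // _aligned_malloc, _aligned_free (MSVC)
#include <string.h>   // memset

int main()
{
    // 64 zeroed bytes, 32-byte aligned; only the first 8 hold pixel data
    unsigned char* pixels = (unsigned char*)_aligned_malloc(64, 32);
    memset(pixels, 0, 64);

    for (unsigned char i = 0; i < 8; i++)
        pixels[i] = 0xaa + i;

    // Loads 16 bytes; the zero-extension below only consumes the low 8
    __m128i grayscalePix = _mm_load_si128((const __m128i*)pixels);
    __m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);   // 8 x u8 -> 8 x u32
    __m256i mulOperand = _mm256_set1_epi32(0x00010101);

    // For x < 256: x * 0x00010101 == x | (x << 8) | (x << 16)
    __m256i result = _mm256_mullo_epi32(rgba, mulOperand);

    _aligned_free(pixels);
    return 0;
}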

The problem is that after doing

__m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);

rgba only has the first four doublewords set. The last four are all 0.

The Intel developer manual says:

VPMOVZXBD ymm1, xmm2/m64

Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.

I'm not sure whether this is intended behaviour or whether I'm still missing something.


Start with PMOVZX like Mark suggests.

But after that, PSHUFB (_mm256_shuffle_epi8) will be much faster than PMULLD, except that it competes for the shuffle port with PMOVZX. (And it only shuffles within each 128-bit lane, so you still need the PMOVZX to get the bytes into the right lanes.)

So if you only care about throughput, not latency, then _mm256_mullo_epi32 is good. But if latency matters, or if your throughput bottlenecks on something other than 2 shuffle-port instructions per vector anyway, then PSHUFB to duplicate the bytes within each pixel should be best.
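A minimal sketch of the PMOVZX + PSHUFB sequence described above. The function name and the TARGET_AVX2 macro are made up for illustration, and the shuffle control assumes the 0x00GGGGGG layout from the question (alpha in the top byte). The caller is assumed to have checked that the CPU supports AVX2.

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumption: on GCC/Clang this attribute enables AVX2 for this one function,
// so the file builds without -mavx2. Other compilers are assumed to need
// either no attribute or their own AVX2 switch.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

// Hypothetical helper: converts 8 gray bytes at src into 8 pixels at dst.
TARGET_AVX2
void gray8_to_argb8_avx2(const std::uint8_t* src, std::uint32_t* dst)
{
    // Load exactly 8 bytes into the low half of an XMM register.
    const __m128i gray = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // VPMOVZXBD: each byte becomes the low byte of its own 32-bit element.
    const __m256i wide = _mm256_cvtepu8_epi32(gray);

    // VPSHUFB control, identical in both 128-bit lanes: copy byte 0 of each
    // dword into bytes 0..2; an index with the high bit set writes zero (alpha).
    const __m256i ctrl = _mm256_setr_epi8(
        0, 0, 0, -128,  4, 4, 4, -128,  8, 8, 8, -128,  12, 12, 12, -128,
        0, 0, 0, -128,  4, 4, 4, -128,  8, 8, 8, -128,  12, 12, 12, -128);

    const __m256i argb = _mm256_shuffle_epi8(wide, ctrl);

    // Unaligned store, so dst does not need 32-byte alignment.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), argb);
}
```

Because the control vector is the same in both lanes, the in-lane restriction of VPSHUFB doesn't matter here: PMOVZX has already placed pixels 0..3 in the low lane and 4..7 in the high lane.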

Actually, even for throughput, _mm256_mullo_epi32 is bad on HSW and BDW: it's 2 uops (10c latency) that both need p0, so it can only sustain one per 2 clocks.

On SKL, it's 2 uops (10c latency) for p01, so it can sustain the same one-per-clock throughput as VPMOVZXBD. But it still costs one more uop than VPSHUFB, which makes it more likely to bottleneck somewhere else.

(VPSHUFB is 1 uop, 1c latency, for port 5, on all Intel CPUs that support AVX2.)
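For the SSE2-only case from the title, neither instruction is available: PMOVZX needs SSE4.1 and PSHUFB needs SSSE3. A rough sketch, assuming the same 0x00GGGGGG layout, is to duplicate each byte with the unpack instructions and then mask off the alpha byte. The function name is made up for illustration.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: converts 8 gray bytes at src into 8 pixels at dst
// using only SSE2 instructions.
void gray8_to_argb8_sse2(const std::uint8_t* src, std::uint32_t* dst)
{
    // Load 8 bytes: a b c d e f g h (upper half of the register is zero).
    const __m128i g = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // PUNPCKLBW with itself: aa bb cc dd ee ff gg hh
    const __m128i g2 = _mm_unpacklo_epi8(g, g);

    // PUNPCKLWD / PUNPCKHWD with itself: aaaa bbbb cccc dddd / eeee ffff gggg hhhh
    const __m128i lo = _mm_unpacklo_epi16(g2, g2);
    const __m128i hi = _mm_unpackhi_epi16(g2, g2);

    // Clear the top byte of every pixel so alpha is 0.
    const __m128i mask = _mm_set1_epi32(0x00FFFFFF);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),     _mm_and_si128(lo, mask));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4), _mm_and_si128(hi, mask));
}
```

This produces two XMM vectors per 8 input pixels instead of one YMM vector, so it is mainly useful as a fallback when AVX2 isn't available.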