Philinator - 20 days ago
C++ Question

SSE intrinsics check zero flag

I was wondering whether it is possible to check the processor's flags register by means of Intel's SSE intrinsic functions.

For example:

int idx = _mm_cmpistri(mmrange, mmstr, 0x14);
int zero = _mm_cmpistrz(mmrange, mmstr, 0x14);


In this example the compiler is able to optimize those two intrinsics into a single instruction (pcmpistri) and to check the flags register with a jump instruction (jz).
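
For reference, here is a minimal self-contained sketch of that first pattern. The names `RangeScan`, `load_padded`, `scan_block` and the `SSE42_TARGET` macro are made up for illustration and are not part of my real code; the macro assumes GCC or Clang, where a per-function target attribute lets the file build without -msse4.2. The `_SIDD_*` constants from `<nmmintrin.h>` spell out what 0x14 means.

```cpp
#include <nmmintrin.h>  // SSE4.2 string intrinsics and the _SIDD_* constants

// Assumption: GCC/Clang accept a per-function target attribute, so this
// builds without -msse4.2. MSVC needs no annotation.
#if defined(__GNUC__) || defined(__clang__)
#define SSE42_TARGET __attribute__((target("sse4.2")))
#else
#define SSE42_TARGET
#endif

// Hypothetical result type for this sketch.
struct RangeScan {
    int idx;   // first byte outside the ranges, or the terminator position; 16 if neither
    int zero;  // 1 if the 16-byte string block contains a null byte
};

// Hypothetical helper: copy up to 16 characters into a zero-padded vector.
__m128i load_padded(const char* s) {
    char buf[16] = {0};
    for (int i = 0; i < 16 && s[i] != '\0'; ++i) buf[i] = s[i];
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf));
}

// 0x14 == _SIDD_CMP_RANGES | _SIDD_NEGATIVE_POLARITY (unsigned bytes, lowest index).
SSE42_TARGET RangeScan scan_block(const char* ranges, const char* str) {
    const __m128i mmrange = load_padded(ranges);
    const __m128i mmstr = load_padded(str);
    RangeScan r;
    // Both intrinsics describe the same comparison, so one PCMPISTRI can serve both.
    r.idx = _mm_cmpistri(mmrange, mmstr, _SIDD_CMP_RANGES | _SIDD_NEGATIVE_POLARITY);
    r.zero = _mm_cmpistrz(mmrange, mmstr, _SIDD_CMP_RANGES | _SIDD_NEGATIVE_POLARITY);
    return r;
}
```

With the range pair "az" and the string "abc1def", `scan_block` gives idx == 3 (the '1') and zero == 1, because the block contains the terminator.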

However, in the following example the compiler doesn't manage to optimize the code properly:

__m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);
int zero = _mm_cmpistrz(mmoldchar, mmstr, 0x40);


Here, the compiler generates both a pcmpistrm and a pcmpistri instruction. In my opinion, the second instruction is redundant, because pcmpistrm sets the flags in the processor's flags register in the same way as pcmpistri.

So, to come back to my question: is there a way either to read the flags register directly or to instruct the compiler to generate only a pcmpistrm instruction?

Thanks in advance

Answer

Looks like just an MSVC missed-optimization bug, not anything inherent.

gcc6.2 and icc17 successfully use both results from one PCMPISTRM in a test function I wrote that branches on the zero result (on the Godbolt compiler explorer):

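#include <immintrin.h>

// Returns the PCMPISTRM mask if mmstr contains a null byte, otherwise all zeros.
__m128i foo(__m128i mmoldchar, __m128i mmstr)
{
  __m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);  // 0x40 is _SIDD_UNIT_MASK
  int zero = _mm_cmpistrz(mmoldchar, mmstr, 0x40);        // ZF of the same comparison
  if(zero)
    return mmmask;
  else
    return _mm_setzero_si128();
}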

    ##gcc6.2 -O3 -march=nehalem
    pcmpistrm       xmm0, xmm1, 64
    je      .L5
    pxor    xmm0, xmm0
    ret
.L5:
    ret

OTOH, clang3.9 fails to CSE the two intrinsics, and uses a separate PCMPISTRI to set the flags:

foo:
    movdqa  xmm2, xmm0
    pcmpistri       xmm2, xmm1, 64
    pxor    xmm0, xmm0
    jne     .LBB0_2
    pcmpistrm       xmm2, xmm1, 64
.LBB0_2:
    ret

Note that according to Agner Fog's instruction tables, PCMPISTRM has good throughput but high latency, so there's lots of room to do two in parallel if latency is the bottleneck. Jumping through hoops like using __readeflags() might actually be worse.
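
If a compiler won't merge the two intrinsics, one possible workaround (a different technique from reading the flags, and untested for speed here) is to compute the "zero" result with plain SSE2 instead of a second string instruction. For byte operands, ZF from PCMPISTRM/PCMPISTRI is set exactly when the second operand contains a null byte, so a compare against zero plus a movemask gives the same answer. `MaskResult`, `load_padded`, `mask_and_zero` and `SSE42_TARGET` are hypothetical names for this sketch, and the macro assumes GCC or Clang.

```cpp
#include <nmmintrin.h>  // SSE4.2 string intrinsics; also pulls in the SSE2 ones

// Assumption: GCC/Clang accept a per-function target attribute, so this
// builds without -msse4.2. MSVC needs no annotation.
#if defined(__GNUC__) || defined(__clang__)
#define SSE42_TARGET __attribute__((target("sse4.2")))
#else
#define SSE42_TARGET
#endif

// Hypothetical result type for this sketch.
struct MaskResult {
    __m128i mask;  // unit mask produced by PCMPISTRM
    int zero;      // 1 if mmstr contains a null byte (what _mm_cmpistrz reports)
};

// Hypothetical helper: copy up to 16 characters into a zero-padded vector.
__m128i load_padded(const char* s) {
    char buf[16] = {0};
    for (int i = 0; i < 16 && s[i] != '\0'; ++i) buf[i] = s[i];
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf));
}

SSE42_TARGET MaskResult mask_and_zero(__m128i mmoldchar, __m128i mmstr) {
    MaskResult r;
    // The only string instruction: equal-any comparison with a byte mask result.
    r.mask = _mm_cmpistrm(mmoldchar, mmstr, _SIDD_UNIT_MASK);
    // SSE2 replacement for _mm_cmpistrz: is any byte of mmstr zero?
    const __m128i nul = _mm_cmpeq_epi8(mmstr, _mm_setzero_si128());
    r.zero = _mm_movemask_epi8(nul) != 0;
    return r;
}
```

This only matches `_mm_cmpistrz` for byte-sized characters (imm8 bit 0 clear, as with 0x40); word operands would need `_mm_cmpeq_epi16` instead. Whether it beats a second PCMPISTRI depends on the surrounding code, so measure before relying on it.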
