Nikita Sivukhin Nikita Sivukhin - 2 months ago 11x
C Question

Alignment and SSE strange behaviour

I try to work with SSE and i faced with some strange behaviour.

I write simple code for comparing two strings with SSE Intrinsics, run it and it work. But later i understand, that in my code one of pointer still not aligned, but i use

instruction, which requires pointer aligned on a 16-byte boundary.

//Compare two different, not overlapping piece of memory
__attribute((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
//Skip tail for right alignment of pointer [head_1]
const char* head_1 = (const char*)src_1;
const char* head_2 = (const char*)src_2;
size_t tail_n = 0;
while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
if (*head_1 != *head_2)
return 0;
head_1++, head_2++, tail_n++;

//Vectorized part: check equality of memory with SSE4.1 instructions
//src1 - aligned, src2 - NOT aligned
const __m128i* src1 = (const __m128i*)head_1;
const __m128i* src2 = (const __m128i*)head_2;
const size_t n = (size - tail_n) / 32;
for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
printf("src1 align: %d, src2 align: %d\n", align(src1) % 16, align(src2) % 16);
__m128i mm11 = _mm_load_si128(src1);
__m128i mm12 = _mm_load_si128(src1 + 1);
__m128i mm21 = _mm_load_si128(src2);
__m128i mm22 = _mm_load_si128(src2 + 1);

__m128i mm1 = _mm_xor_si128(mm11, mm21);
__m128i mm2 = _mm_xor_si128(mm12, mm22);

__m128i mm = _mm_or_si128(mm1, mm2);

if (!_mm_testz_si128(mm, mm))
return 0;

//Check tail with scalar instructions
const size_t rem = (size - tail_n) % 32;
const char* tail_1 = (const char*)src1;
const char* tail_2 = (const char*)src2;
for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
if (*tail_1 != *tail_2)
return 0;
return 1;

I print alignment of two pointers and one of this wal aligned but second - wasn't. And program still running correctly and fast.

Then i create synthetic test like this:

//printChars128(...) function just print 16 byte values from __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
__m128i A1 = _mm_load_si128(A);
__m128i B1 = _mm_load_si128(B);

And it crashes, as we expected, on first iteration, when try load pointer B.

Interesting fact that if i switch
then my implementation of
will crash.

Another interesting fact that if i try align second pointer instead of first (so first pointer will be not aligned, second - aligned), then
will crash.

So, my question is: "Why
function works fine with only first pointer aligned if i enable
instruction generation?"

UPD: This is
code. I compile my code with
MinGW64/g++, gcc version 4.9.2
under Windows, x86.

Compile string:
g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe


TL:DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands to other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for specifically-aligned load/store instructions like vmovdqa.

In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]) , unaligned 128 bit memory operands will fault except with the special unaligned load/store instructions (like movdqu / movups).

The VEX-encoding of vector instructions (like vpxor xmm1, xmm0, [src1]) doesn't fault with unaligned memory, except with the alignment-required load/store instructions (like vmovdqa, or vmovntdq).

The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a specific instruction.

The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code-density reasons, and also fewer uops to track in parts of the CPU thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.

The interpretation of the as-if rule in this case is that it's ok for the program to not fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build).

This is opposite from the rules for floating-point exceptions, where the compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.

Note that since stores can't fold into memory operands for ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers even when compiling for an AVX target.

To be specific: consider this code fragment:

// aligned version:
y = ...;                         // assume it's in xmm1
x = _mm_load_si128(Aptr);        // Aligned pointer
res = _mm_or_si128(y, x);

// unaligned version: the same thing with _mm_loadu_si128(Uptr)

When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use
movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.

When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128 bit) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means that this actually happens except when the original value loaded is used multiple times).

Only one operand to ALU instructions can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice-versa.

You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest :( clang-3.8 does a significantly better job.

.L36: add rdi, 32 add rsi, 32 cmp rdi, rcx je .L9 .L10: vmovdqa xmm0, XMMWORD PTR [rdi] # first arg: loads that will fault on unaligned xor eax, eax vpxor xmm1, xmm0, XMMWORD PTR [rsi] # second arg: loads that don't care about alignment vmovdqa xmm0, XMMWORD PTR [rdi+16] # first arg vpxor xmm0, xmm0, XMMWORD PTR [rsi+16] # second arg vpor xmm0, xmm1, xmm0 vptest xmm0, xmm0 sete al # generate a boolean in a reg test eax, eax jne .L36 # then test&branch on it. /facepalm

Note that your is_equal is memcmp. I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and others which handle various cases of the buffers being misaligned relative to each other. (e.g. one aligned, one not.) Note that glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.

The compiler-generated byte-loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compared some bytes you've already checked.

Anyway, definitely benchmark your code with the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.