romeric - 13 days ago
C++ Question

How to solve the 32-byte-alignment issue for AVX load/store operations?

I am having an alignment issue while using AVX registers, with some snippets of code that seem fine to me. Here is a minimal working example:

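#include <cstddef>
#include <iostream>
#include <immintrin.h>

inline void ones(float *a)
{
    __m256 out_aligned = _mm256_set1_ps(1.0f);
    _mm256_store_ps(a, out_aligned);   // aligned store: a must be 32B-aligned
}

int main()
{
    size_t ss = 8;
    float *a = new float[ss];          // plain new: no 32B alignment guarantee
    ones(a);                           // faults here if a is not 32B-aligned

    delete [] a;

    std::cout << "All Good!" << std::endl;
    return 0;
}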

I'm running this on an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz and compiling with the -O3 -march=native flags. Of course the error goes away with unaligned memory access, i.e. specifying _mm256_storeu_ps. I also do not have this problem with SSE registers, i.e.

inline void ones_sse(float *a)
{
    __m128 out_aligned = _mm_set1_ps(1.0f);
    _mm_store_ps(a, out_aligned);      // aligned store: a only needs 16B alignment
}

Am I doing anything foolish? What is the work-around for this?


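The standard allocators only guarantee alignment to alignof(max_align_t): typically 16B on x86-64, and sometimes only 8B on 32-bit targets. 16B is enough for _mm_store_ps, which is why the SSE version appears to work, but it is not the 32B alignment that _mm256_store_ps requires.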
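To see what your own implementation does, a small check like the following can help. It is only a sketch: misalignment and report_new_alignment are hypothetical helper names, and the numbers printed depend on the platform and allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical helper: number of bytes p lies past the previous
// `alignment`-byte boundary (0 means p is aligned).
inline std::size_t misalignment(const void *p, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment;
}

// Hypothetical helper: print what plain new[] happened to return.
inline void report_new_alignment()
{
    float *a = new float[8];
    std::cout << "alignof(std::max_align_t): " << alignof(std::max_align_t) << "\n";
    std::cout << "offset from a 16B boundary: " << misalignment(a, 16) << "\n";
    std::cout << "offset from a 32B boundary: " << misalignment(a, 32) << "\n";
    delete[] a;
}
```

A 32B offset of 0 on one run is luck, not a guarantee; an offset of 16 is the case in which the aligned AVX store faults.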


There are several ways to get 32B-aligned memory:

  • aligned_alloc: ISO C11, and part of ISO C++ as std::aligned_alloc since C++17; before that it was available in some but not all C++ compilers. (Commenters report it's unavailable in MSVC++, but see best cross-platform method to get aligned memory for a viable #ifdef for Windows.)

  • posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared to aligned_alloc.

#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size);                // C11 (std::aligned_alloc in C++17)
  • _mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free; you have to use _mm_free. On many C and C++ implementations _mm_free and free are compatible, but it's not guaranteed to be portable. (And unlike the other two, where a missing function is a compile-time error, a mismatched free only fails at run-time.)

  • In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. The std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does.

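    Before C++17 this doesn't actually work for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>), because operator new ignores the extra alignment; see Making std::vector allocate aligned memory. C++17 added an aligned operator new, so new and std::allocator honor over-aligned types there.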
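As a rough illustration of two of the options above, here is a sketch that allocates with posix_memalign and, separately, uses alignas on a member array. The names is_aligned, alloc_floats_32 and AvxBlock are made up for this example, and it assumes a POSIX system. The AVX store is only compiled in when the compiler defines __AVX__ (e.g. with -march=native on an AVX machine); otherwise a scalar loop stands in for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>      // posix_memalign, free (POSIX)
#if defined(__AVX__)
#include <immintrin.h>   // _mm256_store_ps, _mm256_set1_ps
#endif

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool is_aligned(const void *p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: n floats on a 32B boundary, or nullptr on failure.
// Memory from posix_memalign is released with plain free().
inline float *alloc_floats_32(std::size_t n)
{
    void *p = nullptr;
    // The alignment must be a power of two and a multiple of sizeof(void*).
    if (posix_memalign(&p, 32, n * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float *>(p);
}

// The store from the question: writes 8 floats, a must be 32B-aligned.
inline void ones(float *a)
{
    assert(is_aligned(a, 32));
#if defined(__AVX__)
    _mm256_store_ps(a, _mm256_set1_ps(1.0f));
#else
    for (int i = 0; i < 8; ++i) a[i] = 1.0f;  // scalar stand-in without AVX
#endif
}

// alignas on the member raises the alignment of the whole struct, so
// static and automatic AvxBlock objects start on a 32B boundary.
struct AvxBlock {
    alignas(32) float data[8];
};
```

Typical use would be float *a = alloc_floats_32(8); ones(a); free(a); for heap storage, or AvxBlock b; ones(b.data); for a local object.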

And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. It has too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that supports Intel _mm256 intrinsics. There are even library functions that will help you do this, such as std::align from <memory> in C++11.
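For completeness, this is roughly what the over-allocate-and-round-up approach looks like when std::align does the pointer arithmetic. OverAllocated and over_allocate_32 are hypothetical names for this sketch; note how the caller has to carry two pointers around, which is the "hard to free" drawback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::malloc, std::free
#include <memory>    // std::align

// Hypothetical result type for this sketch.
struct OverAllocated {
    void  *raw;      // what malloc returned; this is what goes to free
    float *aligned;  // first 32B-aligned address inside the buffer
};

inline OverAllocated over_allocate_32(std::size_t n)
{
    const std::size_t bytes = n * sizeof(float);
    std::size_t space = bytes + 31;   // worst case: 31 bytes of padding
    void *raw = std::malloc(space);
    if (!raw)
        return {nullptr, nullptr};
    void *p = raw;
    // std::align moves p up to the next 32B boundary and shrinks space by
    // the padding it skipped; it returns nullptr if bytes no longer fit.
    void *aligned = std::align(32, bytes, p, space);
    assert(aligned != nullptr);       // cannot fail with 31 spare bytes
    return {raw, static_cast<float *>(aligned)};
}
```

The buffer is used through .aligned and released with std::free(block.raw); passing .aligned to free would be a bug whenever any padding was skipped.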

The requirement to use _mm_free instead of free probably exists to allow for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique.
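To make that concrete, here is a sketch of how an aligned allocator could be layered on malloc. This is not how any particular _mm_malloc is implemented; sketch_aligned_malloc and sketch_aligned_free are hypothetical names. The original malloc pointer is stashed just below the aligned block, which is why the matching free function has to be used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::malloc, std::free

// Hypothetical allocator: alignment must be a power of two and at least
// sizeof(void*), so the stashed pointer itself is properly aligned.
inline void *sketch_aligned_malloc(std::size_t size, std::size_t alignment)
{
    assert(alignment >= sizeof(void *));
    assert((alignment & (alignment - 1)) == 0);
    // Room for the payload, worst-case padding, and one stashed pointer.
    void *raw = std::malloc(size + alignment - 1 + sizeof(void *));
    if (!raw)
        return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;   // round up
    // Remember what malloc returned in the slot just below the block.
    reinterpret_cast<void **>(aligned)[-1] = raw;
    return reinterpret_cast<void *>(aligned);
}

// The matching free reads the stashed pointer back. Handing the aligned
// pointer to plain free() would pass an address malloc never returned.
inline void sketch_aligned_free(void *p)
{
    if (p)
        std::free(static_cast<void **>(p)[-1]);
}
```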