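I am having an alignment issue while using:
#include <immintrin.h>
#include <iostream>
inline void ones(float *a) {
    __m256 out_aligned = _mm256_set1_ps(1.0f);
    _mm256_store_ps(a, out_aligned);  // aligned 32-byte store
}
int main() {
    size_t ss = 8;
    float *a = new float[ss];
    ones(a);
    delete[] a;  // array form, matching new[]
    std::cout << "All Good!" << std::endl;
}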
inline void ones_sse(float *a) {
    __m128 out_aligned = _mm_set1_ps(1.0f);
    _mm_store_ps(a, out_aligned);  // aligned 16-byte store
}
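The standard allocators (new, malloc) only guarantee alignment suitable for the standard fundamental types, which is typically 8 or 16 bytes depending on the platform, not the 32 bytes that an aligned AVX load/store requires.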
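To confirm that alignment is the culprit, it can help to look at what the allocator actually returns. The following is only a minimal sketch; is_aligned, default_alignment and new_result_is_avx_aligned are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a multiple of `alignment` bytes.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void *p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// The alignment that is sufficient for every standard fundamental type.
constexpr std::size_t default_alignment() {
    return alignof(std::max_align_t);
}

// new[] is not required to return 32-byte-aligned storage for float[8],
// so this may return true on one run and false on another.
inline bool new_result_is_avx_aligned() {
    float *a = new float[8];
    bool ok = is_aligned(a, 32);
    delete[] a;
    return ok;
}
```

If the check fails, the options are to allocate with an aligned allocator or to use the unaligned store intrinsic (_mm256_storeu_ps) instead of _mm256_store_ps.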
aligned_alloc: ISO C11, and part of ISO C++ since C++17 as std::aligned_alloc; before that it was available in some but not all C++ compilers. (Commenters report it's unavailable in MSVC++, but see "best cross-platform method to get aligned memory" for a viable #ifdef for Windows.)
posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Its prototype/interface is clunky by comparison, as the declarations show:
#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size);                // C11
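As a rough illustration of how the two interfaces differ in use, here is a sketch that wraps each one in a hypothetical helper (alloc_floats_32 and alloc_floats_32_c11 are made-up names). It assumes a POSIX-like platform whose <stdlib.h> exposes both functions to C++; that is not the case everywhere, MSVC in particular.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, aligned_alloc, free

// Hypothetical helper: `count` floats on a 32-byte boundary via POSIX.
// Returns nullptr on failure; release the result with free().
inline float *alloc_floats_32(std::size_t count) {
    void *p = nullptr;
    // The error code comes back as the return value, and the pointer
    // comes back through an out-parameter.
    if (posix_memalign(&p, 32, count * sizeof(float)) != 0) {
        return nullptr;
    }
    return static_cast<float *>(p);
}

// Hypothetical helper: the same allocation via C11 aligned_alloc.
// Release the result with free().
inline float *alloc_floats_32_c11(std::size_t count) {
    std::size_t bytes = count * sizeof(float);
    // C11 as published asks for a size that is a multiple of the
    // alignment, so rounding up is the conservative choice.
    bytes = (bytes + 31) / 32 * 32;
    return static_cast<float *>(aligned_alloc(32, bytes));
}
```

Both results can be handed to plain free, which is what distinguishes these two from _mm_malloc.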
_mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free; you have to release them with _mm_free. On many C and C++ implementations _mm_free and free are compatible, but that's not guaranteed to be portable. (And unlike the other two, a mismatch will fail at run time, not compile time.)
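One way to make the _mm_malloc/_mm_free pairing hard to get wrong is to tie the deallocation function to the pointer type. This is a sketch only, assuming an x86 compiler that provides <immintrin.h>; MmFreeDeleter, AlignedFloats and make_aligned_floats are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // _mm_malloc, _mm_free (x86 compilers)
#include <memory>

// Deleter that always routes the pointer back to _mm_free,
// never to free or delete.
struct MmFreeDeleter {
    void operator()(float *p) const { _mm_free(p); }
};

using AlignedFloats = std::unique_ptr<float[], MmFreeDeleter>;

// Hypothetical helper: `count` floats on a 32-byte boundary.
// The returned pointer is empty if the allocation failed.
inline AlignedFloats make_aligned_floats(std::size_t count) {
    void *p = _mm_malloc(count * sizeof(float), 32);
    return AlignedFloats(static_cast<float *>(p));
}
```

With this wrapper the buffer is released through _mm_free when the unique_ptr goes out of scope, so the run-time failure mode described above cannot be triggered by accident.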
In C++11 and later: use alignas(32) float avx_array[N] (for some array size N) as the first member of a struct/class (or on a plain array directly) so that static and automatic storage objects of that type have 32B alignment.
The std::aligned_storage documentation has an example of this technique to explain what it does.
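Before C++17, this doesn't actually work for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>); see "Making std::vector allocate aligned memory". C++17 added aligned forms of operator new, so new and std::allocator honor over-aligned types from that standard on.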
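For the static and automatic storage cases, the alignas approach can be sketched as follows. AvxBuffer, g_buffer, is_32_aligned and stack_buffer_is_aligned are hypothetical names, and the array size of 8 is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type: the over-aligned array member raises the alignment
// of the whole struct to 32 bytes.
struct AvxBuffer {
    alignas(32) float data[8];
};

static_assert(alignof(AvxBuffer) == 32, "struct takes the member's alignment");
static_assert(sizeof(AvxBuffer) % 32 == 0, "size is padded to the alignment");

static AvxBuffer g_buffer;  // static storage: 32-byte aligned

inline bool is_32_aligned(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

inline bool stack_buffer_is_aligned() {
    AvxBuffer local;             // automatic storage: 32-byte aligned
    alignas(32) float plain[8];  // alignas applied to a plain array
    return is_32_aligned(local.data) && is_32_aligned(plain);
}
```

The static_asserts are checked at compile time, so a wrong alignment on the type is caught before the program ever runs.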
And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. It has too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that supports Intel _mm256 intrinsics. But there are even library functions that will help you do this, IIRC.
The requirement to use _mm_free instead of free probably exists to allow for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique.
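To make the over-allocate-and-round-up technique concrete, and to show why a matching free function is then needed, here is a sketch with hypothetical names (sketch_aligned_malloc, sketch_aligned_free). It is not how any particular _mm_malloc is implemented, and it assumes the alignment is a power of two no smaller than sizeof(void *).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: aligned allocation layered on plain malloc.
inline void *sketch_aligned_malloc(std::size_t size, std::size_t alignment) {
    // Room for the payload, worst-case padding, and one stashed pointer.
    void *raw = std::malloc(size + alignment - 1 + sizeof(void *));
    if (raw == nullptr) return nullptr;
    std::uintptr_t start =
        reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
    // Round up: the p += 31; p &= ~31ULL step, generalized.
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;
    // Remember what malloc returned, just below the aligned block.
    reinterpret_cast<void **>(aligned)[-1] = raw;
    return reinterpret_cast<void *>(aligned);
}

// The aligned pointer is not what malloc returned, so plain free would be
// handed an address it never issued; recover the stashed original instead.
inline void sketch_aligned_free(void *p) {
    if (p != nullptr) std::free(static_cast<void **>(p)[-1]);
}
```

The wasted bytes and the need for a dedicated free function are exactly the drawbacks mentioned above.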