youpilat13 - 2 months ago
C Question

Checking vectorization in a simple example with gcc-4.9

I am trying to check whether a simple loop is vectorized. I am working on MacOS 19.5 and my code is compiled with gcc-mp-4.9 (installed from MacPorts). To see whether vectorization improves performance, I measure the elapsed time of the main loop and compare it with a non-vectorized version.

Here is the code (compiled with either the -DNOVEC or the -DVEC flag):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 1000000000

#ifdef NOVEC
void addition_tab_novec(double *a, double *b, double *c)
{
  int i;

  for (i=0; i<SIZE; i++)
    c[i] = a[i] + b[i];
}
#endif

#ifdef VEC
void addition_tab_vec(double * restrict a, double * restrict b, double * restrict c)
{
  int i;

  double *x = __builtin_assume_aligned(a, 16);
  double *y = __builtin_assume_aligned(b, 16);
  double *z = __builtin_assume_aligned(c, 16);

  for (i=0; i<SIZE; i++)
    z[i] = x[i] + y[i];
}
#endif

int main(int argc, char *argv[])
{
  // Array index
  int i;

  // Two input arrays and one output array
  double *tab_x;
  double *tab_y;
  double *tab_z;

  // Time elapsed
  time_t time1, time2;

  // Allocation
  tab_x = (double*) malloc(SIZE*sizeof(double));
  tab_y = (double*) malloc(SIZE*sizeof(double));
  tab_z = (double*) malloc(SIZE*sizeof(double));

  // Initialization
  for (i=0; i<SIZE; i++)
  {
    tab_x[i] = i;
    tab_y[i] = 2*i;
    tab_z[i] = 0.0;
  }

#ifdef NOVEC
  // Start time without vectorization
  time(&time1);

  // Addition function
  addition_tab_novec(tab_x, tab_y, tab_z);

  // Compute elapsed time without vectorization
  time(&time2);

  printf("No Vectorization - Time elapsed = %f seconds\n", difftime(time2, time1));
#endif

#ifdef VEC
  // Start time for vectorization
  time(&time1);

  // Addition function
  addition_tab_vec(tab_x, tab_y, tab_z);

  // Compute elapsed time for vectorization
  time(&time2);

  printf("Vectorization - Time elapsed = %f seconds\n", difftime(time2, time1));
#endif

  return 0;
}


My issue is that I don't get better results with the vectorized version than with the non-vectorized one.

Given that I use __builtin_assume_aligned(array, 16), i.e. a 16-byte alignment, I expected the elapsed time of the measured loop to be halved (I use double arrays, with sizeof(double) = 8 bytes).

But actually, I get 60 seconds without vectorization and 59 seconds with it. How should I interpret these nearly identical results?

Here are the compilation command lines for the two cases:

No vectorization:

gcc-mp-4.9 -DNOVEC -std=c99 -fno-tree-vectorize main_benchmark.c


Vectorization:

gcc-mp-4.9 -DVEC -std=c99 -Wa,-q -O3 -march=native -ftree-vectorize -fopt-info-vec main_benchmark.c


I am not sure whether optimization is really disabled in the non-vectorized build. If it is not, how can I disable it?

Thanks for your help

Answer

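First of all, the __builtin_assume_aligned variant can be removed: the vectorizer handles alignment itself by peeling the loop, and it emits loads and stores that tolerate unaligned data (note the "loop peeled for vectorization to enhance alignment" lines and the vmovupd instructions in the output below). But you are right that explicit alignment and restrict improve the code.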

gcc-mp-6 -g -DVEC -std=c99 -Wa,-q -O3 -march=native -ftree-vectorize -fopt-info-vec -ftree-vectorizer-verbose=2 vec-sample.c -o vec-vec2

vec-sample.c:12:2: note: loop vectorized
vec-sample.c:12:2: note: loop versioned for vectorization because of possible aliasing
vec-sample.c:12:2: note: loop peeled for vectorization to enhance alignment
vec-sample.c:50:3: note: loop vectorized
vec-sample.c:50:3: note: loop peeled for vectorization to enhance alignment
vec-sample.c:12:2: note: loop vectorized
vec-sample.c:12:2: note: loop peeled for vectorization to enhance alignment

$ otool -tv vec-vec2

...
0000000100000920    vmovupd (%r10,%rax), %ymm0
0000000100000926    vaddpd  (%r14,%rax), %ymm0, %ymm0
000000010000092c    addl    $0x1, %ecx
000000010000092f    vmovupd %ymm0, (%r8,%rax)
...
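
To illustrate the first point, here is a minimal sketch of how the alignment promise could be made safe rather than assumed. It is not the code from the question: the names alloc_doubles_aligned, add_arrays and VEC_ALIGN are hypothetical, the 32-byte alignment is an assumption chosen to match the ymm registers in the disassembly above, and posix_memalign is assumed to be available (POSIX systems, including macOS).

```c
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign under strict -std=c99 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 32  /* assumption: AVX-sized (256-bit) vectors */

/* Hypothetical helper: allocate n doubles on a VEC_ALIGN-byte boundary.
   Returns NULL on failure; release the memory with free(). */
static double *alloc_doubles_aligned(size_t n)
{
  void *p = NULL;

  if (posix_memalign(&p, VEC_ALIGN, n * sizeof(double)) != 0)
    return NULL;
  return p;
}

/* restrict promises that the three arrays do not overlap;
   __builtin_assume_aligned promises the stated alignment.
   Both are promises the caller has to keep. */
static void add_arrays(const double * restrict a, const double * restrict b,
                       double * restrict c, size_t n)
{
  /* Debug-build check that the alignment promise actually holds. */
  assert((uintptr_t)a % VEC_ALIGN == 0);
  assert((uintptr_t)b % VEC_ALIGN == 0);
  assert((uintptr_t)c % VEC_ALIGN == 0);

  const double *x = __builtin_assume_aligned(a, VEC_ALIGN);
  const double *y = __builtin_assume_aligned(b, VEC_ALIGN);
  double *z = __builtin_assume_aligned(c, VEC_ALIGN);

  for (size_t i = 0; i < n; i++)
    z[i] = x[i] + y[i];
}
```

Passing the length as a parameter instead of using the SIZE macro is only a convenience for trying the function on small arrays; whether the hints change the generated code for a given compiler version is something to confirm with -fopt-info-vec.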

Second, the addition in the loop is vectorized (see vaddpd above), but the setup and malloc dominate the run time. There is not much left to measure in the loop itself.

To measure the loop (macOS only), I used:

#include <mach/mach_time.h>
#define SIZE 1000000
...
  // Time elapsed
  uint64_t t1, t2;
...
  // Start time for vectorization
  t1 = mach_absolute_time();

  // Addition function
  addition_tab_vec(tab_x, tab_y, tab_z);

  // Compute elapsed time for vectorization
  t2 = mach_absolute_time();

  // uint64_t is unsigned long long on macOS, so cast and print with %llu
  printf("Vectorization - Time elapsed = %llu ticks\n", (unsigned long long)(t2-t1));

which gives you a measurable 20% overhead with the vectorized code, and 62% without vectorization at -O0.
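
Outside macOS, mach_absolute_time is not available. The following is a rough portable sketch of the same idea, assuming only the standard C clock() function; the names add_arrays and best_time_seconds are hypothetical. Note that time(), as used in the question, typically counts whole seconds, which is too coarse for a loop like this, and that clock() reports processor time rather than wall-clock time (close enough for a single-threaded loop).

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel under test: element-wise sum of two arrays. */
static void add_arrays(const double *a, const double *b, double *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

/* Run the kernel `repeats` times and return the smallest processor time
   observed, in seconds. Taking the minimum reduces the influence of
   one-off effects such as page faults on the first pass.
   Returns -1.0 if repeats is not positive. */
static double best_time_seconds(const double *a, const double *b, double *c,
                                size_t n, int repeats)
{
  double best = -1.0;

  for (int r = 0; r < repeats; r++) {
    clock_t start = clock();
    add_arrays(a, b, c, n);
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (best < 0.0 || seconds < best)
      best = seconds;
  }
  return best;
}
```

For the comparison to isolate the vectorizer, both builds should use the same optimization level and differ only in -fno-tree-vectorize. GCC defaults to -O0 when no -O flag is given, so the no-vectorization command line in the question is an unoptimized build.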

http://locklessinc.com/articles/vectorize/ talks about the details a bit more.