FiniteElement - 4 months ago
C Question

Loop unrolling in inlined functions in C

I have a question about C compiler optimization and when/how loops in inline functions are unrolled.

I am developing a numerical code which does something like the example below. Basically, my_for() would compute some kind of stencil and call op() to do something with the data in my_type *arg for each i. Here, my_func() wraps my_for(), creating the argument and sending the function pointer to my_op(), whose job it is to modify the i-th double for each of the (arg->n) double arrays in arg->dest.

typedef struct my_type {
    int const n;               /* number of arrays actually in use */
    double *dest[16];
    double const *src[16];
} my_type;

/* Calls op( arg, i ) for every i in [0, N). */
static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
    int i;

    for( i=0; i<N; ++i )
        op( arg, i );
}

/* Updates the i-th element of each of the arg->n destination arrays. */
static inline void my_op( my_type *arg, int i ) {
    int j;
    int const n = arg->n;

    for( j=0; j<n; ++j )
        arg->dest[j][i] += arg->src[j][i];
}

void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
    my_type Arg = {
        .n = 2,
        .dest = { dest0, dest1 },
        .src = { src0, src1 }
    };

    my_for( &my_op, &Arg, N );
}

This works fine. The functions are inlined as they should be, and the code is (almost) as efficient as having written everything inline in a single function with the j loop unrolled, without any sort of my_type Arg.
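For comparison, the single-function baseline described above might look something like the sketch below. The name my_func_flat is made up for illustration and does not appear in the original code; the arithmetic is the same as in my_func() with two arrays.

```c
/* Hypothetical baseline: one function, no my_type argument, and the
 * j loop (two iterations) unrolled by hand. */
void my_func_flat( double *dest0, double *dest1,
                   double const *src0, double const *src1, int N ) {
    int i;

    for( i=0; i<N; ++i ) {
        dest0[i] += src0[i];   /* j = 0 */
        dest1[i] += src1[i];   /* j = 1 */
    }
}
```

This is the version the inlined my_for()/my_op() combination is being measured against.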

Here’s the confusion: if I set int const n = 2; rather than int const n = arg->n; in my_op(), then the code becomes as fast as the unrolled single-function version. So, the question is: why? If everything is being inlined into my_func(), why doesn’t the compiler see that I am literally defining Arg.n = 2? Furthermore, there is no improvement when I use arg->n directly as the bound of the j loop, which should look just like the speedier int const n = 2; after inlining. I also tried using my_type const everywhere to really signal this const-ness to the compiler, but it just doesn't want to unroll the loop.

In my numerical code, this amounts to about a 15% performance hit. If it matters, in that code these j loops appear in a couple of conditional branches.

I am compiling with icc (ICC) 12.1.5 20120612. I tried #pragma unroll. Here are my compiler options (did I miss any good ones?):

-O3 -ipo -static -unroll-aggressive -fp-model precise -fp-model source -openmp -std=gnu99 -Wall -Wextra -Wno-unused -Winline -pedantic



Well, evidently the compiler isn't 'smart' enough to propagate the constant n and unroll the for loop. It plays it safe because, as far as it can tell, arg->n could change between initialization and use.

In order to have consistent performance across compiler generations and squeeze the maximum out of your code, do the unrolling by hand.
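A minimal sketch of what unrolling by hand could look like inside the question's framework, assuming exactly two arrays. The names my_op_2 and my_func_2 are invented for this example; the struct and my_for() are copied from the question so the block compiles on its own.

```c
typedef struct my_type {
    int const n;
    double *dest[16];
    double const *src[16];
} my_type;

static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
    int i;

    for( i=0; i<N; ++i )
        op( arg, i );
}

/* Hypothetical hand-unrolled op for exactly two arrays.
 * arg->n is never read, so there is no loop bound left to propagate. */
static inline void my_op_2( my_type *arg, int i ) {
    arg->dest[0][i] += arg->src[0][i];
    arg->dest[1][i] += arg->src[1][i];
}

void my_func_2( double *dest0, double *dest1,
                double const *src0, double const *src1, int N ) {
    my_type Arg = {
        .n = 2,
        .dest = { dest0, dest1 },
        .src = { src0, src1 }
    };

    my_for( &my_op_2, &Arg, N );
}
```

The trade-off is one op function per array count, which is the duplication the macro approach tries to avoid.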

What people like myself do in these situations (performance is king) is rely on macros.

Macros will 'inline' even in debug builds (useful) and can be templated (to a point) using macro parameters. Macro parameters which are compile-time constants are guaranteed to stay that way.
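As a rough illustration of the macro approach, here is one way it might be written. MY_OP and my_func_macro are hypothetical names, not part of the question's code, and the struct is repeated only to keep the block self-contained. The trip count is passed as the macro parameter NCONST, so the compiler sees a literal loop bound at every expansion.

```c
typedef struct my_type {
    int const n;
    double *dest[16];
    double const *src[16];
} my_type;

/* Hypothetical macro version of my_op(): NCONST is expected to be an
 * integer literal, so the j loop bound is a compile-time constant. */
#define MY_OP( ARG, I, NCONST )                                   \
    do {                                                          \
        int j_;                                                   \
        for( j_=0; j_<(NCONST); ++j_ )                            \
            (ARG)->dest[j_][(I)] += (ARG)->src[j_][(I)];          \
    } while( 0 )

void my_func_macro( double *dest0, double *dest1,
                    double const *src0, double const *src1, int N ) {
    my_type Arg = {
        .n = 2,
        .dest = { dest0, dest1 },
        .src = { src0, src1 }
    };
    int i;

    for( i=0; i<N; ++i )
        MY_OP( &Arg, i, 2 );   /* expands in place, even at -O0 */
}
```

Whether a given compiler then unrolls the two-iteration loop is still its decision; the macro only guarantees that the bound it sees is the literal 2.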