kriss - 4 months ago 24

C Question

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 ints?

As the question is very low level:

- we can't assume libraries are available (and the call itself has its cost), only plain C
- to avoid emptying instruction pipeline (that has a
*very*high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||). - room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.

Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening` code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.

As for why it is interesting, there is several layers:

- the example is simple and easy to understand and measure, not much C skill involved
- it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

`#include <stdio.h>`

static __inline__ int sort6(int * d){

char j, i, imin;

int tmp;

for (j = 0 ; j < 5 ; j++){

imin = j;

for (i = j + 1; i < 6 ; i++){

if (d[i] < d[imin]){

imin = i;

}

}

tmp = d[j];

d[j] = d[imin];

d[imin] = tmp;

}

}

static __inline__ unsigned long long rdtsc(void)

{

unsigned long long int x;

__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));

return x;

}

int main(int argc, char ** argv){

int i;

int d[6][5] = {

{1, 2, 3, 4, 5, 6},

{6, 5, 4, 3, 2, 1},

{100, 2, 300, 4, 500, 6},

{100, 2, 3, 4, 500, 6},

{1, 200, 3, 4, 5, 600},

{1, 1, 2, 1, 2, 1}

};

unsigned long long cycles = rdtsc();

for (i = 0; i < 6 ; i++){

sort6(d[i]);

/*

* printf("d%d : %d %d %d %d %d %d\n", i,

* d[i][0], d[i][6], d[i][7],

* d[i][8], d[i][9], d[i][10]);

*/

}

cycles = rdtsc() - cycles;

printf("Time is %d\n", (unsigned)cycles);

}

As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).

I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).

- Direct call to qsort library function : 689.38
- Naive implementation (insertion sort) : 285.70
- Insertion Sort (Daniel Stutzbach) : 142.12
- Insertion Sort Unrolled : 125.47
- Rank Order : 102.26
- Rank Order with registers : 58.03
- Sorting Networks (Daniel Stutzbach) : 111.68
- Sorting Networks (Paul R) : 66.36
- Sorting Networks 12 with Fast Swap : 58.86
- Sorting Networks 12 reordered Swap : 53.74
- Sorting Networks 12 reordered Simple Swap : 31.54
- Reordered Sorting Network w/ fast swap : 31.54
- Reordered Sorting Network w/ fast swap V2 : 33.63
- Inlined Bubble Sort (Paolo Bonzini) : 48.85
- Unrolled Insertion Sort (Paolo Bonzini) : 75.30

- Direct call to qsort library function : 705.93
- Naive implementation (insertion sort) : 135.60
- Insertion Sort (Daniel Stutzbach) : 142.11
- Insertion Sort Unrolled : 126.75
- Rank Order : 46.42
- Rank Order with registers : 43.58
- Sorting Networks (Daniel Stutzbach) : 115.57
- Sorting Networks (Paul R) : 64.44
- Sorting Networks 12 with Fast Swap : 61.98
- Sorting Networks 12 reordered Swap : 54.67
- Sorting Networks 12 reordered Simple Swap : 31.54
- Reordered Sorting Network w/ fast swap : 31.24
- Reordered Sorting Network w/ fast swap V2 : 33.07
- Inlined Bubble Sort (Paolo Bonzini) : 45.79
- Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because surprisingly for several programs O2 is

As expected minimizing branches is indeed a good idea.

Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).

The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?

As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.

It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimising this particular code (there was very little difference for other proposals).

As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do

`SWAP(1, 2); SWAP(0, 2);`

`SWAP(1, 2); SWAP(4, 5);`

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

The "best" code is now as follow:

`static inline void sort6_sorting_network_simple_swap(int * d){`

#define min(x, y) (x<y?x:y)

#define max(x, y) (x<y?y:x)

#define SWAP(x,y) { const int a = min(d[x], d[y]);

const int b = max(d[x], d[y]);

d[x] = a; d[y] = b; }

SWAP(1, 2);

SWAP(4, 5);

SWAP(0, 2);

SWAP(3, 5);

SWAP(0, 1);

SWAP(3, 4);

SWAP(1, 4);

SWAP(0, 3);

SWAP(2, 5);

SWAP(1, 3);

SWAP(2, 4);

SWAP(2, 3);

#undef SWAP

#undef min

#undef max

}

If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

Answer

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted dat, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

```
static __inline__ int sort6(int *d){
int i, j;
for (i = 1; i < 6; i++) {
int tmp = d[i];
for (j = i; j >= 1 && tmp < d[j-1]; j--)
d[j] = d[j-1];
d[j] = tmp;
}
}
```

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

```
static __inline__ int sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
SWAP(1, 2);
SWAP(0, 2);
SWAP(0, 1);
SWAP(4, 5);
SWAP(3, 5);
SWAP(3, 4);
SWAP(0, 3);
SWAP(1, 4);
SWAP(2, 5);
SWAP(2, 4);
SWAP(1, 3);
SWAP(2, 3);
#undef SWAP
}
```