foxes foxes - 2 months ago 21
C++ Question

Parallel radix sort using a provided number of threads

I'm having some difficulty understanding the concepts of parallel radix sort using threads.

If we do it using the Most Significant Digit method, we can start by creating buckets 1 to 9 and divide the numbers up into the buckets using their MSD.

You could sort this in parallel by having 1 thread per bucket.

However if we had to do it with a given number of processors, say 4, how would you split those 9 buckets up into four processors?

A diagram that I saw online seemed to suggest that you would begin by dividing the numbers into x number of partitions for the x processors (without doing any sorting) and then each processor sorts all the numbers for their given partition. But then you would be left with x number buckets each sorted by itself, but not the entire vector/array of numbers sorted, and I'm not sure what you would do next.


After sorting the x partitions, you would merge the x partitions. This could be done with log2(x) passes doing a 2 way merge, or a single pass doing an x way merge. An x way merge commonly uses a heap to speed up determining which of x elements (and which partition the element belongs to) is the smallest. Other than initialization, elements are removed and added to the heap one at a time. Eventually the end of one of the x partitions is reached, and the merge becomes an x-1 merge. Finally just one partition is left and it's copied to the output array.

If the array is so large that it's partitions are much larger than the core specific L1 and L2 caches, parallel radix sort won't help much due to conflicts since all cores share the common L3 (and L4 if present) cache and main memory. For example, each core's L1 cache might be 32KB and each core's L2 cache might be 256KB.

An alternative would be to split up the array into z 256KB partitions, and then do x radix sorts at a time for the z partitions. The last step would be a z way merge of the z sorted 256KB partitions.

As for the radix sort, instead of using base 10, you may want to use base 256 for 8 bit fields (you can shift instead dividing to split up an element into fields).