1'' 1'' - 28 days ago 13
C++ Question

Parallelizing std::nth_element and std::partition

I'm porting C++ code which uses

std::nth_element
and
std::partition
to OpenCL.

nth_element
is a selection algorithm which places the nth-smallest number in an array at the nth spot, and arranges the remaining elements so that all elements less than this number are before it in the array, and all elements greater than it are after it. In effect,
nth_element
sorts the array into 3 buckets: the number itself, all the numbers less than it, and all the numbers greater than it.

Canonically,
nth_element
is implemented using a recursive partition: an element is chosen, the elements are partitioned based on whether they're less than this element. Then, pick the bucket which contains the nth element of the array and recurse on that bucket. The main difference between
nth_element
and a full quicksort is that quicksort recurses on both buckets, not just the one containing the nth element.




partition
is a weaker version of
nth_element
which only sorts the array into 2 buckets: those for which a condition is true, and those for which it is false. The site I linked to gives the implementation:

while (first!=last) {
while (pred(*first)) {
++first;
if (first==last) return first;
}
do {
--last;
if (first==last) return first;
} while (!pred(*last));
swap (*first,*last);
++first;
}
return first;


where pred is a function that evaluates whether an element should be in the first bucket. Basically, this function iteratively finds the outermost pair of elements of the array which are in the wrong place, and swaps them, stopping when the pair of elements are the same element.




Here are my initial ideas on parallelizing
nth_element
and
partition
:

Partition could be implemented using atomic comparisons and swaps, but I'm not sure how to cover all the possible pairs of values which could be swapped. There's no obvious way to divide the work among multiple threads, since partition requires comparing elements which may be either right next to each other or at opposite ends of the array. I also don't see a way to avoid having thread B compare with an element which has already been swapped by thread A, which is inefficient.

nth_element seems even less parallelizable, since it is recursive: each partition depends upon the elements having been partially ordered by the previous partition.

Presumably, an efficient parallelization strategy will require a completely different approach than the typical serial code, for both functions.




Do efficient parallel implementations of
nth_element
and
partition
already exist? If not, what is a good parallelization strategy?

Answer

Cuda THRUST has partition function implemented (http://docs.nvidia.com/cuda/thrust/index.html#reordering).

The main idea should be following: Using prefix sums to calculate position of element in the array and then rearrange the array.

Comments