Serge Rogatch - 1 month ago 15

C++ Question

Consider a bit vector of

`N`

`N`

`M`

`M`

`N`

`0..N-1`

`1`

`__m256i`

`__m256i`

How can this work be split efficiently accross multiple threads?

Preferred language is C++ (MSVC++2017 toolset v141), assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if any benefit from it.

Answer Source

Let's assume you want to divide this work up among `T`

threads. It's a pretty interesting problem since it isn't trivially parallelizable via partitioning and various solutions may apply for different sizes of `N`

and `M`

.

You could simply divide up the array `M`

into `T`

partitions and have each thread work on its own partition of `M`

with a shared `N`

. The main problem is that since `M`

is not sorted, all threads may access any element of `N`

and hence stomp on each others work. To avoid this, you'd have to use atomic operations such as compare-and-swap for each modification of the shared `N`

array, or come up with some locking scheme. Both approaches are likely to kill performance (i.e., using an atomic operation to set a bit is likely to be an order of magnitude slower than the equivalent single-threaded code).

Let's look at ideas that are likely faster.

One relatively obvious idea to avoid the "shared N" problem which requires atomic operations for all mutations of N is simply to give each T a private copy of N and merge them at the end via `or`

.

Unfortunately, this solution is `O(N) + O(M/T)`

whereas the original single-threaded solution and the "atomic" solution above are simply `O(M)`

. Since we know that `N >> M`

this is likely to be a poor tradeoff in this case. Still, it's worth noting that the hidden constants in each term are very different: the `O(N)`

term, which comes from the merging step^{0} can use 256-bit wide `vpor`

instructions, meaning a throughput of something close to 200-500 bits/cycle (if cached), while the bit-setting step which is `O(M/T)`

I estimate at closer to 1 bit/cycle. So this approach can certainly be the best one for moderate T even if the size of `N`

is 10 or 100 times the size of `M`

.

The basic idea here is to partition the indexes in `M`

such that each worker thread can then work on a disjoint part of the `N`

array. If `M`

was sorted, that would be trivial, but it's not, so...

A simple algorithm that will work well if `M`

is *smoothly distributed* is to first partition that values of `M`

into `T`

buckets, with the buckets having values in the ranges `[0, N/T), [N/T, 2N/T], ..., [(T-1)N/T, N)`

. That is, divide `N`

into `T`

disjoint regions and then find the values of `M`

that fall into each of them. You can spread that work across the `T`

threads by assigning each thread an equal size chunk of `M`

, and having them each create the `T`

partitions and then *logically merging*^{1} them at the end so you have the `T`

partitions of `M`

.

The second step is to actually set all the bits: you assign one partition to each thread `T`

which can set the bits in a "single threaded" way, i.e., not worrying about concurrent updates, since each thread is working on a disjoint partition of `N`

^{2}.

Both steps `O(M)`

and the second step is identical to the single-threaded case, so the overhead for parallelizing this is the first step. I suspect the first will range from about the same speed as the second to perhaps 2-4 times as slow, depending on implementation and hardware, so you can expect a speedup on a machine with many cores, but with only 2 or 4 it might not be any better.

If the distribution of `M`

is not *smooth*, such that the partitions created in the first step have very different sizes, it will work poorly because some threads will get a lot more work. A simple strategy is to create say `10 * T`

partitions, rather than only `T`

and have the threads in the second pass all consume from the same queue of partitions until complete. In this way you spread the work more evenly, unless the array `M`

is very bunched up. In that case you might consider a refinement of the first step which first essentially creates a bucketed histogram of the elements, and then a reduce stage which looks at the combined histogram to create a good partitioning.

Essentially, we are just progressively refining the first stage into a type of parallel sort/partitioning algorithm, for which there is already lots of literature. You might even find that a full (parallel) sort is fastest, since it will greatly help in bit-setting phase, since accesses will be in-order and have the best spatial locality (helping with prefetching and caching, respectively).

^{0} ... and also from the "allocate a private array of length N" step, although this is likely to be quite fast.

^{1} The conceptually simplest form of merging would be to simply copy each thread's partitions of M such that you have a contiguous partition of all of `M`

, but in practice if the partitions are large you can just leave the partitions where they are and link them together, adding some complexity to the consuming code, but avoiding the compacting step.

^{2} To make it truly disjoint from a threading point of view you want to ensure the partition of `N`

falls on "byte boundaries", and perhaps even cache-line boundaries to avoid false sharing (although the latter is likely not to be a big problem since it only occurs at the edge of each partition, and the order of processing means that you are not likely to get contention).