LocalVolatility LocalVolatility - 2 months ago 7
C++ Question

Why is std::none_of faster than a hand rolled loop?

I benchmarked the performance of

std::none_of
against a three different manual implementations using i) a
for
loop, ii) a range-based
for
loop and iii) iterators. To my surprise, I found that while all three manual implementations take roughly the same time,
std::none_of
is significantly faster. My question is - why is this the case?

I used the Google benchmark library and compiled with
-std=c++14 -O3
. When running the test, I restricted the affinity of the process to a single processor. I get the following result using GCC 6.2:

Benchmark Time CPU Iterations
--------------------------------------------------------
benchmarkSTL 28813 ns 28780 ns 24283
benchmarkManual 46203 ns 46191 ns 15063
benchmarkRange 48368 ns 48243 ns 16245
benchmarkIterator 44732 ns 44710 ns 15698


On Clang 3.9,
std::none_of
is also faster than the manual
for
loop though the speed difference is smaller. Here is the test code (only including the manual for loop for brevity):

#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <functional>
#include <random>

const size_t N = 100000;
const unsigned value = 31415926;

template<size_t N>
std::array<unsigned, N> generateData() {
std::mt19937 randomEngine(0);
std::array<unsigned, N> data;
std::generate(data.begin(), data.end(), randomEngine);
return data;
}

void benchmarkSTL(benchmark::State & state) {
auto data = generateData<N>();
while (state.KeepRunning()) {
bool result = std::none_of(
data.begin(),
data.end(),
std::bind(std::equal_to<unsigned>(), std::placeholders::_1, value));
assert(result);
}
}

void benchmarkManual(benchmark::State & state) {
auto data = generateData<N>();
while (state.KeepRunning()) {
bool result = true;
for (size_t i = 0; i < N; i++) {
if (data[i] == value) {
result = false;
break;
}
}
assert(result);
}
}

BENCHMARK(benchmarkSTL);
BENCHMARK(benchmarkManual);

BENCHMARK_MAIN();


Note that generating the data using a random number generator is irrelevant. I get the same result when just setting the
i
-th element to
i
and checking if the value
N + 1
is contained.

Answer

After some more investigation, I will try to answer my own question. As suggested by Kerrek SB, I looked at the generated assembly code. The bottom line seems to be that GCC 6.2 does a much better job at unrolling the loop implicit in std::none_of compared to the other three versions.

GCC 6.2:

  • std::none_of is unrolled 4 times -> ~30µs
  • manual for, range for and iterator are not being unrolled at all -> ~45µs

As suggested by Corristo, the result is compiler dependend - which makes perfect sense. Clang 3.9 unrolls all but the range for loop, though to varying degrees.

Clang 3.9

  • `std::none_of' is unrolled 8 times -> ~30µs
  • manual for is unrolled 5 times -> ~35µs
  • range for is not being unrolled at all -> ~60µs
  • iterator is unrolled 8 times -> ~28µs

All code was compiled with -std=c++14 -O3.