LocalVolatility LocalVolatility - 5 months ago 23
C++ Question

Why is std::none_of faster than a hand rolled loop?

I benchmarked the performance of

against a three different manual implementations using i) a
loop, ii) a range-based
loop and iii) iterators. To my surprise, I found that while all three manual implementations take roughly the same time,
is significantly faster. My question is - why is this the case?

I used the Google benchmark library and compiled with
-std=c++14 -O3
. When running the test, I restricted the affinity of the process to a single processor. I get the following result using GCC 6.2:

Benchmark Time CPU Iterations
benchmarkSTL 28813 ns 28780 ns 24283
benchmarkManual 46203 ns 46191 ns 15063
benchmarkRange 48368 ns 48243 ns 16245
benchmarkIterator 44732 ns 44710 ns 15698

On Clang 3.9,
is also faster than the manual
loop though the speed difference is smaller. Here is the test code (only including the manual for loop for brevity):

#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <functional>
#include <random>

const size_t N = 100000;
const unsigned value = 31415926;

template<size_t N>
std::array<unsigned, N> generateData() {
std::mt19937 randomEngine(0);
std::array<unsigned, N> data;
std::generate(data.begin(), data.end(), randomEngine);
return data;

void benchmarkSTL(benchmark::State & state) {
auto data = generateData<N>();
while (state.KeepRunning()) {
bool result = std::none_of(
std::bind(std::equal_to<unsigned>(), std::placeholders::_1, value));

void benchmarkManual(benchmark::State & state) {
auto data = generateData<N>();
while (state.KeepRunning()) {
bool result = true;
for (size_t i = 0; i < N; i++) {
if (data[i] == value) {
result = false;



Note that generating the data using a random number generator is irrelevant. I get the same result when just setting the
-th element to
and checking if the value
N + 1
is contained.


After some more investigation, I will try to answer my own question. As suggested by Kerrek SB, I looked at the generated assembly code. The bottom line seems to be that GCC 6.2 does a much better job at unrolling the loop implicit in std::none_of compared to the other three versions.

GCC 6.2:

  • std::none_of is unrolled 4 times -> ~30µs
  • manual for, range for and iterator are not being unrolled at all -> ~45µs

As suggested by Corristo, the result is compiler dependend - which makes perfect sense. Clang 3.9 unrolls all but the range for loop, though to varying degrees.

Clang 3.9

  • `std::none_of' is unrolled 8 times -> ~30µs
  • manual for is unrolled 5 times -> ~35µs
  • range for is not being unrolled at all -> ~60µs
  • iterator is unrolled 8 times -> ~28µs

All code was compiled with -std=c++14 -O3.