Ivan Velichko Ivan Velichko - 3 months ago 12
C++ Question

When RVO shows the maximum performance impact?

I've spent a little time trying to understand is RVO performance impact is as valuable as I thought.

Here is my benchmark code (the main idea is to create big structures and return them from functions):

#include <chrono>
#include <iostream>
#include <stdlib.h>
#include <time.h>
#define SIZE_MEDIUM 128
#define SIZE_LARGE 8192
#define ITER_COUNT 100000
using namespace std;
using namespace std::chrono;

struct MediumStruct {
int data[SIZE_MEDIUM];
};
struct LargeStruct {
int data[SIZE_LARGE];
};

template<typename T>
T by_value(int size) {
T rv;
for (int i = 0; i < size; i++) {
rv.data[i] = rand();
}
return rv;
}

template<typename T>
void by_ref(T &v, int size) {
for (int i = 0; i < size; i++) {
v.data[i] = rand();
}
}

template<typename T>
void bench_by_value(int size, const string &suite) {
high_resolution_clock::time_point t1 = high_resolution_clock::now();

int counter = 0;
for (int i = 0; i < ITER_COUNT; i++) {
T r = by_value<T>(size);
for (int j = 0; j < size; j++) {
counter += r.data[j];
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "By value (" << suite << ") " << duration << " ms "
<< "[stub " << counter << "]" << endl;
}

template<typename T>
void bench_by_ref(int size, const string &suite) {
high_resolution_clock::time_point t1 = high_resolution_clock::now();

int counter = 0;
for (int i = 0; i < ITER_COUNT; i++) {
T r;
by_ref<T>(r, size);
for (int j = 0; j < size; j++) {
counter += r.data[j];
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "By ref (" << suite << ") " << duration << " ms "
<< "[stub " << counter << "]" << endl;
}

int main() {
srand(time(NULL));
bench_by_value<MediumStruct>(SIZE_MEDIUM, "MEDIUM");
bench_by_value<LargeStruct>(SIZE_LARGE, "LARGE");
bench_by_ref<MediumStruct>(SIZE_MEDIUM, "MEDIUM");
bench_by_ref<LargeStruct>(SIZE_LARGE, "LARGE");
}


On my MacBook Pro this benchmark shows pretty the same performance with and without RVO optimization.

$g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

$g++ -std=c++11 -S -o /dev/stdout main.cpp | grep memcpy | wc -l
0

$g++ -std=c++11 -fno-elide-constructors -S -o /dev/stdout main.cpp | grep memcpy | wc -l
4

$ g++ -std=c++11 main.cpp
$ ./a.out
By value (MEDIUM) 123 ms [stub -1593461165]
By value (LARGE) 7741 ms [stub -1299931242]
By ref (MEDIUM) 120 ms [stub -1955762550]
By ref (LARGE) 7835 ms [stub 911248215]
$ ./a.out
By value (MEDIUM) 118 ms [stub 2094615909]
By value (LARGE) 7840 ms [stub -1276864738]
By ref (MEDIUM) 118 ms [stub -223778153]
By ref (LARGE) 7890 ms [stub -381745773]

$ g++ -std=c++11 -fno-elide-constructors main.cpp
$ ./a.out
By value (MEDIUM) 122 ms [stub 1921645226]
By value (LARGE) 8078 ms [stub 1869896539]
By ref (MEDIUM) 123 ms [stub -676968691]
By ref (LARGE) 7855 ms [stub 1621698360]
$ ./a.out
By value (MEDIUM) 119 ms [stub 954834819]
By value (LARGE) 7779 ms [stub 98742842]
By ref (MEDIUM) 118 ms [stub 1498384025]
By ref (LARGE) 7505 ms [stub -118604845]


Having such results one might tend to always use returning by value technique since it looks better semantically even if a compiler doesn't provide RVO. When RVO really gives significant performance impact?

Link to the benchmark gist.

Answer

There are two things, you shall know about RVO:

1. RVO is not (regular) compiler optimization

Sounds crazy because last O stands for "optimization", but let me explain. Regular compiler optimizations preserve semantics of your code: at least the order of side effects and dependencies between them. RVO do not bother such things. Example:

#include <cstdio>
using namespace std;
struct foo {
  foo () { printf ("foo::foo()\n"); }
  foo (const foo&) { printf ("foo::foo( const foo& )\n"); }
  ~foo () { printf ("foo::~foo()\n"); }
};

foo bar() {
  foo local_foo;
  return local_foo;
}

int main() {
  foo f = bar();
  return 0;
}

What do you expect on screen? Without RVO:

$ g++ -O2 rvo.cc -fno-elide-constructors
$ ./a.out

foo::foo()
foo::foo( const foo& )
foo::~foo()
foo::foo( const foo& )
foo::~foo()
foo::~foo()

With RVO:

$ g++ -O2 rvo.cc
$ a.out

foo::foo()
foo::~foo()

No compiler ever will optimize user code in such way. But RVO is really more then optimization: it is prescription for compiler to be blind in some cases. Blind to user-defined semantics, I mean.

2. RVO is extremely effective

Your benchmarks just not exploit this because you do not do things that RVO elides. You do things that any normal compiler may optimize. But lets dance around return value copying:

class MyBuf {
  int x_;
  char *buf_;
public:
  MyBuf(int x) : x_(x), buf_(new char[x_]) {}
  MyBuf(const MyBuf &rhs) :
    x_(rhs.x_), buf_(new char[x_])
  {
    memcpy (buf_, rhs.buf_, x_);
  }
  ~MyBuf() { delete [] buf_; }
  char &operator[](int x) { return buf_[x]; }
};

Now:

MyBuf
retvec()
{
  MyBuf v(10000);
  v[0] = 1;
  return v;
}

And wow:

for (idx = 0; idx != 1000000; ++idx)
  {
    MyBuf v = retvec();
    cnt += v[0];
  }
fprintf (stderr, "cnt is %d\n", cnt);

You shall see:

$ g++ --std=c++11 -O2 rvo-spec.cc -fno-elide-constructors
$ time ./a.out 
cnt is 1000000

real    0m1.284s
user    0m1.280s
sys 0m0.000s

$ g++ --std=c++11 -O2 rvo-spec.cc
$ time ./a.out 
cnt is 1000000

real    0m0.099s
user    0m0.096s
sys 0m0.000s

More then 12 times difference for very simple scenario.

I hope I created some feeling of RVO and why you shall never switch it off. Good luck with it!