jdtournier - 3 months ago 26

C++ Question

We're just in the process of porting our codebase over to Eigen 3.3 (quite an undertaking with all the 32-byte alignment issues). However, there are a few places where performance seems to have been badly affected, contrary to expectations (I was looking forward to some speedup given the extra support for FMA and AVX...). These include eigenvalue decomposition, and matrix*matrix.transpose()*vector products. I've written two minimal working examples to demonstrate.

All tests run on an up to date Arch Linux system, using an Intel Core i7-4930K CPU (3.40GHz), and compiled with g++ version 6.2.1.

A straightforward self-adjoint eigenvalue decomposition takes twice as long with Eigen 3.3.0 as it does with 3.2.10.

File `test_eigen_EVD.cpp`:

```
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <Eigen/Eigenvalues>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  SelfAdjointEigenSolver<MatrixXf> eig;

  for (int n = 0; n < 1000; ++n)
    eig.compute (mat);

  return 0;
}
```

Test results:

- eigen-3.2.10:

```
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD

real 0m5.136s
user 0m5.133s
sys 0m0.000s
```

- eigen-3.3.0:

```
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD

real 0m11.008s
user 0m11.007s
sys 0m0.000s
```

Not sure what might be causing this, but if anyone can see a way of maintaining performance with Eigen 3.3, I'd like to know about it!

The second example, a `matrix*matrix.transpose()*vector` product, takes a whopping 200× longer with Eigen 3.3.0...

File `test_eigen_products.cpp`:

```
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  VectorXf vec = VectorXf::Random(SIZE);

  for (int n = 0; n < 50; ++n)
    vec = mat * mat.transpose() * VectorXf::Random(SIZE);

  return vec[0] == 0.0;
}
```

Test results:

- eigen-3.2.10:

```
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products

real 0m0.040s
user 0m0.037s
sys 0m0.000s
```

- eigen-3.3.0:

```
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products

real 0m8.112s
user 0m7.700s
sys 0m0.410s
```

Adding brackets to the line in the loop like this:

`vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) );`

makes a huge difference, with both Eigen versions then performing equally well (actually 3.3.0 is slightly better), and faster than the unbracketed 3.2.10 case. So there is a fix. Still, it's odd that 3.3.0 would struggle so much with this.
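A rough operation count suggests why the brackets matter so much. This is only a sketch: it counts scalar multiply-adds for naive dense products with N = SIZE = 200, and ignores blocking, vectorisation and memory effects, all of which change the constants in a real library.

```latex
\begin{aligned}
(A A^{\mathsf T})\,v &: \quad N^3 + N^2 = 8\,000\,000 + 40\,000 = 8\,040\,000 \\
A\,(A^{\mathsf T} v) &: \quad 2N^2 = 80\,000
\end{aligned}
```

Under these assumptions the unbracketed form does roughly 100 times more arithmetic per iteration whenever the `matrix*matrix` product is actually recomputed inside the loop.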

I don't know whether this is a bug, but I guess it's worth reporting in case this is something that needs to be fixed. Or maybe I was just doing it wrong...

Any thoughts appreciated.

Cheers,

Donald.

As pointed out by ggael, the EVD in Eigen 3.3 is faster if compiled using `clang++`, or with `-O3` when using `g++`.

Problem 2 (the product) isn't really a problem since I can just add brackets to force the most efficient order of operations. But just for completeness: there does seem to be a flaw somewhere in the evaluation of these operations. Eigen is an incredible piece of software, and I think this probably deserves to be fixed. Here's a modified version of the MWE, just to show that it's unlikely to be related to the first temporary product being taken out of the loop (at least as far as I can tell):

```
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <iostream>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  VectorXf vec (SIZE), vecsum (SIZE);
  MatrixXf mat (SIZE,SIZE);
  vecsum.setZero();  // zero the accumulator before summing into it

  for (int n = 0; n < 50; ++n) {
    mat = MatrixXf::Random(SIZE,SIZE);
    vec = VectorXf::Random(SIZE);
    vecsum += mat * mat.transpose() * VectorXf::Random(SIZE);
  }

  std::cout << vecsum.norm() << std::endl;
  return 0;
}
```

In this example, the operands are all initialised within the loop, and the results are accumulated in `vecsum`. Compiled with `clang++ -O3`:

```
$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82

real 0m0.060s
user 0m0.057s
sys 0m0.000s

$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82

real 0m4.225s
user 0m3.873s
sys 0m0.350s
```

So same result, but vastly different execution times. Thankfully, this is easily resolved by placing brackets in the right places, but there does seem to be a regression somewhere in Eigen 3.3's evaluation of operations. With brackets around the `mat.transpose() * VectorXf::Random(SIZE)` term, the slowdown disappears.

In the meantime, I'll accept ggael's answer, it's all I needed to know to move forward.

Answer

For the EVD, I cannot reproduce with clang. With gcc, you need `-O3` to avoid an inlining issue. Then, with both compilers, Eigen 3.3 delivers a 33% speedup.

Regarding the `matrix*matrix*vector` product, as you noticed you should really add the parentheses to perform two `matrix*vector` products instead of a big `matrix*matrix` product. The speed difference is then easily explained by the fact that in 3.2 the nested `matrix*matrix` product is evaluated immediately (at nesting time), whereas in 3.3 it is evaluated at evaluation time, that is in `operator=`. This means that in 3.2, the loop is equivalent to:

```
for (int n = 0; n < 50; ++n) {
  MatrixXf tmp = mat * mat.transpose();
  vec = tmp * VectorXf::Random(SIZE);
}
```

and thus the compiler can move `tmp` out of the loop. Production code should not rely on the compiler for this kind of task, and should instead explicitly move constant expressions outside loops.
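To make the two remedies concrete, here is a minimal library-free sketch in plain C++. It deliberately does not use Eigen: `Mat`, `Vec`, `matVec`, `matTransposeVec`, `gram`, `hoisted` and `bracketed` are hypothetical names introduced purely for illustration, and the naive loops stand in for Eigen's optimised kernels. `hoisted` computes the invariant `mat * mat.transpose()` product once before the loop, while `bracketed` performs two matrix-vector products per iteration and never forms the `matrix*matrix` product at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal dense types for illustration only (not Eigen's API).
using Vec = std::vector<float>;
struct Mat {
    std::size_t n;            // square n x n matrix, row-major storage
    std::vector<float> a;
    float operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
    float& operator()(std::size_t i, std::size_t j) { return a[i * n + j]; }
};

// y = M * x, about n^2 multiply-adds.
Vec matVec(const Mat& m, const Vec& x) {
    Vec y(m.n, 0.0f);
    for (std::size_t i = 0; i < m.n; ++i)
        for (std::size_t j = 0; j < m.n; ++j)
            y[i] += m(i, j) * x[j];
    return y;
}

// y = M^T * x, about n^2 multiply-adds, without forming the transpose.
Vec matTransposeVec(const Mat& m, const Vec& x) {
    Vec y(m.n, 0.0f);
    for (std::size_t i = 0; i < m.n; ++i)
        for (std::size_t j = 0; j < m.n; ++j)
            y[j] += m(i, j) * x[i];
    return y;
}

// G = M * M^T, about n^3 multiply-adds.
Mat gram(const Mat& m) {
    Mat g{m.n, std::vector<float>(m.n * m.n, 0.0f)};
    for (std::size_t i = 0; i < m.n; ++i)
        for (std::size_t j = 0; j < m.n; ++j)
            for (std::size_t k = 0; k < m.n; ++k)
                g(i, j) += m(i, k) * m(j, k);
    return g;
}

// Option 1: hoist the loop-invariant product out of the loop by hand.
Vec hoisted(const Mat& m, const std::vector<Vec>& inputs) {
    const Mat g = gram(m);    // computed once, not once per iteration
    Vec last;
    for (const Vec& x : inputs)
        last = matVec(g, x);
    return last;
}

// Option 2: bracket the expression as M * (M^T * x).
Vec bracketed(const Mat& m, const std::vector<Vec>& inputs) {
    Vec last;
    for (const Vec& x : inputs)
        last = matVec(m, matTransposeVec(m, x));
    return last;
}
```

Which option is preferable depends on the loop: if the matrix changes on every iteration (as in the modified MWE), only the bracketed form avoids the cubic cost, whereas for a fixed matrix and many iterations the hoisted form pays the cubic cost once and then needs a single matrix-vector product per iteration.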