John X. - 2 months ago
C++ Question

Eigen matrix multiplication slower than cblas?

I use the following code to test Eigen performance:

#include <iostream>
#include <chrono>
#include <eigen3/Eigen/Dense>
#include <cblas.h>
using namespace std;
using namespace std::chrono;

int main() {
    int n = 3000;

    high_resolution_clock::time_point t1, t2;

    Eigen::MatrixXd A(n, n), B(n, n), C(n, n);
    A.setRandom();  // initialize inputs so we don't read indeterminate memory
    B.setRandom();

    t1 = high_resolution_clock::now();
    C = A * B;
    t2 = high_resolution_clock::now();
    auto dur = duration_cast<milliseconds>(t2 - t1);
    cout << "eigen: " << dur.count() << endl;

    t1 = high_resolution_clock::now();
    // C = 1.0 * A * B + 0.0 * C
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n,
                0.0, C.data(), n);
    t2 = high_resolution_clock::now();
    dur = duration_cast<milliseconds>(t2 - t1);
    cout << "cblas: " << dur.count() << endl;

    return 0;
}
I compile it with the following command:

g++ test.cpp -O3 -fopenmp -lblas -std=c++11 -o test

The results are:

eigen: 1422 ms

cblas: 432 ms

Am I doing something wrong? According to their benchmarks, Eigen should be faster.

Another problem is that using numpy I get 24 ms:

import time
import numpy as np

a = np.random.random((3000, 3000))
b = np.random.random((3000, 3000))
start = time.time()
c = a * b
print("time: ", time.time() - start)


Saying that you are using cblas provides very little information, because cblas is just an API. The underlying BLAS library could be netlib's BLAS, OpenBLAS, ATLAS, Intel MKL, Apple's Accelerate, or even EigenBlas... Given your measurements, it is pretty obvious that your underlying BLAS is a highly optimized one exploiting AVX + FMA + multi-threading. For a fair comparison, you must also enable those features on Eigen's side by compiling with -march=native -fopenmp, and make sure you are using Eigen 3.3. Then the performance should be about the same.
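Concretely, assuming the file is named test.cpp as in the question, the build line would then look something like this (a sketch; -march=native requires a CPU and compiler that support the relevant instruction sets):

```shell
# -march=native lets Eigen use AVX/FMA; -fopenmp enables its multi-threaded GEMM
g++ test.cpp -O3 -march=native -fopenmp -lblas -std=c++11 -o test
```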

Regarding numpy, Warren Weckesser already solved the issue: a * b is an element-wise product, not a matrix product (that would be a @ b or np.dot(a, b)). You could have figured this out yourself: performing 2*3000^3 = 54e9 floating point operations in 24 ms on a standard computer is impossible.
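To make the numpy pitfall concrete, here is a minimal sketch contrasting the two operators (array names mirror the question's script, shrunk to 3x3):

```python
import numpy as np

a = np.random.random((3, 3))
b = np.random.random((3, 3))

hadamard = a * b   # element-wise (Hadamard) product: only n^2 multiplications
matmul = a @ b     # true matrix product: about 2*n^3 flops

# each entry of a * b is just the product of the matching entries,
# whereas each entry of a @ b is a dot product of a row with a column
print(hadamard[0, 0] == a[0, 0] * b[0, 0])
print(np.isclose(matmul[0, 0], a[0] @ b[:, 0]))
```

That is why the 3000x3000 "multiplication" finished in 24 ms: it did roughly 9e6 multiplications, not 54e9 flops.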