Jean Senellart Jean Senellart - 2 months ago 10
C++ Question

sparse x dense matrix multiplication performance under-efficient

Context: I am using Eigen for Artificial Neural Network where the typical dimensions are around 1000 nodes per layer. So most of the operations are to multiplying matrix

M
of size ~(1000,1000) with a vector of size 1000 or a batch of B vectors, which are represented as matrices of size Bx1000.

After training a neural network, I am using pruning - which is a common compression technique which ends up with sparse matrix (density of non empty parameters between 10 and 50%).

Goal: I would like to use sparse matrix for compression purpose and secondarily for performance optimization but it is not the main goal

Issue:
I am comparing performance of sparse and dense matrix multiplication (only multiplication time is computed) for different batch sizes and I am observing the following (Using Eigen 3.2.8, MacBook Pro 64bits, without open_mp, and using standard g++):


  • when B=1 (Matrix x Vector) - sparse matrix operations with density 10% or 30% is more efficient than dense matrix operations - which seems expected result: far less operations are performed

  • for B=32:


    • the time needed for dense matrix operation is only ~10 times the time need for B=1 - which is cool - does it shows some vectorization effect?

    • the time needed for sparse matrix operation is 67 times the time needed for B=1 - which means that it is less efficient than processing the 32 vectors independently




MxN multiplication time (ms) for M sparse/dense, and N of size 1000xB

Same numbers but showing the time per vector in a batch of different size for sparse and dense matrix. We see clearly the decrease of time for dense matrix when batch size increase, and the augmentation for sparse matrix showing some wrong. Normalized with time for B=1

Code:
I am using the following types for sparse and dense matrices:

typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;


the operation I am benchmarking is the following:

o.noalias()=m*in.transpose();


where
o
is a dense matrix (1000xB),
m
is either a dense matrix (1000x1000) or the corresponding sparse matrix obtained with
m.sparseView()
, and
in
is a dense matrix (Bx1000)

A full code is below (averaging time for 20 different random matrices, and running each multiplication 50 times) - time for B=32 and B=1 are below.

Any feedback/intuition is welcome!




batch 1 ratio 0.3 dense 0.32 sparse 0.29
batch 32 ratio 0.3 dense 2.75 sparse 15.01





#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <stdlib.h>
#include <boost/timer/timer.hpp>

using namespace Eigen;
using namespace boost::timer;

typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;

void bench_Sparse(const spMatFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
o.noalias()=m*in.transpose();
}

void bench_Dense(const deMatRowFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
o.noalias()=m*in.transpose();
}

int main(int argc, const char **argv) {
float ratio=0.3;
int iter=20;
int batch=32;
float t_dense=0;
float t_sparse=0;

deMatRowFloat d_o1(batch,1000);
deMatRowFloat d_o2(batch,1000);
for(int k=0; k<iter; k++) {
deMatRowFloat d_m=deMatRowFloat::Zero(1000,1000);
deMatRowFloat d_b=deMatRowFloat::Random(batch,1000);
for(int h=0;h<ratio*1000000;h++) {
int i=rand()%1000;
int j=rand()%1000;
d_m(i,j)=(rand()%1000)/500.-1;
}
spMatFloat s_m=d_m.sparseView();
{
cpu_timer timer;
for(int k=0;k<50;k++) bench_Dense(d_m,d_b,d_o1);
cpu_times const elapsed_times(timer.elapsed());
nanosecond_type const elapsed(elapsed_times.system+elapsed_times.user);
t_dense+=elapsed/1000000.;
}
{
cpu_timer timer;
for(int k=0;k<50;k++) bench_Sparse(s_m,d_b,d_o2);
cpu_times const elapsed_times(timer.elapsed());
nanosecond_type const elapsed(elapsed_times.system+elapsed_times.user);
t_sparse+=elapsed/1000000.;
}
}
std::cout<<"batch\t"<<batch<<"\tratio\t"<<ratio<<"\tdense\t"<<t_dense/50/iter<<"\tsparse\t"<<t_sparse/50/iter<<std::endl;
}


New Results after ggael suggestion: I tried the different possible combinations and found indeed huge differences of performance when changing
M
and
B
RowMajor/ColMajor.

To summarize I am interested in doing
M*B
where
M
is (1000,1000) and
B
is (1000,batch): I am interested in comparing performance of M sparse/dense and when batch is growing.

I tested 3 configurations:


  • M dense, B dense

  • M sparse, B dense

  • M sparse, B dense, but the multiplication of M*B is done manually column by column



results are as following - where the number is the ratio time per column for B=32/time for B=1 with matrix M with density 0.3:

here

The initial reported problem was the worse case (M ColMajor, B RowMajor). For (M RowMajor, B ColMajor), there is a 5 times speedup between B=32 and B=1 and performance of sparse matrix is almost equivalent to dense matrix.

Answer

In Eigen, for dense algebra, both matrix-vector and matrix-matrix products are highly optimized and take full advantage of vectorization. As you observed, matrix-matrix products exhibit a much higher efficiency. This is because matrix-matrix products can further optimized by increasing the ratio between the number of arithmetic operations and memory accesses, and by exploiting memory caches.

Then regarding sparse-dense products, there are two strategies:

  1. Process the dense right hand side one column at once, and thus scan the sparse matrix multiple times. For this strategy, better use a column-major storage for the dense matrices (right-hand side and result). In Eigen 3.2, this strategy has be emulated by scanning the columns manually.
  2. Scan the sparse matrix only once, and process the rows of the dense right hand side and results in the most nested loop. This is default strategy in Eigen 3.2. In this case, better use a row-major storage for the dense matrices (Matrix<float,Dynamic,32,RowMajor>).

Finally, in either case, you could try with both a row-major and column-major storage for the sparse matrix, and figure out which combination of strategy and storage order of the sparse matrix works best in your case.