MasterID - 11 months ago 72
C++ Question

# Dense Matrix-vector multiplication in VexCL

VexCL seems to be a very attractive library for gpu programming.

Unfortunately, its a very young library and there are few information over there.
I've been searching how to execute a matrix-vector multiplication, but the only matrix representation I've found is vex::SpMat, which holds a sparse matrix.

If matrix is dense, then the sparse representations are less efficient for computations, usually.

All my matrix are dense and I want to know how to execute this efficiently in VexCL.

I am the developer of VexCL library.

I have to admit dense linear algebra operations are not on my priority list. I believe it is very hard to implement those in a way that would be performance-portable across various devices supported by VexCL (that is, by OpenCL/CUDA). This task is probably best left to the vendor BLAS implementations (but patches are welcome!).

You may also want to look at the opensource ViennaCL library, which does provide dense matrix operations and supports OpenCL, CUDA, and OpenMP backends. Their autotuning framework allows them to get portable performance which is close to vendor-tuned libraries.

Having said that, you have a couple of options (aside from providing a custom kernel) for the dense matrix - vector product in VexCL. First, you may use direct implementation based on definition of matrix-vector product:

using namespace vex;
Context ctx(Filter::Env && Filter::Count(1));

// The n x m matrix stored row-wise.
vector<double> A(ctx, n * m);
// The LHS and RHS vectors.
vector<double> x(ctx, m);
vector<double> y(ctx, n);

// Make an n x m matrix from vector x by replicating it along the first
// dimension (reshape), multiply it elementwise by A, and reduce the result
// along the second dimension.
// In other words, y_i = sum_j (A_ij * x_j)
y = reduce<SUM>(
extents[n][m],  // Shape of the expression to reduce,
A * reshape(
x,
extents[n][m], // (We need an n x m matrix...
extents[1]     // ... but we only have vector of size m).
),          // the expression,
1               // and the dimension to reduce along.
);


With C++14 this could be easily hidden away into a function call:

template <class M, class V>
auto prod(size_t n, size_t m, M &&A, V &&x) {
using namespace vex;
auto NxM = extents[n][m];
return reduce<SUM>(NxM, A * reshape(x, NxM, extents[1]), 1);
}


Second, you may just use vendor specific library. For example, if you use CUDA backend with VexCL, you could get raw pointers to VexCL-allocated memory regions and call cuBLAS gemv:

double one  = 1;
double zero = 0;
cublasDgemv(
cublas_handle, CUBPLAS_OP_N, n, m,
&zero,
A(0).raw_ptr(), m,
x(0).raw_ptr(), 1
&one,
y(0).raw_ptr(), 1
);


The first approach should be less effective than a call to cuBLAS. Its advantage is that the result of reduce() call is a vector expression and you could in principle combine several of those into a single fused compute kernel. For example, you could compute Ax + By, or sin(Ax) + cos(By), or (A + B)(x - y), or any other vector expression in a single kernel call:

z = prod(n, m, A, x) + prod(n, m, B, y);
z = sin(prod(n, m, A, x)) + cos(prod(n, m, B, y));
z = prod(n, m, A + B, x - y);


This could be more effective than several chained cuBLAS calls. I have examples where VexCL outperforms cuBLAS by a factor of 1.5 due to this.