I'm trying to multiply a 3x3 matrix with each 3x3 block of a 360x360 matrix: the small matrix is applied to the first 3x3 block of the big matrix, then the next, and so on. In other words, I want to keep my smaller matrix constant and slide it over the bigger matrix.
Is it possible to store my smaller matrix in shared memory and have my bigger matrix divided into 3x3 tiles in global memory?
I can't find a way to copy the smaller matrix into shared memory directly from the host. Please correct me if my mental model of CUDA is wrong.
It is not possible to populate shared memory from the host.
However, the best way to handle constants used by all threads, such as the 3x3 matrix from your example, is to put them in constant memory (which is 64 KB in size). There are two ways of using constant memory:

- Declare a `struct` containing your kernel arguments, including the 3x3 matrix, and pass it to your kernel by value.
- Declare a variable with the `__constant__` type qualifier and use `cudaMemcpyToSymbol` to populate it from the host.
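Here is a minimal sketch of the second approach applied to your sliding 3x3 multiplication. The kernel name, tile layout, and the identity filter values are my own assumptions for illustration; it assumes the 360x360 matrix is stored row-major and that its side is divisible by 3, with each thread computing the product of the constant 3x3 matrix and one 3x3 tile:

```cuda
#include <cuda_runtime.h>

// The constant 3x3 matrix, visible to all threads. Constant memory is
// cached and broadcast efficiently when all threads read the same element.
__constant__ float d_filter[3][3];

// Each thread multiplies the constant 3x3 matrix with one 3x3 tile of the
// n x n input (n assumed divisible by 3) and writes the 3x3 result tile.
__global__ void blockMultiply(const float* in, float* out, int n)
{
    int tx = (blockIdx.x * blockDim.x + threadIdx.x) * 3; // tile column offset
    int ty = (blockIdx.y * blockDim.y + threadIdx.y) * 3; // tile row offset
    if (tx >= n || ty >= n) return;

    for (int i = 0; i < 3; ++i)        // row of the result tile
        for (int j = 0; j < 3; ++j) {  // column of the result tile
            float acc = 0.0f;
            for (int k = 0; k < 3; ++k)
                acc += d_filter[i][k] * in[(ty + k) * n + (tx + j)];
            out[(ty + i) * n + (tx + j)] = acc;
        }
}

int main()
{
    const int n = 360;
    // Identity matrix here, just so the output equals the input for a check.
    float h_filter[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
    cudaMemcpyToSymbol(d_filter, h_filter, sizeof(h_filter));

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * n * sizeof(float));
    cudaMalloc(&d_out, n * n * sizeof(float));
    // ... copy your 360x360 data into d_in with cudaMemcpy ...

    // One thread per 3x3 tile: a 120x120 grid of tiles in total.
    dim3 threads(16, 16);
    dim3 blocks((n / 3 + threads.x - 1) / threads.x,
                (n / 3 + threads.y - 1) / threads.y);
    blockMultiply<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Note that the filter never needs to go through shared memory at all: once it is in constant memory, every thread can read it directly, and the hardware's constant cache handles the broadcast.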