Aliya Clark - 2 months ago 28

C++ Question

I am new to CUDA and going through the CUDA toolkit documentation. There I found a example where matrix multiplication is using shared memory. Here when copying the Matrix structure from host memory to device memory only the data elements are copied. What I can't understand is how other variables are copied to device memory.

Matrix structure is as follows

`typedef struct {`

int width;

int height;

int stride;

float* elements;

} Matrix;

Then here is the code sample where data transfer happens

`void MatMul(const Matrix A, const Matrix B, Matrix C)`

{

// Load A and B to device memory

Matrix d_A;

d_A.width = d_A.stride = A.width; d_A.height = A.height;

size_t size = A.width * A.height * sizeof(float);

cudaMalloc(&d_A.elements, size);

cudaMemcpy(d_A.elements, A.elements, size,

cudaMemcpyHostToDevice);

Matrix d_B;

d_B.width = d_B.stride = B.width; d_B.height = B.height;

size = B.width * B.height * sizeof(float);

cudaMalloc(&d_B.elements, size);

cudaMemcpy(d_B.elements, B.elements, size,

cudaMemcpyHostToDevice);

// Allocate C in device memory

Matrix d_C;

d_C.width = d_C.stride = C.width; d_C.height = C.height;

size = C.width * C.height * sizeof(float);

cudaMalloc(&d_C.elements, size);

// Invoke kernel

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);

MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// Read C from device memory

cudaMemcpy(C.elements, d_C.elements, size,

cudaMemcpyDeviceToHost);

// Free device memory

cudaFree(d_A.elements);

cudaFree(d_B.elements);

cudaFree(d_C.elements);

}

Here what I don't understand is how width,stride and height is copied to device memory. Because here cudaMalloc and cudaMemcpy is only for the elements. Is there something I have missed in understanding this.

The kernal codes

`__device__ float GetElement(const Matrix A, int row, int col)`

{

return A.elements[row * A.stride + col];

}

// Set a matrix element

__device__ void SetElement(Matrix A, int row, int col,

float value)

{

A.elements[row * A.stride + col] = value;

}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is

// located col sub-matrices to the right and row sub-matrices down

// from the upper-left corner of A

__device__ Matrix GetSubMatrix(Matrix A, int row, int col)

{

Matrix Asub;

Asub.width = BLOCK_SIZE;

Asub.height = BLOCK_SIZE;

Asub.stride = A.stride;

Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row

+ BLOCK_SIZE * col];

return Asub;

}

Matrix multiplication kernal code

`__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)`

{

// Block row and column

int blockRow = blockIdx.y;

int blockCol = blockIdx.x;

// Each thread block computes one sub-matrix Csub of C

Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

// Each thread computes one element of Csub

// by accumulating results into Cvalue

float Cvalue = 0;

// Thread row and column within Csub

int row = threadIdx.y;

int col = threadIdx.x;

// Loop over all the sub-matrices of A and B that are

// required to compute Csub

// Multiply each pair of sub-matrices together

// and accumulate the results

for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

// Get sub-matrix Asub of A

Matrix Asub = GetSubMatrix(A, blockRow, m);

// Get sub-matrix Bsub of B

Matrix Bsub = GetSubMatrix(B, m, blockCol);

// Shared memory used to store Asub and Bsub respectively

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load Asub and Bsub from device memory to shared memory

// Each thread loads one element of each sub-matrix

As[row][col] = GetElement(Asub, row, col);

Bs[row][col] = GetElement(Bsub, row, col);

// Synchronize to make sure the sub-matrices are loaded

// before starting the computation

__syncthreads();

// Multiply Asub and Bsub together

for (int e = 0; e < BLOCK_SIZE; ++e)

Cvalue += As[row][e] * Bs[e][col];

// Synchronize to make sure that the preceding

// computation is done before loading two new

// sub-matrices of A and B in the next iteration

__syncthreads();

}

// Write Csub to device memory

// Each thread writes one element

SetElement(Csub, row, col, Cvalue);

}

Answer

For those who wonder, the sample code we talk about is here on Nvidia's CUDA toolkit documentation, in the shared memory topic : CUDA C Programming guide : Shared memory

So, why does this sample work ? Yes, only "elements" array is sent on device side by using cudaMalloc and cudaMemcpy functions. Yes, the matrix dimensions are used within the Kernel on device side without being explicitly copied to device memory with cudaMemcpy.

You need to consider arrays and parameters NOT in the same way. Let me explain how these values are sent to the Kernel.

- We declare the matrix on CPU side, all members are uninitialized
- We assign the dimensions, the pointer is still uninitialized
- We allocate and copy memory on device side with API functions, the pointer is initialized but it targets device memory and cannot be used like a regular host array
- We give the matrix as parameter to the kernel. NOT by pointer, but by value.

And that's the trick. The complete structure instance is given as a parameter, and it contains :

- Three integers, the dimensions of the matrix
- A pointer to the array that contains the matrix data

Giving integers as parameters in the Kernel launch is obviously possible and it works fine. Giving the pointer to an array is possible too : the pointer is copied, meaning we create another pointer pointing to the same zone in memory. If the array we target was on host memory, it would cause errors, but as it has been initialized on device side with API functions, it works fine.