I am trying to implement this CUDA example:
Because I have 0x4000 bytes available, I tried to use
TILE_DIM = 128
__shared__ unsigned char tile[TILE_DIM][TILE_DIM];
but compilation fails with:
CUDACOMPILE : ptxas error : Entry function '_Z18transposeCoalescedPh' uses too much shared data (0x4018 bytes, 0x4000 max)
So I have 0x18 (24) extra bytes in shared memory. Where do they come from, and is it possible to remove them?
Referring to the programming guide:
The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory, the amount of dynamically allocated shared memory, and for devices of compute capability 1.x, the amount of shared memory used to pass the kernel's arguments (see
As long as you compile for a cc1.x architecture, you won't be able to eliminate the use of shared memory to carry kernel parameters.
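To make the arithmetic explicit, here is a small host-side sketch (C++11, no device code needed). The tile size, the 0x4018 figure, and the 0x4000 limit come straight from the question and the ptxas message. The split of the remaining 0x18 bytes into an 8-byte pointer argument plus 16 bytes for built-in variables is my assumption about a 64-bit cc1.x build; the error message itself does not state it.

```cuda
#include <cstddef>

// Values taken from the question and the ptxas error message.
constexpr std::size_t TILE_DIM     = 128;
constexpr std::size_t kTileBytes   = TILE_DIM * TILE_DIM * sizeof(unsigned char);
constexpr std::size_t kReportedUse = 0x4018;  // "uses too much shared data"
constexpr std::size_t kLimit       = 0x4000;  // per-block maximum reported by ptxas

// The tile alone already consumes the entire shared memory budget.
static_assert(kTileBytes == kLimit, "tile fills the whole 0x4000-byte budget");

// Whatever ptxas counts on top of the tile is the overhead asked about.
constexpr std::size_t overheadBytes() { return kReportedUse - kTileBytes; }
static_assert(overheadBytes() == 0x18, "24 bytes beyond the tile");

// ASSUMPTION (not stated in the error message): on a 64-bit cc1.x build the
// single unsigned char* kernel argument takes 8 bytes, and 16 more bytes are
// reserved for built-in variables such as blockIdx, blockDim and gridDim.
constexpr std::size_t kAssumedArgBytes     = 8;
constexpr std::size_t kAssumedBuiltinBytes = 16;
static_assert(kAssumedArgBytes + kAssumedBuiltinBytes == overheadBytes(),
              "assumed split adds up to the reported overhead");
```

Whatever the exact split is, the point stands: with the tile at exactly 0x4000 bytes there is no room left on cc1.x for anything else that has to live in shared memory.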
I think the solution, as you've already indicated, is to compile for a cc2.0 or cc3.0 architecture. It's not clear why you wouldn't want to do this.
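As a rough sketch of what that could look like: the kernel below is hypothetical (I only know its signature from the mangled name in the error, transposeCoalesced(unsigned char*)), and BLOCK_ROWS, the in-place single-tile transpose, and the helper reportSharedUsage() are my own assumptions for illustration. Build it with something like nvcc -arch=sm_20 transpose.cu (the file name is made up; use whichever cc2.0-or-later architecture your toolkit and GPU actually support).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM   128
#define BLOCK_ROWS 8   // assumed launch shape: dim3(TILE_DIM, BLOCK_ROWS)

// Hypothetical kernel: transposes one TILE_DIM x TILE_DIM byte matrix in place.
// Expects blockDim.x == TILE_DIM and blockDim.y == BLOCK_ROWS.
__global__ void transposeCoalesced(unsigned char *data)
{
    __shared__ unsigned char tile[TILE_DIM][TILE_DIM];  // 0x4000 bytes

    const int x = threadIdx.x;

    // Stage the whole matrix in shared memory, reading rows contiguously.
    for (int j = threadIdx.y; j < TILE_DIM; j += BLOCK_ROWS)
        tile[j][x] = data[j * TILE_DIM + x];

    __syncthreads();  // all reads are done before anything is overwritten

    // Write rows contiguously again, pulling the transposed element from the tile.
    for (int j = threadIdx.y; j < TILE_DIM; j += BLOCK_ROWS)
        data[j * TILE_DIM + x] = tile[x][j];
}

// Prints the kernel's static shared memory use next to the device limit.
// Returns 0 if the kernel fits on device 0, 1 otherwise or on any CUDA error.
int reportSharedUsage()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, transposeCoalesced);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("static shared memory: 0x%zx bytes\n", attr.sharedSizeBytes);
    printf("per-block limit:      0x%zx bytes\n", prop.sharedMemPerBlock);
    return attr.sharedSizeBytes <= prop.sharedMemPerBlock ? 0 : 1;
}
```

Call reportSharedUsage() from your main() to see how much shared memory the compiled kernel claims. A launch would look like transposeCoalesced<<<1, dim3(TILE_DIM, BLOCK_ROWS)>>>(d_data), where d_data is a device buffer of TILE_DIM * TILE_DIM bytes that you allocate yourself. Note that this launch shape uses 1024 threads per block, which also needs cc2.0 or later.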