I was wondering if there is any simple way to transform a non-coalesced memory access into a coalesced one. Let's take the example of this array:
You can try using shared memory. That might work, or it might not; it's hard to tell without an example.
For instance, say the first operation accesses the data in a coalesced way and the second one strides a lot; this may speed things up:
    __shared__ int shared[BLOCK_SIZE];

    // Load data global -> shared with coalesced accesses; you may need to
    // load a bit more before/after depending on your application
    shared[tid] = global[some id];
    __syncthreads();

    // Do the math with the coalesced access
    function0(shared[tid]);

    // Do the math with the non-coalesced access
    function1(shared[tid +/- 1 or whatever]);
The idea is to load the data into shared memory in a coalesced manner and then do the math from shared memory, since coalescing does not matter for shared memory accesses (bank conflicts do, on the other hand, but that is usually fine).
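To make the hint a bit more concrete, here is a minimal sketch of the same pattern as a complete kernel. Everything specific in it is an assumption on my side, since the original access pattern was not posted: the 3-point sum stands in for function1, the names stencil3 and run_stencil3 are made up, BLOCK_SIZE is fixed at 256, out-of-range neighbours are treated as 0, and CUDA error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_SIZE = 256;  // assumed block size, must match the launch

// Hypothetical stand-in for the real computation:
// out[i] = in[i-1] + in[i] + in[i+1], out-of-range neighbours count as 0.
__global__ void stencil3(const int* in, int* out, int n)
{
    // One extra cell on each side (halo) so lid-1 and lid+1 stay in the tile.
    __shared__ int tile[BLOCK_SIZE + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // index inside the tile

    // Coalesced load: consecutive threads read consecutive global addresses.
    tile[lid] = (gid < n) ? in[gid] : 0;

    // The first and last thread of the block also fetch the halo cells.
    if (threadIdx.x == 0)
        tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0;

    // Every thread reaches the barrier (no early return above).
    __syncthreads();

    // Neighbour accesses now hit shared memory instead of global memory.
    if (gid < n)
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

// Host helper: copies the input to the device, runs the kernel, copies back.
std::vector<int> run_stencil3(const std::vector<int>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    if (n == 0) return h_out;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    stencil3<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With 32-bit elements and a stride of 1 like this, shared memory accesses should not produce bank conflicts; with larger strides you may need to pad the tile. Whether this gains anything over reading global memory directly depends on your real access pattern, so measure it.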
You'll have to give us more information if you want more accurate help. This is just a hint.