I'm trying to very rapidly convert packed 24bpp RGB images to packed 32bpp RGBA. I've tried using
kernel void rgb24_to_rgba32(texture2d<half, access::read> inTexture [[texture(0)]],
texture2d<half, access::write> outTexture [[texture(1)]],
uint2 id [[ thread_position_in_grid ]])
uint2 srcAddr1 = uint2(id.x * 3, id.y);
uint2 srcAddr2 = uint2(id.x * 3 + 1, id.y);
uint2 srcAddr3 = uint2(id.x * 3 + 2, id.y);
outTexture.write(half4(inTexture.read(srcAddr1).r, inTexture.read(srcAddr2).r, inTexture.read(srcAddr3).r, 1), id);
Just for the sake of closure, here is Apple's Developer Relations response to this question. The bottom line is that the GPU just doesn't offer any real advantages in this case because this conversion is not a computationally heavy operation.
After discussions with engineering, and evaluating more sample implementations, the verdict is out on Metal v.s. Accelerate performance for converting packed 24bpp RGB images to packed 32bpp RGBA images: on newer devices you can get close to the same performance using Metal but Accelerate will be faster for this operation. “vImage is an extremely well-tuned implementation and since this conversion operation is not compute heavy the best we can do is to be at parity.”
The proposed reasoning behind this is data locality and efficiently operating on multiple pixels at a time (something you’ve mentioned). The fastest Metal implementation tested processed two pixels per thread and still lagged behind
There was an “optimized” implementation using Metal buffers rather than textures (something else that you’d mentioned exploring) and surprisingly this approach was slightly less performant.
Lastly, adjustment of thread groups came into discussion as well as tuning by adding code to the kernel to handle the case where the thread position in grid is outside the destination image. Again, despite these considerations Accelerate remained as the fastest implementation.
I should add that one real advantage to using Metal is CPU usage, while it's no faster, it does significantly reduce the CPU's workload. For applications where the CPU is heavily loaded, the Metal approach might actually make sense.