ldoogy - 11 months ago
iOS Question

Using GPU for pixel format conversions?

I'm trying to very rapidly convert packed 24bpp RGB images to packed 32bpp RGBA. I've tried using vImageConvert_RGB888toRGBA8888, but was wondering if there is a faster way using a compute kernel in Metal. I've tried several different approaches in Metal, but the result is always considerably slower than with vImage, even for large images with >1M pixels.
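For reference, the per-pixel work is trivial: copy each 3-byte RGB triple and append an opaque alpha byte. A minimal scalar C sketch of the transform that vImageConvert_RGB888toRGBA8888 performs (the function name and loop here are illustrative, not Apple's implementation, which is heavily vectorized):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for packed RGB24 -> packed RGBA32.
 * This is only a correctness baseline; tuned implementations
 * process many pixels per iteration with SIMD loads/stores. */
static void rgb24_to_rgba32_scalar(const uint8_t *src, uint8_t *dst,
                                   size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++) {
        dst[4 * i + 0] = src[3 * i + 0]; /* R */
        dst[4 * i + 1] = src[3 * i + 1]; /* G */
        dst[4 * i + 2] = src[3 * i + 2]; /* B */
        dst[4 * i + 3] = 0xFF;           /* opaque alpha */
    }
}
```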

Here's what my compute kernel looks like:

kernel void rgb24_to_rgba32(texture2d<half, access::read> inTexture [[texture(0)]],
                            texture2d<half, access::write> outTexture [[texture(1)]],
                            uint2 id [[ thread_position_in_grid ]])
{
    // Each output pixel reads three consecutive single-channel source texels.
    uint2 srcAddr1 = uint2(id.x * 3,     id.y);
    uint2 srcAddr2 = uint2(id.x * 3 + 1, id.y);
    uint2 srcAddr3 = uint2(id.x * 3 + 2, id.y);

    outTexture.write(half4(inTexture.read(srcAddr1).r,
                           inTexture.read(srcAddr2).r,
                           inTexture.read(srcAddr3).r,
                           1), id);
}


I am defining the inTexture as a single-channel r8Unorm texture (so each source RGB pixel spans three texels, matching the kernel above), and the outTexture as an rgba8Unorm texture. Both textures are created directly on top of shared-memory buffers, so there shouldn't be any memory copies taking place.

The code works and the conversion is performed correctly, but the performance is unimpressive. I've tried different thread group size settings, but none of them achieve performance comparable to vImageConvert_RGB888toRGBA8888.
For example, on an A7 (first-generation iPad Air), a 1024x1024 image takes around 32 ms, compared to 6 ms using vImage. Interestingly, the difference is far smaller on a faster device such as the A9-based iPhone 6s (1.5 ms on the GPU vs. 1.1 ms using vImage), but the Metal implementation is always slower.

Is this just not a GPU-friendly operation (possibly due to the countless unaligned memory accesses)? Might I be missing something fundamental in terms of maximizing the performance of my compute kernel?

Answer Source

Just for the sake of closure, here is Apple's Developer Relations response to this question. The bottom line is that the GPU just doesn't offer any real advantages in this case because this conversion is not a computationally heavy operation.

After discussions with engineering, and after evaluating more sample implementations, the verdict is in on Metal vs. Accelerate performance for converting packed 24bpp RGB images to packed 32bpp RGBA images: on newer devices you can get close to the same performance using Metal, but Accelerate will be faster for this operation. “vImage is an extremely well-tuned implementation, and since this conversion operation is not compute heavy, the best we can do is to be at parity.”

The proposed reasoning behind this is data locality and efficiently operating on multiple pixels at a time (something you’ve mentioned). The fastest Metal implementation tested processed two pixels per thread and still lagged behind vImageConvert_RGB888toRGBA8888.
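The two-pixels-per-thread idea amortizes addressing overhead by reading six source bytes and writing eight destination bytes per step. A hedged CPU analogue of that access pattern (not Apple's tested kernel; assumes an even pixel count):

```c
#include <stddef.h>
#include <stdint.h>

/* Processes two pixels per loop iteration, mirroring the
 * two-pixels-per-thread Metal kernel's access pattern.
 * pixel_count is assumed even for this sketch. */
static void rgb24_to_rgba32_pairs(const uint8_t *src, uint8_t *dst,
                                  size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i += 2) {
        const uint8_t *s = src + 3 * i; /* 6 source bytes per iteration */
        uint8_t *d = dst + 4 * i;       /* 8 destination bytes per iteration */
        /* first pixel of the pair */
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = 0xFF;
        /* second pixel of the pair */
        d[4] = s[3]; d[5] = s[4]; d[6] = s[5]; d[7] = 0xFF;
    }
}
```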

There was an “optimized” implementation using Metal buffers rather than textures (something else that you’d mentioned exploring) and surprisingly this approach was slightly less performant.

Lastly, adjustment of thread group sizes came into the discussion, as did tuning by adding code to the kernel to handle the case where the thread position in the grid falls outside the destination image. Again, despite these considerations, Accelerate remained the fastest implementation.
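That out-of-bounds handling matters when the dispatch grid is rounded up to a multiple of the threadgroup size: edge threads must return early rather than write past the image. A CPU sketch of that guard (the grid simulation and names are illustrative; in the Metal kernel it would be a single `if (id.x >= width || id.y >= height) return;` at the top):

```c
#include <stdint.h>

/* Simulates a dispatch grid rounded up to a "threadgroup" multiple.
 * Threads whose position falls outside the destination image are
 * skipped, exactly as an early-return guard in the kernel would do. */
static void convert_with_guard(const uint8_t *src, uint8_t *dst,
                               uint32_t width, uint32_t height,
                               uint32_t group_w, uint32_t group_h)
{
    uint32_t grid_w = (width + group_w - 1) / group_w * group_w;
    uint32_t grid_h = (height + group_h - 1) / group_h * group_h;
    for (uint32_t y = 0; y < grid_h; y++) {
        for (uint32_t x = 0; x < grid_w; x++) {
            if (x >= width || y >= height)
                continue; /* the guard: skip out-of-image "threads" */
            const uint8_t *s = src + 3 * (y * width + x);
            uint8_t *d = dst + 4 * (y * width + x);
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = 0xFF;
        }
    }
}
```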

I should add that one real advantage to using Metal is CPU usage: while it's no faster, it does significantly reduce the CPU's workload. For applications where the CPU is heavily loaded, the Metal approach might actually make sense.