Jiajun Yang Jiajun Yang - 2 months ago 15
Python Question

OpenCL, PyopenCL, why GTX970 only has 13 compute units? How to find out the amount of processes in parallel

Can anyone tell me why OpenCL told me my Nvidia Geforce GTX 970 only has 13 maximum compute units? Also, does maximum compute units equal to execution units (EUs)? Because on my Iris 6100 laptop, device.max_compute_units is 48, which is the same as the EUs of the graphic card.

import pyopencl as cl
platform = cl.get_platforms()[0]
device = platform.get_devices()[0] # Get the GPU ID
print device.max_compute_units


Can someone shed some light on this issue? I am trying to figure out how many processes can be carried out in parallel here. So maybe I am looking at the wrong parameter?

Thousand thanks...

Answer

You can't query number of cores, only number of compute units. Intel integrated GPU generally has only 8 cores per compute unit and Nvidia has 192 or 128 cores per compute unit. Max_compute_units is the number of compute units and should be usable in device partition to find(and limit) the partitioning but generally only CPUs are supported for device partitioning.

How many processes(workitems) can be carried out in parallel depends on the hardware's capability. For an AMD graphics card, there can be as many as 40 times the number of cores (which is 64 times the number of compute units). For example, a 8 compute unit AMD GPU can have 20k(8 * 64 * 40) threads in-flight and many more in a single queue.

The maximum compute unit number can be altered from the drivers, I've seen an integrated Intel GPU showing only 8 compute units despite having 12, in beta drivers and also new AMD GPU gained support for reserving some of compute units for audio computing in some applications so the general purpose compute kernels may see only the remaining compute units in those apps.

If a Nvidia GPU has only 13 compute units and drivers enable all 13 for compute, then OpenCL can use all 13 of them. GTX970 has more cores in total than Intel Iris GPU.

Workgroups of an enqueued OpenCL kernel are executed by compute units as a whole so each workgroup's workitems share same memory of the compute unit with other workitems in same workgroup. But some vendors can stretch the rules a bit and more Compute units can be used collaboratively for a single workgroup . Such as Intel igpu.