François Laenen - 1 month ago
C++ Question

CUDA | Relevance of the number of multiprocessors - confusion with SMs

I have an NVIDIA GT650M with the following properties:

( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
Maximum number of threads per multiprocessor: 2048

I have just come out of a confusion between streaming multiprocessors (SMs) and the actual multiprocessors. SMs and multiprocessors are different things, right?
For example, using the Visual Profiler, I have a dummy kernel that only waits and lasts 370 ms when launched with 1 block of 1 thread.
I can launch it with 4 blocks of 1024 threads on one SM, and it still lasts 370 ms. This is expected, because the task uses the 2 multiprocessors of the chip, each one running 2048 concurrent threads (as soon as I use 5 blocks x 1024, it takes 740 ms, as expected).
Similarly, I can concurrently launch one block of 1024 threads 4 times using 4 SMs, and it still takes 370 ms, which is fine.

The first part of my question is just to make sure that we shouldn't confuse SMs and multiprocessors, as I sometimes see done even in answers such as this one: CUDA - Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship?
As a result, one cannot explicitly control the way tasks are scheduled across the multiprocessors, because (as far as I know) no runtime function permits it, right? So, if I have a card with 2 multiprocessors and 2048 threads per multiprocessor, or another one with 4 multiprocessors and 1024 threads each, will a given program be executed the same way?

Secondly, I wanted to know which is better for which usage: having more multiprocessors with few cores, or the reverse? So far, my understanding is that more multiprocessors (for a given maximum number of threads per multiprocessor) with few cores each are better suited to massive parallelism with few or simple operations, while with more cores per multiprocessor (now I'm talking about things I barely know) there are more dedicated ALUs for load/store operations and complex mathematical functions, so that layout is better suited to kernels requiring more operations per thread. Is that right?


This seems to be confusion over terminology.

"SM" (SM = Streaming Multiprocessor) and "multiprocessor" refer to the same thing, a hardware unit that is the principal execution unit on the GPU. These terms refer to specific HW resources. Different GPUs may have differing numbers of SMs. The number of SMs can be found for a particular GPU using the CUDA deviceQuery sample code.

The elements of a CUDA program that are in the "launch" are threadblocks. A grid is the collection of all threadblocks associated with a kernel launch. Individual threadblocks execute on individual SMs. You can launch a large number of threadblocks in a kernel, more or less independent of what GPU you are running on. The threadblocks will then be processed at whatever rate is afforded by the particular GPU and its SMs.
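
To make "whatever rate is afforded" concrete for the timings in the question, here is a rough host-side estimate. It is a sketch under stated assumptions: only the thread and block limits per SM are considered (not registers or shared memory), every "wave" of resident blocks is assumed to take the same fixed time (reasonable for a kernel that only waits), and the limit of 16 resident blocks per SM is an assumed value for this generation of GPU. GpuLimits, residentBlocks, waves and estimatedMs are hypothetical names.

```cpp
#include <algorithm>

// Hypothetical summary of the per-SM limits used in this estimate.
struct GpuLimits {
    int smCount;          // number of SMs (multiprocessors)
    int maxThreadsPerSm;  // maximum resident threads per SM
    int maxBlocksPerSm;   // maximum resident blocks per SM (assumed value)
};

// Blocks that can be resident on the whole GPU at the same time,
// considering only the thread and block limits.
// threadsPerBlock must be positive.
int residentBlocks(const GpuLimits& gpu, int threadsPerBlock) {
    int perSm = std::min(gpu.maxThreadsPerSm / threadsPerBlock,
                         gpu.maxBlocksPerSm);
    return perSm * gpu.smCount;
}

// Number of successive waves needed to run all blocks of one launch.
// Returns 0 if a single block does not fit on an SM at all.
int waves(const GpuLimits& gpu, int threadsPerBlock, int blocks) {
    int resident = residentBlocks(gpu, threadsPerBlock);
    if (resident == 0) return 0;
    return (blocks + resident - 1) / resident;  // ceiling division
}

// Estimated duration, assuming every wave takes the same fixed time.
double estimatedMs(const GpuLimits& gpu, int threadsPerBlock, int blocks,
                   double msPerWave) {
    return waves(gpu, threadsPerBlock, blocks) * msPerWave;
}
```

With GpuLimits{2, 2048, 16} and 370 ms per wave, 4 blocks of 1024 threads fit in a single wave (370 ms), while 5 blocks need two waves (740 ms), which is consistent with the timings reported in the question.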

There is no API function which gives direct control over the scheduling of threadblocks onto SMs. Some level of indirect control for scheduling of threadblocks from different kernels that are running concurrently can be obtained through the use of CUDA stream priorities.
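
As an illustration of that indirect control, the sketch below creates one low-priority and one high-priority stream and launches the same kernel into each. It is a minimal example, not a recommended pattern: busyKernel, blocksFor and launchWithPriorities are hypothetical names, and the priorities are only a hint to the block scheduler. Note that a numerically lower value means a higher priority, and that devices without stream priority support report the same value for both ends of the range.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); stands in for real work.
__global__ void busyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Number of blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Launches busyKernel in a low- and a high-priority stream.
// dLow and dHigh must be device pointers to at least n floats each.
cudaError_t launchWithPriorities(float* dLow, float* dHigh, int n) {
    int leastPriority = 0, greatestPriority = 0;
    cudaError_t err =
        cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    if (err != cudaSuccess) return err;

    cudaStream_t low, high;
    err = cudaStreamCreateWithPriority(&low, cudaStreamNonBlocking,
                                       leastPriority);
    if (err != cudaSuccess) return err;
    err = cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking,
                                       greatestPriority);
    if (err != cudaSuccess) {
        cudaStreamDestroy(low);
        return err;
    }

    const int threads = 256;
    const int blocks = blocksFor(n, threads);
    // Blocks from the high-priority stream are preferred when both
    // kernels have blocks waiting to be scheduled onto SMs.
    busyKernel<<<blocks, threads, 0, low>>>(dLow, n);
    busyKernel<<<blocks, threads, 0, high>>>(dHigh, n);

    err = cudaGetLastError();                      // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    cudaStreamDestroy(low);
    cudaStreamDestroy(high);
    return err;
}
```

Even with priorities set, there is still no way to pin a given threadblock to a given SM; the priority only influences which pending blocks are picked first.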