Selecting block size and grid size, and calculating occupancy, in CUDA programming

Posted by HanHaowen on December 16, 2023

Note: this is a beginner-level article; feedback is welcome.

Chinese version: https://juejin.cn/post/7312338839695114267

In CUDA programming, occupancy is the ratio of active warps to the maximum number of active warps that a streaming multiprocessor (SM) can host. At execution time, each block is assigned to an SM: an SM can execute multiple blocks concurrently, while a block executes only within a single SM, as described in https://juejin.cn/post/7312275586255470642:

The hardware level corresponding to a block is the SM. The SM provides the hardware resources needed for communication, synchronization, and other interactions among threads of the same block. Communication across SMs is not supported, so all threads of a block execute on the same SM. Furthermore, because threads may need to synchronize with one another, once a block starts executing on an SM, all of its threads execute on that SM at the same time (concurrently, though not necessarily in parallel). In other words, scheduling a block onto an SM is atomic. An SM may execute more than one block concurrently: if it has enough free resources to accommodate a block, that block can be scheduled onto it immediately. The resources in question typically include registers, shared memory, and various scheduling-related resources.

Higher occupancy is not always better, but in general a higher value is desirable. Occupancy is strongly influenced by the block size and grid size, which in turn affect how well the GPU is utilized.
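Incidentally, you do not have to compute occupancy by hand: the CUDA Runtime can report, for a given kernel and launch configuration, how many blocks fit on one SM. Below is a minimal sketch of this query; the kernel `myKernel` and its parameters are made up for illustration.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; occupancy depends on its register/shared-memory usage.
__global__ void myKernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    const int blockSize = 256;        // threads per block
    const size_t dynamicSmem = 2048;  // dynamic shared memory per block, bytes

    // Ask the runtime how many blocks of this kernel fit on one SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, dynamicSmem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = active warps / maximum resident warps per SM.
    int warpsPerBlock = blockSize / prop.warpSize;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    double occupancy = (double)(blocksPerSM * warpsPerBlock) / maxWarpsPerSM;

    printf("active blocks per SM: %d, occupancy: %.0f%%\n",
           blocksPerSM, occupancy * 100.0);
    return 0;
}
```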

Three factors determine the theoretical occupancy for a given block size and grid size:

1. The maximum number of warps and blocks per streaming multiprocessor (SM). A “warp” is a group of threads executing the same instruction, and a block is a group of warps. Multiple thread blocks are assigned to an SM, and multiple SMs make up the whole GPU, which executes the entire kernel grid (source: Wikipedia). On all current NVIDIA GPUs, a warp consists of 32 threads.

2. The number of registers per SM.

3. The amount of shared memory per SM.

These three factors impose a “bucket effect” on occupancy: like the shortest stave of a bucket, the scarcest resource sets the limit. In practice, we compute the maximum number of blocks each SM can host under each factor, take the minimum, and then convert that block count into a warp count to obtain the number of active warps.
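To make this “take the minimum” step concrete, here is a small sketch of the calculation. It is a simplification under assumed inputs: it ignores the allocation-granularity rounding discussed below, and the struct fields are illustrative names, not an official API.

```cpp
#include <algorithm>

// Simplified per-SM limits; actual values depend on compute capability.
struct SmLimits {
    int maxThreadsPerSM;  // max resident threads per SM (e.g. 1536)
    int maxBlocksPerSM;   // max resident blocks per SM
    int registersPerSM;   // 32-bit registers per SM (e.g. 65536)
    int sharedMemPerSM;   // shared memory per SM in bytes (e.g. 102400)
};

// Maximum blocks one SM can host: each resource imposes its own limit,
// and the scarcest one (the shortest stave of the bucket) wins.
int maxActiveBlocksPerSM(const SmLimits& sm, int blockSize,
                         int regsPerThread, int smemPerBlockBytes) {
    int byThreads   = std::min(sm.maxThreadsPerSM / blockSize,
                               sm.maxBlocksPerSM);
    int byRegisters = sm.registersPerSM / (blockSize * regsPerThread);
    int bySharedMem = sm.sharedMemPerSM / smemPerBlockBytes;
    return std::min({byThreads, byRegisters, bySharedMem});
}
```

Plugging in the numbers from the example below (and assuming a resident-block limit of 16 for this compute capability), `maxActiveBlocksPerSM({1536, 16, 65536, 102400}, 256, 32, 3072)` returns min(6, 8, 33) = 6.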

The page https://xmartlabs.github.io/cuda-calculator/ provides a calculator for the theoretical occupancy. Using an example from that page, we illustrate how the theoretical occupancy is derived from the three factors above.

[Screenshot of the occupancy calculator showing the example inputs and the computed per-SM limits]

After selecting the options that describe the hardware, such as the CUDA version and compute capability, the remaining inputs are the block size (256 in this example), the number of registers required per thread (32), and the shared memory required per block (2048 bytes).

1. First, consider the maximum number of warps and blocks per SM. The “Threads per Multiprocessor” value shows that this hardware can execute at most 1536 threads per SM, which translates into 1536/256 = 6 blocks. This matches “Limited by Max Warps / Blocks per Multiprocessor = 6” in the screenshot.

2. Next, consider the number of registers per SM. Each block requires 256×32 = 8192 registers, and “Total # of 32-bit registers per Multiprocessor” is 65536, so at most 65536/8192 = 8 blocks fit under the register limit. Values such as “Max registers per Block” and “Register allocation unit size” can also matter. For example, a “Register allocation unit size” of 256 means registers are allocated to each warp in units of 256; since each warp here needs 32×32 = 1024 = 256×4 registers, the rounding has no effect in this example. The other related “Max” values are not discussed here.

3. Lastly, consider the shared memory per SM. Each block requires 2048+1024 = 3072 bytes of shared memory (note: the CUDA Runtime reserves 1024 bytes of shared memory per thread block). “Shared Memory per Multiprocessor (bytes)” is 102400, so at most $\lfloor 102400/3072 \rfloor = 33$ blocks fit under the shared-memory limit. This matches “Limited by Shared Memory per Multiprocessor = 33” in the screenshot. The “Shared Memory Allocation unit size” of 128 bytes is the allocation granularity; since 3072 = 128×24, it has no effect in this example.

Finally, following the bucket effect, we take the minimum of the three limits, which is 6. This matches “Active Thread Blocks per Multiprocessor = 6” in the screenshot. With a block size of 256 and 32 threads per warp, each block contains 256/32 = 8 warps, so each SM executes 6×8 = 48 warps; note “Warps per Multiprocessor = 48” in the screenshot. The occupancy of each multiprocessor is therefore 48/48 = 1, i.e. 100%.
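On the “selection” side of the title: instead of trying block sizes by hand, the CUDA Runtime can also suggest a block size that maximizes occupancy for a given kernel. The sketch below is a minimal illustration; `scaleKernel` and the problem size are made up for the example.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; the suggested block size depends on its resource usage.
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0;  // minimum grid size needed to reach full occupancy
    int blockSize = 0;    // block size that maximizes occupancy

    // 0 bytes of dynamic shared memory, no upper limit on block size.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       scaleKernel, 0, 0);

    // The grid size is then chosen to cover the whole problem.
    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);
    return 0;
}
```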