NVIDIA CUDA

Shared Memory and Warp

  • A GPU is composed of multiple SMs (Streaming Multiprocessors), which are designed to handle many threads simultaneously. These SMs are organized into larger groups called Graphics Processing Clusters (GPCs).
  • An SM includes one shared memory and multiple CUDA cores, together with ALUs, registers, an L1 cache, and control logic.
  • Shared memory is on-chip and accessible by all cores within the SM, facilitating fast data exchange and synchronization among threads. This is similar to the L1 cache but serves as a user-managed cache.
  • A warp is a group of 32 threads that execute the same instruction sequences together.
  • Control logic in SMs includes warp schedulers and instruction dispatch units. The warp schedulers are responsible for distributing warps to different CUDA cores based on availability and the nature of the tasks being executed. This scheduling is dynamic and aims to maximize the utilization of the SM’s resources at any given time. Instruction dispatch units ensure that all cores are fed with instructions as needed to keep them busy and productive.
  • The cores in an SM are grouped into processing blocks. If an SM consists of 4 processing blocks and each block has 32 cores, the SM can execute 4 warps at the same time.
  • Only as many threads as there are cores can execute at the same time. However, the number of resident threads that can be assigned to an SM is much higher than the number of cores. When an instruction to be executed by a warp requires a long-latency operation, such as reading from memory, a warp scheduler can select another resident warp to hide this latency. The architecture has zero overhead for switching out a currently executing warp for a warp that is ready for execution.
  • An SM can handle different types of warps such as vertices, pixels, and primitives at the same time.
  • Branch divergence, caused by constructs such as if-statements, hurts performance because the threads of a warp can no longer execute the same instruction and must follow divergent paths (a short example follows this list).
  • If only one thread in a warp takes a particular path, the other 31 threads remain idle during that computation. This serialization can lead to performance inefficiencies. However, NVIDIA’s newer architectures have improved on handling divergence through techniques such as Independent Thread Scheduling.
  • The Volta architecture introduced Independent Thread Scheduling. Unlike previous architectures where the entire warp executed the same instruction at a time, Independent Thread Scheduling allows individual threads within a warp to execute independently unless they need to synchronize or communicate. In other words, it allows threads within the same warp to independently execute different instructions when they diverge, thus reducing the penalties of divergence. This advancement helps to improve efficiency and performance in workloads where threads diverge in their execution paths due to conditional branches or varying data paths.
    • Modern GPU ISAs include mechanisms to manage divergence, such as splitting a divergent warp into multiple homogeneous groups that can be scheduled separately.
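    • For illustration, the following minimal sketch (a hypothetical kernel) contains a branch that splits each warp into two groups of 16 threads; the two paths are executed one after the other, with the threads on the inactive path masked off.
      __global__ void divergent(int* out)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;

          // Within each warp, lanes 0-15 and 16-31 take different branches,
          // so the two paths are serialized rather than executed together.
          if ((threadIdx.x & 31) < 16) {
              out[i] = i * 2;
          } else {
              out[i] = i + 100;
          }
      }
      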
  • Threads in a thread block can share data through shared memory and can be synchronized by calling __syncthreads(). In addition, the threads of a warp can be synchronized with __syncwarp().
  • Static allocation of shared memory can be declared as follows.
    __global__ void kernel()
    {
        __shared__ int m[512];
    }
    
  • Dynamic allocation of shared memory is declared as follows, with the size in bytes passed as the third argument of the kernel launch configuration. Only one shared memory array can be dynamically allocated this way. If multiple shared memory arrays are needed, declare one large shared memory array whose size is the sum of all the array sizes and divide the space with pointers, as in the sketch that follows the code below.
    extern __shared__ int m[];
    __global__ void kernel() { ... }
    int main()
    {
        kernel<<<grid-dim, block-dim, sizeof(int) * 512>>>();
    }
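    • For example, a single dynamic allocation can be partitioned into several logical arrays with pointer arithmetic. The following is a minimal sketch, assuming a hypothetical element count n, that carves an int array and a float array out of one extern block and passes the combined size at launch.
      extern __shared__ unsigned char smem[];

      __global__ void partitioned(int n)
      {
          // The first n ints, then n floats, share the single dynamic allocation.
          int*   a = reinterpret_cast<int*>(smem);
          float* b = reinterpret_cast<float*>(smem + n * sizeof(int));

          a[threadIdx.x] = threadIdx.x;
          b[threadIdx.x] = 1.0f;
      }

      int main()
      {
          const int n = 512;
          partitioned<<<1, n, n * (sizeof(int) + sizeof(float))>>>( n );
          cudaDeviceSynchronize();
      }
      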
    
  • The L1 cache shares the same on-chip memory space as the shared memory. Users can therefore indicate how much of it should serve as L1 cache through the cudaFuncSetCacheConfig() function. The cache configuration passed to this function is treated as a preference, so the actual split may vary depending on the kernel's requirements.
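    • For example, a kernel that depends heavily on shared memory can request the larger shared-memory split. This is a minimal sketch with a hypothetical kernel; the runtime treats the request only as a hint.
      __global__ void heavy_shared() { /* uses a lot of shared memory */ }

      int main()
      {
          // Ask for the configuration that favors shared memory over L1.
          cudaFuncSetCacheConfig( heavy_shared, cudaFuncCachePreferShared );
          heavy_shared<<<1, 32>>>();
          cudaDeviceSynchronize();
      }
      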
  • Shared memory consists of 32 memory banks. A memory bank is a module that manages access to memory. The banks can operate independently of each other, which means that up to 32 memory access requests can be processed simultaneously. Therefore, all threads in a warp can access data in shared memory simultaneously.
    • The fact that the shared memory consists of 32 banks also means that it is divided into 32 areas. Each area can only be accessed through the corresponding memory bank. If 32 threads access different bank areas, they can perform memory access simultaneously.
    • However, if multiple threads try to access the memory area of one bank, a bank conflict occurs and each thread accesses the shared memory sequentially. In other words, the accesses are serialized (see the sketch below).
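    • The following sketch contrasts the two cases, assuming a hypothetical 32x32 thread block: row-wise access lands in 32 different banks, while column-wise access of a 32-wide array maps all 32 threads of a warp to the same bank.
      __global__ void bank_example(float* out)
      {
          __shared__ float tile[32][32];

          // Conflict-free: consecutive threads of a warp touch consecutive
          // 4-byte words, which fall into different banks.
          tile[threadIdx.y][threadIdx.x] = threadIdx.x;
          __syncthreads();

          // 32-way bank conflict: the threads of a warp read one column, and
          // elements 32 floats apart map to the same bank, so the accesses
          // are serialized.
          out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
      }
      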

Register

  • The register file is a large array of registers shared among the threads of all warps resident on a multiprocessor. Each thread has its own set of registers, and the number of registers available per thread can limit the number of threads (and hence warps) that can be active at a time on a multiprocessor, because the total number of registers in a multiprocessor is fixed.
  • Since every resident thread keeps its context in the SM's register file, there is no need to copy the context to memory or read it back. Because context switches are essentially free, it is commonly recommended to launch at least 3-4 times, and up to about 10 times, more threads than there are cores. However, due to limited resources, a thread may not be allocated all the registers it needs, in which case a context switch cost is incurred.
    • An active warp is one whose threads have all been allocated the register space they need, which means the warp can be executed immediately.
    • The context switch cost between active warps is close to zero, but a context switch involving an inactive warp is not free. An inactive warp has not been allocated the register space it needs, so during the switch it must take over the registers of an active warp. The warp whose registers are taken must then save the values in its registers to memory, and the warp taking them over must restore its own context from memory. In other words, a high context switch cost is incurred.
  • The specific allocation of registers is handled by the compiler and hardware. When a kernel is launched, the compiler first analyzes the kernel’s code to determine the number of registers required per thread. This analysis takes into account not only the variables declared but also temporary registers needed for intermediate calculations and to store state information for context switches between threads.
  • Register spilling is a technique used when the number of registers required exceeds the number available. In such cases, some register contents are temporarily moved to local memory, which is off-chip memory (a sketch for inspecting per-kernel register and local memory usage follows this list).
    • Local memory pertains to memory that is private to each thread. This memory type is used to store local variables specific to a thread’s execution context.
    • Local memory is physically stored in global memory. However, it is cached on-chip for fast access.
    • Local memory is not intended for large data storage but rather for temporary, per-thread data that cannot be accommodated in registers.
    • Access to local memory is slower than shared memory but faster than accessing uncached global memory directly.
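    • As a sketch for checking whether spilling has occurred, cudaFuncGetAttributes() reports the registers per thread and the per-thread local memory of a compiled kernel; the kernel here is a hypothetical placeholder.
      #include <cstdio>

      __global__ void kernel() { /* ... */ }

      int main()
      {
          cudaFuncAttributes attr;
          cudaFuncGetAttributes( &attr, kernel );

          // numRegs: registers per thread. localSizeBytes: per-thread local
          // memory, nonzero when registers have been spilled (or when local
          // arrays are used).
          printf( "registers per thread: %d\n", attr.numRegs );
          printf( "local memory per thread: %zu bytes\n", attr.localSizeBytes );
      }
      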
  • Dynamic allocation of registers allows for adjustments in the number of registers allocated per thread, based on the actual usage patterns observed during execution.

Thread Block

  • Thread blocks are allocated to SMs, and multiple blocks can be assigned to one SM.
  • Thread blocks are distributed evenly and sequentially across SMs and processed.
  • The choice of block size, that is the number of threads per block, is a key factor in the performance tuning of GPU applications.
  • The number of thread blocks that can be assigned to one SM depends on the program, in particular on the number of registers each thread of the program uses relative to the total number of registers in an SM.
  • Each block is executed by only one SM, and it remains on that SM until its execution is complete. Within an SM, multiple blocks can be active at the same time, depending on the resources required by each block and the total resources available on the SM. The blocks are executed concurrently, and the SM dynamically switches between different warps to hide latencies.
  • Even if some registers remain in an SM after thread blocks have been allocated to it, they are left unused when they cannot accommodate a whole additional thread block, since a thread block cannot be broken into smaller parts.
  • If the number of threads in a thread block is not divisible by 32, the number of threads in the last warp is less than 32. So it is more efficient for the number of threads in a thread block to be a multiple of 32.
  • An active block is a block that has all the necessary memory resources allocated to it. Unlike an active warp that only needs to secure register space, the memory resources that a block must allocate to become active are registers and shared memory space.
    • As with context switching between warps, context switching between active blocks has zero overhead, whereas context switching for an inactive block incurs a high cost.
  • A thread block corresponds to a work group in an OpenGL compute shader. The corresponding built-in variables between CUDA and the OpenGL compute shader are as follows, and a sketch using them follows this list.
    • threadIdx ≡ gl_LocalInvocationID
    • blockIdx ≡ gl_WorkGroupID
    • blockDim ≡ gl_WorkGroupSize
    • gridDim ≡ gl_NumWorkGroups
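    • For example, the usual global thread index is built from these variables; it plays the role of gl_GlobalInvocationID in a compute shader. The kernel below is a hypothetical sketch.
      __global__ void scale(float* data, int n)
      {
          // blockIdx.x * blockDim.x + threadIdx.x corresponds to
          // gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) data[i] *= 2.0f;
      }
      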

Constant Memory and Texture Memory

  • Constant memory is read-only over a kernel execution, and uses a dedicated on-chip memory called constant cache.
    • cudaMemcpyToSymbol() should be used for copying to constant memory, whereas cudaMemcpy() is used for global memory (see the sketch after this list).
    • NVIDIA hardware can broadcast a single memory read to each half-warp which is a group of 16 threads. If every thread in a half-warp requests data from the same address in constant memory, only a single read request is generated and the data can be broadcast to the other threads.
    • The hardware can aggressively cache the constant data since the memory is read-only. Once an address has been read, there is no additional memory traffic when other half-warps request the same address.
    • The downside appears when the 16 threads of a half-warp request different addresses: the 16 reads are serialized, taking 16 times as long to issue, whereas from global memory the requests could have been issued at the same time.
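    • A minimal sketch of declaring constant memory and filling it with cudaMemcpyToSymbol(), assuming a small hypothetical coefficient table:
      __constant__ float coeffs[16];

      __global__ void apply_coeff(float* data, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          // Every thread of a half-warp reads the same address, so the read is broadcast.
          if (i < n) data[i] *= coeffs[0];
      }

      int main()
      {
          float host_coeffs[16] = { 2.0f };
          // Constant memory is filled with cudaMemcpyToSymbol(), not cudaMemcpy().
          cudaMemcpyToSymbol( coeffs, host_coeffs, sizeof(host_coeffs) );
          // ... allocate device data, launch apply_coeff, and copy the results back ...
      }
      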
  • Texture memory can be used for general purpose computing. Texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality.

Streams

  • The standard allocation functions such as malloc() allocate pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory, or pinned memory (a sketch of its use follows this list).
    • The OS guarantees that it will never page this pinned memory out to disk, which ensures its residency in physical memory.
  • The CUDA driver uses DMA (Direct Memory Access) to copy data to or from the host. Since DMA copies proceed without CPU intervention, the CPU could simultaneously be paging the data out or relocating its physical address by updating the OS's page tables. This possibility is why pinned memory is essential for a DMA copy.
    • With pageable memory, the copy happens twice, first from a pageable system buffer to a temporary pinned memory and then from the pinned memory to the GPU.
    • With pinned memory, the copy happens only once avoiding the cost of the transfer between pageable and pinned memory.
    • Since pinned memory can never be swapped out to disk, allocating too much of it will make the system run out of physical memory much faster.
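    • A minimal sketch of allocating, using, and releasing pinned memory (the buffer size is arbitrary):
      int main()
      {
          const size_t bytes = 1 << 20;
          float* h_pinned = nullptr;
          float* d_buf = nullptr;

          cudaHostAlloc( (void**)&h_pinned, bytes, cudaHostAllocDefault ); // page-locked host buffer
          cudaMalloc( (void**)&d_buf, bytes );

          // A single DMA copy; no staging through an intermediate pinned buffer.
          cudaMemcpy( d_buf, h_pinned, bytes, cudaMemcpyHostToDevice );

          cudaFree( d_buf );
          cudaFreeHost( h_pinned ); // pinned memory is released with cudaFreeHost()
      }
      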
  • A stream is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
    • When no stream is specified, the default stream, which is also called the null stream, is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device.
    • No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation in any stream on the device will begin.
  • In the default stream, data transfers using cudaMemcpy() are blocking, while the kernel launch is asynchronous.
    cudaMemcpy( ..., cudaMemcpyHostToDevice );
    kernel<<<...>>>( ... );
    cudaMemcpy( ..., cudaMemcpyDeviceToHost );
    

    Since the host-to-device data transfer is blocking, the CPU thread does not reach the kernel call until that transfer is complete; this is why cudaDeviceSynchronize() is unnecessary after cudaMemcpy(). On the other hand, once the kernel is issued, the CPU thread moves on to the device-to-host transfer, but that transfer cannot begin until the kernel finishes because the device executes operations in order. The asynchronous behavior of kernel launches makes overlapping device and host computation very simple.

    cudaMemcpy( ..., cudaMemcpyHostToDevice );
    kernel<<<...>>>( ... );
    CpuFunction( ... );
    cudaMemcpy( ..., cudaMemcpyDeviceToHost );
    

    As soon as the kernel() is launched on the device, the CPU thread executes CpuFunction(), overlapping its execution on the CPU with the kernel execution on the GPU.

  • To issue a data transfer to a non-default stream, the cudaMemcpyAsync() function is required, which takes a stream identifier. With asynchronous execution, the host proceeds to the next line of host code as soon as the copy command is issued, even though the transfer is still using the host memory region; if that region were ordinary pageable memory, the operating system could move it (for example, page it out) while the transfer is in progress. This is why the host memory must be pinned memory that always resides in physical memory: it ensures that the asynchronous data transfer operates correctly.
  • cudaStreamSynchronize() blocks the host thread until all previously issued operations in the specified stream have completed, while cudaDeviceSynchronize() blocks the host thread until all previously issued operations on the device have completed.
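    • A minimal sketch combining the two previous points: the copies and the kernel are issued into one non-default stream and then waited on. The names d_buf, h_pinned, bytes, grid, block, and kernel are assumed from context, and h_pinned is assumed to be pinned memory.
      cudaStream_t stream;
      cudaStreamCreate( &stream );

      // Operations in the same stream still execute in order with respect to each other.
      cudaMemcpyAsync( d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream );
      kernel<<<grid, block, 0, stream>>>( d_buf );
      cudaMemcpyAsync( h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream );

      cudaStreamSynchronize( stream ); // block the host until this stream has drained
      cudaStreamDestroy( stream );
      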
  • Kernel execution can be overlapped with data transfers if deviceOverlap field of a cudaDeviceProp struct is set.
    • The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams. Besides, the host memory involved in the data transfer must be pinned memory.
    • The hardware has independent engines that handle copies and kernel executions, and the number of engines varies depending on the hardware.
    • The general guideline is to issue work breadth-first (round-robin across streams), but it is better to check first what kinds of engines the hardware has. For example, the C1060 has a single copy engine and a single kernel engine, while the C2050 has two copy engines, one for host-to-device transfers and another for device-to-host transfers, as well as a single kernel engine.
  • In general, about four streams is appropriate. Therefore, rather than performing the data transfer and kernel execution only once per stream, splitting the data into smaller chunks and issuing the copy-kernel-copy sequence repeatedly across the streams allows more of each stream's data transfers to overlap (see the sketch below).
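    • A sketch of this chunking with four streams. The names h_a, d_a, n, and kernel are hypothetical, h_a is assumed to be pinned, and n is assumed to be divisible by the number of streams.
      const int num_streams = 4;
      const size_t chunk = n / num_streams;

      cudaStream_t streams[num_streams];
      for (int i = 0; i < num_streams; ++i) cudaStreamCreate( &streams[i] );

      // Each stream receives its own chunk's copy-kernel-copy sequence, so
      // transfers in one stream can overlap with kernels running in another.
      for (int i = 0; i < num_streams; ++i) {
          const size_t offset = i * chunk;
          cudaMemcpyAsync( d_a + offset, h_a + offset, chunk * sizeof(float),
                           cudaMemcpyHostToDevice, streams[i] );
          kernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>( d_a + offset );
          cudaMemcpyAsync( h_a + offset, d_a + offset, chunk * sizeof(float),
                           cudaMemcpyDeviceToHost, streams[i] );
      }

      for (int i = 0; i < num_streams; ++i) cudaStreamSynchronize( streams[i] );
      for (int i = 0; i < num_streams; ++i) cudaStreamDestroy( streams[i] );
      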

Tips

  • The maximum size of a thread block is limited, so it is helpful to memorize it.
    • Maximum x- or y-dimensionality of a block is 1024.
    • Maximum z-dimension of a block is 64.
    • Maximum number of threads per block is 1024.
  • For IO-bound algorithms, increasing the number of threads can help improve GPU utilization. On the other hand, for compute-bound algorithms, using too many threads can actually hurt performance. The latter is the case, for example, when data that has been read once is reused repeatedly or when complex formulas are being computed.
  • In some performance experiments, optimal performance was achieved when the number of thread blocks was exactly twice the number of multiprocessors the GPU contains.
    cudaDeviceProp property;
    cudaGetDeviceProperties( &property, 0 );
    const int blocks = property.multiProcessorCount * 2;
    kernel<<<blocks, ...>>>( ... );
  • Ironically, although atomic operations cause performance degradation, alleviating the slowdown sometimes involves using more atomics, not fewer. The core problem is not the use of atomics per se, but the fact that too many threads are competing for access to a relatively small number of memory addresses.
    • Code like the following can result in millions of threads competing for a single memory location.
      __global__ void increase(int* a)
      {
          atomicAdd( a, 1 );
      }
      
    • However, if shared memory is used, the synchronization scope can be reduced to the threads within a block. Since only one thread per block calls the atomic function on a, the number of threads participating in the grid-level synchronization is equal to the number of thread blocks.
      __global__ void increase(int* a)
      {
          __shared__ int shared_a;
          if (threadIdx.x == 0) shared_a = 0;
          __syncthreads();
      
          atomicAdd( &shared_a, 1 ); // block-level
          __syncthreads();
      
          if (threadIdx.x == 0) atomicAdd( a, shared_a ); // grid-level
      }
      
    • The synchronization scope can be further reduced by using __syncwarp().
  • It is often difficult to achieve the desired synchronization by simply using the built-in synchronization functions. In this case, manual control can be performed using directly designed logic.
    • Suppose that within a thread block, even and odd threads perform different tasks, but the even threads must wait until all other even threads have completed their tasks.
    • In this case, synchronization is required only among the even-numbered threads, so __syncthreads() cannot be used. The problem also cannot be solved with atomic functions alone, because it is not about synchronizing access to specific data.
    • The idea is this. First, declare a shared variable lock for the even threads and initialize it to zero. After each even thread completes its work, it increments lock by 1; an atomic function is used so that multiple threads do not update lock at the same time. Each even thread then waits until the value of lock becomes equal to the number of even threads in the block.
      __global__ void kernel()
      {
          __shared__ int lock;
          if (threadIdx.x == 0) lock = 0;
          __syncthreads();

          if (threadIdx.x % 2 == 0) {
              // work for even threads

              atomicAdd( &lock, 1 );
              // Spin until every even thread has arrived; the atomic read
              // prevents the compiler from caching the shared value.
              while (atomicAdd( &lock, 0 ) < blockDim.x / 2);
          }
          else {
              // work for odd threads
          }

          // work for all threads

          __syncthreads();
      }
      
  • It is possible for the GPU to directly access the CPU data using zero-copy memory if canMapHostMemory field of a cudaDeviceProp struct is set. The performance is different for discrete GPUs and integrated GPUs.
    • Integrated GPUs are graphics processors built into a system’s chipset and usually share regular system memory with the CPU. For integrated GPUs, the use of zero-copy memory is always a performance win because the memory is physically shared with the host anyway.
    • In cases where inputs and outputs are used exactly once, a performance enhancement can be observed when using zero-copy memory with a discrete GPU. But since the zero-copy memory is not cached on the GPU, in situations where the memory gets read multiple times, a large penalty is paid that could be avoided by simply copying the data to the GPU first.
    • integrated field of a cudaDeviceProp struct is set if the device is an integrated GPU and not otherwise.
    • cudaHostAlloc() has a flag, cudaHostAllocMapped, for this. cudaHostAlloc() returns a CPU pointer, and since the GPU has a different virtual address space than the CPU, a valid GPU pointer for this memory must be obtained by calling cudaHostGetDevicePointer() (see the sketch below).
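    • A minimal sketch of mapped (zero-copy) memory, assuming the device supports host-memory mapping and that grid, block, and kernel come from context:
      float* h_data = nullptr;
      float* d_alias = nullptr;
      const size_t bytes = 1 << 20;

      // Pinned host memory that is also mapped into the device address space.
      cudaHostAlloc( (void**)&h_data, bytes, cudaHostAllocMapped );

      // The GPU uses its own virtual address for the same physical memory.
      cudaHostGetDevicePointer( (void**)&d_alias, h_data, 0 );

      kernel<<<grid, block>>>( d_alias ); // the kernel accesses host memory directly
      cudaDeviceSynchronize();

      cudaFreeHost( h_data );
      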
  • Pages can appear pinned to a single CPU thread only. If the pointer to the pinned memory is shared between threads, the other threads will see the buffer as standard, pageable data. As a remedy to this problem, this pinned memory can be allocated as portable.
    • cudaHostAlloc() has a flag, cudaHostAllocPortable, for this.


