NVIDIA CUDA

Shared Memory and Warp

  • A GPU is composed of multiple SMs (Streaming Multiprocessors), each designed to handle many threads simultaneously. These SMs are organized into larger groups called Graphics Processing Clusters (GPCs).
  • An SM includes one shared memory and multiple cores (CUDA Cores and other ALUs), as well as registers, an L1 cache, and control logic.
  • Shared memory is on-chip and accessible by all cores within the SM, facilitating fast data exchange and synchronization among threads. This is similar to the L1 cache but serves as a user-managed cache.
  • A warp is a group of 32 threads that execute the same instruction sequences together.
  • Control logic in SMs includes warp schedulers and instruction dispatch units. The warp schedulers are responsible for distributing warps to different CUDA cores based on availability and the nature of the tasks being executed. This scheduling is dynamic and aims to maximize the utilization of the SM’s resources at any given time. Instruction dispatch units ensure that all cores are fed with instructions as needed to keep them busy and productive.
  • The cores in an SM are grouped into processing blocks. If an SM consists of 4 processing blocks and each block has 32 cores, the SM can execute 4 warps at the same time.
  • The number of threads that can execute at the same time is limited by the number of cores. However, the number of resident threads that can be assigned to an SM is much higher than the number of cores. When an instruction to be executed by a warp needs a long-latency operation, such as reading from memory, a warp scheduler can select another resident warp to hide this latency. The architecture switches a currently executing warp out for a warp ready for execution with zero overhead.
  • An SM can handle different types of warps such as vertices, pixels, and primitives at the same time.
  • Branch divergence, caused for example by if-statements, hurts performance because the warp can no longer execute a single common instruction and the divergent paths must be serialized.
    • Each warp has a SIMT stack composed of multiple entries.
    • Each entry has a PC (Program Counter) and a 32-bit field.
    • This 32-bit field acts as an active mask that decides whether each thread in the warp executes the instruction pointed to by the PC.
  • If only one thread in a warp takes a particular path, the other 31 threads remain idle during that computation. This serialization can lead to performance inefficiencies. However, NVIDIA’s newer architectures have improved on handling divergence through techniques such as Independent Thread Scheduling.
  • Volta architecture introduced Independent Thread Scheduling. Unlike previous architectures where the entire warp executed the same instruction at a time, Independent Thread Scheduling allows individual threads within a warp to execute independently unless they need to synchronize or communicate. In other words, it allows threads within the same warp to execute independently different instructions if they diverge, thus reducing the penalties of divergence. This advancement helps to improve efficiency and performance in workloads where threads diverge in their execution paths due to conditional branches or varying data paths.
    • Modern GPU ISAs include mechanisms to manage divergence, such as splitting a divergent warp into multiple homogeneous groups that can be scheduled separately.
  • Threads in a thread block can share data through shared memory and can be synchronized by calling __syncthreads(). In addition, a warp can be synchronized with __syncwarp(), as in the sketch below.
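  A minimal sketch of block-level cooperation through shared memory and __syncthreads(), assuming a hypothetical kernel launched with 256 threads per block and an input length that is a multiple of 256:

    __global__ void BlockSum( const float* in, float* blockSums )
    {
        __shared__ float cache[256];           // shared by all threads in the block
        const int tid = threadIdx.x;
        cache[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                       // every element must be written before the reduction starts

        // Tree reduction within the block; each step halves the number of active threads.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) cache[tid] += cache[tid + stride];
            __syncthreads();                   // keep the block in lockstep between reduction steps
        }
        if (tid == 0) blockSums[blockIdx.x] = cache[0];
    }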

Register

  • The register file is a large array of registers that is shared among the threads resident on a multiprocessor. Each thread has its own set of registers, and the number of registers required per thread can impact the number of threads (and hence warps) that can be active at a time on a multiprocessor. This is because the total number of registers in a multiprocessor is fixed.
  • The specific allocation of registers is handled by the compiler and hardware. When a kernel is compiled, the compiler analyzes the kernel’s code to determine the number of registers required per thread. This analysis takes into account not only the variables declared but also temporary registers needed for intermediate calculations and to store state information for context switches between threads. The chosen register count can be inspected at run time, as in the sketch after this list.
  • Register spilling is a technique used when the number of registers required exceeds the number available. In such cases, some register contents are temporarily moved to local memory.
    • Local memory pertains to memory that is private to each thread. This memory type is used to store local variables specific to a thread’s execution context.
    • Local memory is physically stored in global memory. However, it is cached on-chip for fast access.
    • Local memory is not intended for large data storage but rather for temporary, per-thread data that cannot be accommodated in registers.
    • Access to local memory is slower than shared memory but faster than accessing uncached global memory directly.
  • Dynamic allocation of registers allows for adjustments in the number of registers allocated per thread, based on the actual usage patterns observed during execution.
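  A minimal sketch of inspecting the compiler-determined per-thread register usage at run time with cudaFuncGetAttributes(), assuming a hypothetical kernel named MyKernel:

    #include <cstdio>

    __global__ void MyKernel( float* data ) { /* ... */ }

    void PrintKernelResourceUsage()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes( &attr, MyKernel );        // filled in by the CUDA runtime
        printf( "registers per thread        : %d\n", attr.numRegs );
        printf( "local memory per thread     : %zu bytes\n", attr.localSizeBytes );
        printf( "static shared memory / block: %zu bytes\n", attr.sharedSizeBytes );
    }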

Thread Block

  • Thread blocks are assigned to SMs, and multiple blocks can be assigned to one SM.
  • The choice of block size, that is, the number of threads per block, is a key factor in the performance tuning of GPU applications.
  • The number of thread blocks that can be assigned to one SM depends on the program, considering factors such as the number of registers each thread requires and the total number of registers in the SM.
  • Each block is executed by only one SM, and it remains on that SM until its execution is complete. Within an SM, multiple blocks can be active at the same time, depending on the resources required by each block and the total resources available on the SM. The blocks are executed concurrently, and the SM dynamically switches between different warps to hide latencies.
  • Even if some registers remain in an SM after thread blocks have been allocated to it, they are left unused when another whole thread block cannot be accommodated, since a thread block cannot be broken into smaller parts.
  • If the number of threads in a thread block is not divisible by 32, the last warp contains fewer than 32 threads, so it is more efficient for the number of threads in a thread block to be a multiple of 32.
  • A thread block corresponds to a work group in an OpenGL compute shader. The corresponding built-in variables between CUDA and OpenGL compute shaders are as follows, with an indexing sketch after the list.
    • threadIdx ≡ gl_LocalInvocationID
    • blockIdx ≡ gl_WorkGroupID
    • blockDim ≡ gl_WorkGroupSize
    • gridDim ≡ gl_NumWorkGroups
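  A minimal indexing sketch using these built-in variables, with a hypothetical kernel and array names:

    __global__ void Scale( float* data, int n, float factor )
    {
        // Global thread index, analogous to gl_GlobalInvocationID.x in an OpenGL compute shader.
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;          // guards the threads of a partially filled last block
    }

    // Example launch covering n elements with 256 threads per block:
    // Scale<<<(n + 255) / 256, 256>>>( d_data, n, 2.0f );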

Constant Memory and Texture Memory

  • Constant memory is read-only over a kernel execution.
    • cudaMemcpyToSymbol() should be used for copying to constant memory, while cudaMemcpy() is used for global memory; see the sketch after this list.
    • NVIDIA hardware can broadcast a single memory read to each half-warp which is a group of 16 threads. If every thread in a half-warp requests data from the same address in constant memory, only a single read request is generated and the data can be broadcast to the other threads.
    • The hardware can aggressively cache the constant data since the memory is read-only. There is no additional memory traffic when half-warps request the same address once it has been read for the first time.
    • A downside appears when the 16 threads in a half-warp request different addresses: the 16 reads are serialized, taking 16 times as long to issue, whereas reads from global memory could have been issued at the same time.
  • Texture memory can be used for general purpose computing. Texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality.
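  A minimal sketch of declaring constant memory and filling it with cudaMemcpyToSymbol(), assuming a hypothetical kernel and a small coefficient table:

    __constant__ float coeffs[16];             // read-only during kernel execution

    __global__ void Apply( const float* in, float* out, int n )
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Every thread reads the same coeffs entries, so the reads can be broadcast and cached.
        if (i < n) out[i] = coeffs[0] * in[i] + coeffs[1];
    }

    void UploadCoefficients( const float* hostCoeffs )
    {
        // cudaMemcpyToSymbol() targets constant memory; cudaMemcpy() targets global memory.
        cudaMemcpyToSymbol( coeffs, hostCoeffs, 16 * sizeof(float) );
    }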

Streams

  • The standard allocation functions such as malloc() allocate pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory, or pinned memory.
    • The OS guarantees that it will never page this pinned memory out to disk, which ensures its residency in physical memory.
  • The CUDA driver uses DMA (Direct Memory Access) to copy data to or from the host. Since DMA copies proceed without CPU intervention, the CPU could simultaneously be paging the data out or relocating its physical address by updating the OS’s page tables. This possibility is why using pinned memory for a DMA copy is essential.
    • With pageable memory, the copy happens twice, first from a pageable system buffer to a temporary pinned memory and then from the pinned memory to the GPU.
    • With pinned memory, the copy happens only once avoiding the cost of the transfer between pageable and pinned memory.
    • Since pinned memory can never be swapped out to disk, allocating too much of it makes the system run out of physical memory much faster.
  • A stream is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
    • When no stream is specified, the default stream, which is also called the null stream, is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device.
    • No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation in any stream on the device will begin.
  • In the default stream, data transfers using cudaMemcpy() are blocking, while the kernel launch is asynchronous.
    cudaMemcpy( ..., cudaMemcpyHostToDevice );
    kernel<<<...>>>( ... );
    cudaMemcpy( ..., cudaMemcpyDeviceToHost );
    

    Since the host-to-device data transfer is synchronous, the CPU thread will not reach the kernel call until the transfer is complete. Once the kernel is issued, the CPU thread moves on to the device-to-host transfer, but that transfer cannot begin until the kernel completes, because operations in the default stream execute in order on the device. The asynchronous behavior of kernel launches makes overlapping device and host computation very simple.

    cudaMemcpy( ..., cudaMemcpyHostToDevice );
    kernel<<<...>>>( ... );
    CpuFunction( ... );
    cudaMemcpy( ..., cudaMemcpyDeviceToHost );
    

    As soon as kernel() is launched on the device, the CPU thread executes CpuFunction(), overlapping its execution on the CPU with the kernel execution on the GPU.

  • To issue a data transfer in a non-default stream, the cudaMemcpyAsync() function is required, which takes a stream identifier.
  • cudaStreamSynchronize() blocks the host thread until all previously issued operations in the specified stream have completed, while cudaDeviceSynchronize() blocks the host thread until all previously issued operations on the device have completed.
  • Kernel execution can be overlapped with data transfers if the deviceOverlap field of the cudaDeviceProp struct is set.
    • The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams. In addition, the host memory involved in the data transfer must be pinned memory; see the sketch after this list.
    • The hardware has independent engines that handle copies and kernel executions, and the number of engines varies depending on the hardware.
    • The general guideline is to issue work breadth-first, or round-robin, across streams, but it is better to check first what kinds of engines the hardware has. For example, the C1060 has a single copy engine and a single kernel engine, while the C2050 has two copy engines, one for host-to-device transfers and another for device-to-host transfers, as well as a single kernel engine.
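  A minimal sketch of overlapping transfers and kernel execution in two non-default streams, assuming a hypothetical kernel, pinned host memory, and an element count divisible by 512:

    const int N = 1 << 20;
    float *h_a, *d_a;
    cudaHostAlloc( (void**)&h_a, N * sizeof(float), cudaHostAllocDefault );   // pinned host memory
    cudaMalloc( (void**)&d_a, N * sizeof(float) );

    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate( &streams[i] );

    const int half = N / 2;
    for (int i = 0; i < 2; ++i) {
        const int offset = i * half;
        // Each chunk's copies and kernel are issued in its own stream, so the second chunk's
        // host-to-device copy can overlap with the first chunk's kernel execution.
        cudaMemcpyAsync( d_a + offset, h_a + offset, half * sizeof(float),
                         cudaMemcpyHostToDevice, streams[i] );
        kernel<<<half / 256, 256, 0, streams[i]>>>( d_a + offset );
        cudaMemcpyAsync( h_a + offset, d_a + offset, half * sizeof(float),
                         cudaMemcpyDeviceToHost, streams[i] );
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize( streams[i] );          // wait for both streams to drain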

Tips

  • Performance experiments show that optimal performance is achieved when the number of thread blocks is exactly twice the number of multiprocessors the GPU contains.
  cudaDeviceProp property;
  cudaGetDeviceProperties( &property, 0 );
  const int blocks = property.multiProcessorCount * 2;
  kernel<<<blocks, ...>>>( ... );
  • Ironically, although atomic operations cause performance degradation, alleviating the slowdown sometimes involves using more atomics, not fewer. The core problem is not the use of atomics so much as the fact that too many threads are competing for access to a relatively small number of memory addresses.
    • Let threads compete for addresses in shared memory rather than in global memory, and merge the per-block results into global memory afterwards, as in the sketch below.
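  A minimal sketch of this two-stage atomics pattern, assuming a hypothetical 256-bin histogram over a byte buffer and 256 threads per block:

    __global__ void Histogram( const unsigned char* data, int size, unsigned int* histo )
    {
        __shared__ unsigned int temp[256];
        temp[threadIdx.x] = 0;                 // one bin per thread with 256 threads per block
        __syncthreads();

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        const int stride = blockDim.x * gridDim.x;
        while (i < size) {
            atomicAdd( &temp[data[i]], 1 );    // contention stays within the block, in shared memory
            i += stride;
        }
        __syncthreads();

        // One atomic per bin per block merges the partial histogram into global memory.
        atomicAdd( &histo[threadIdx.x], temp[threadIdx.x] );
    }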
  • It is possible for the GPU to directly access CPU data using zero-copy memory if the canMapHostMemory field of the cudaDeviceProp struct is set. The performance implications differ between discrete GPUs and integrated GPUs.
    • Integrated GPUs are graphics processors built into a system’s chipset and usually share regular system memory with the CPU. For integrated GPUs, the use of zero-copy memory is always a performance win because the memory is physically shared with the host anyway.
    • In cases where inputs and outputs are used exactly once, a performance enhancement can be observed when using zero-copy memory with a discrete GPU. But since the zero-copy memory is not cached on the GPU, in situations where the memory gets read multiple times, a large penalty is paid that could be avoided by simply copying the data to the GPU first.
    • The integrated field of the cudaDeviceProp struct is set if the device is an integrated GPU and is not set otherwise.
    • cudaHostAlloc() has a flag, cudaHostAllocMapped, for this. cudaHostAlloc() returns a CPU pointer, and the GPU has a different virtual address space than the CPU, so a valid GPU pointer for this memory should be obtained by calling cudaHostGetDevicePointer(), as in the sketch after this list.
  • Pages can appear pinned to only a single CPU thread, the one that allocated them. If the pointer to the pinned memory is shared between threads, the other threads will see the buffer as standard, pageable data. As a remedy, the pinned memory can be allocated as portable.
    • cudaHostAlloc() has a flag, cudaHostAllocPortable, for this.
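  A minimal sketch of mapped (zero-copy), portable pinned memory, assuming a hypothetical kernel and buffer size:

    float *h_buf, *d_buf;
    const size_t bytes = 1024 * sizeof(float);

    // Pinned memory that is mapped into the GPU's address space and visible as pinned to all CPU threads.
    cudaSetDeviceFlags( cudaDeviceMapHost );               // must be called before the CUDA context is created
    cudaHostAlloc( (void**)&h_buf, bytes, cudaHostAllocMapped | cudaHostAllocPortable );

    // The GPU-side pointer differs from the CPU pointer because the address spaces differ.
    cudaHostGetDevicePointer( (void**)&d_buf, h_buf, 0 );

    kernel<<<4, 256>>>( d_buf );                           // the GPU reads and writes the host buffer directly
    cudaDeviceSynchronize();                               // ensure the kernel has finished before the CPU reads h_buf

    cudaFreeHost( h_buf );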

References

[1] Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming (1st. ed.). Addison-Wesley Professional.

[2] Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. 2018. Real-Time Rendering, Fourth Edition (4th. ed.). A. K. Peters, Ltd., USA.

[3] How to Optimize Data Transfers in CUDA C/C++

[4] How to Overlap Data Transfers in CUDA C/C++

[5] Gareth Morgan Thomas. 2024. Advanced GPU Assembly Programming: A Technical Reference for NVIDIA and AMD Architectures. Amazon Digital Services LLC - Kdp

[6] Tor M. Aamodt, Wilson Wai Lun Fung, and Timothy G. Rogers. 2018. General-purpose Graphics Processor Architectures. Morgan & Claypool Publishers.


© 2025. All rights reserved.