CUDA Programming

Thread Block and Shared Memory

  • GPU is composed of multiple SM(Streaming Multirocessor).
  • An SM includes one shared memory and multiple cores(CUDA Cores, ALUs, or Streaming Processors).
  • Shared memory is on-chip, and works like L1-cache.
  • The cores in an SM are grouped into processing blocks. If an SM consists of 4 procesing blocks and each block has 32 cores, the SM can execute 4 warps at the same time.
  • Threads can be executed at the same time as much as the number of cores. However, the number of resident threads which are assignable in an SM is much higher than the number of cores. When an instruction to be executed by a warp needs long-latency operation, another resident warp can be selected to hide this latency. The architecture has zero overhead for switching out a currently executing warp for a warp that is ready for execution.
  • An SM can handle different types of warps such as vertices, pixels, and primitives at the same time.
  • Thread blocks are allocated to SM, and multiple blocks can be assigned to one SM.
  • The number of thread blocks that can be assigned to one SM depends on the program considering the number of registers that should be used in the program and the total number of registers in an SM.
  • Even though there are the rest of the registers in an SM after allocating thread blocks to it, it leaves empty if the whole thread block cannot be accommodated since a thread block cannot be broken into smaller parts.
  • A warp is a group of 32 threads that execute the same instruction sequences together.
  • If the number of threads in a thread block is not divisible by 32, the number of threads in the last warp is less than 32. So it is more efficient for the number of threads in a thread block to be a multiple of 32.
  • Branch divergence such as if-statements causes a bad performance because the warp can no longer execute the same instruction and must be divergent.
    • Each warp has a SIMT stack composed of multiple entries.
    • Each entry has PC(Program Counter) and 32-bit space.
    • This 32-bit space is for deciding whether each thread in a warp should execute the instruction pointed by PC.
  • Threads in a thread block can share data through some shared memory and can be synchronized by calling __syncthreads(). Besides, a warp can be synchronized by __syncwarp().
  • A thread block corresponds to a work group in OpenGL. Correspondent built-in variables between CUDA and OpenGL compute shader are as follows.
    • threadIdx \equiv gl_LocalInvocationID
    • blockIdx \equiv gl_WorkGroupID
    • blockDim \equiv gl_WorkGroupSize
    • gridDim \equiv gl_NumWorkGroups

Constant Memory and Texture Memory

  • Constant memory is read-only over a kernel execution.
    • cudaMemcpyToSymbol() should be used for copying to the constant memory while cudaMemcpy() is for the global memory.
    • NVIDIA hardware can broadcast a single memory read to each half-warp which is a group of 16 threads. If every thread in a half-warp requests data from the same address in constant memory, only a single read request is generated and the data can be broadcast to the other threads.
    • The hardware can aggressively cache the constant data since the memory is read-only. There is no additional memory traffic when half-warps request the same address once it has been read at first.
    • A downside is when all 16 threads in a half-warp request different data. The 16 different reads get serialized, taking 16 times the amount of time to place the request. Rather the global memory could issue the request at the same time.
  • Texture memory can be used for general purpose computing. Texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality.

Streams

  • The standard allocation functions such as malloc() allocate pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory, or pinned memory.
    • The OS guarantees that it will never page this pinned memory out to disk, which ensures its residency in physical memory.
  • CUDA driver uses DMA(Direct Memory Access) to copy data to or from the host. Since DMA copies proceed without CPU intervention, CPU could be simultaneously paging data out or relocating their physical address by updating the OS’s pagetables. This possibility implies that using pinned memory for a DMA copy is essential.
    • With pageable memory, the copy happens twice, first from a pageable system buffer to a temporary pinned memory and then from the pinned memory to the GPU.
    • With pinned memory, the copy happens only once avoiding the cost of the transfer between pageable and pinned memory.
    • Since pinned memory can never be swapped out to disk, the system will run out of memory much faster.
  • A stream is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
    • When no stream is specified, the default stream, which is also called the null stream, is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device.
    • No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation in any stream on the device will begin.
  • In the default stream, data transfers using cudaMemcpy() are blocking, while the kernel launch is asynchronous.
    cudaMemcpy( ..., cudaMemcpyHostToDevice );
    kernel<<<...>>>( ... );
    cudaMemcpy( ..., cudaMemcpyDeviceToHost );
    

    Since the host-to-device data transfer is synchronous, the CPU thread will not reach the kernel call until the host-to-device transfer is complete. Once the kernel is issued, the CPU thread moves to the device-to-host transfer, but the transfer cannot begin due to the device-side order of execution. The asynchronous behavior of kernel launches makes overlapping device and host computation very simple.

    cudaMemcpy( ..., cudaMemcpyHostToDevice );
    kernel<<<...>>>( ... );
    CpuFunction( ... );
    cudaMemcpy( ..., cudaMemcpyDeviceToHost );
    

    As soon as the kernel() is launched on the device, the CPU thread executes CpuFunction(), overlapping its execution on the CPU with the kernel execution on the GPU.

  • To issue a data transfer to a non-default stream, cudaMemcpyAsync() function is required, which is taking a stream identifier.
  • cudaStreamSynchronize() blocks the host thread until all previously issued operations in the specified stream have completed, while cudaDeviceSynchronize() does the host code until all previously issued operations on the device have completed.
  • Kernel execution can be overlapped with data transfers if deviceOverlap field of a cudaDeviceProp struct is set.
    • The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams. Besides, the host memory involved in the data transfer must be pinned memory.
    • The hardware has independent engines that handle copies and kernel executions, and the number of engines varies depending on the hardware.
    • The general guideline involves a breadth-first, or round-robin, but it is better to check first if the hardware has what kind of engines. For example, the C1060 has a single copy engine and a single kernel engine, while the C2050 has two copy engines, one for host-to-device transfers and another for device-to-host transfers, as well as a single kernel engine.

Tips

  • By running some performance experiments, optimal performance is achieved when the number of thread blocks is exactly twice the number of multiprocessors GPU contains.
  cudaDeviceProp property;
  cudaGetDeviceProperties( &property, 0 );
  const int blocks = property.multiProcessorCount * 2;
  kernel<<<blocks, ...>>>( ... );
  • Ironically, despite that the atomic operations cause performance degradation, sometimes alleviating the slowdown actually involves using more atomics, not fewer. The core problem is not the use of atomics so much as the fact that too many threads are competing for access to a relatively small number of memory addresses.
    • Consider shared memory for thread competition, not global memory.
  • It is possible for the GPU to directly access the CPU data using zero-copy memory if canMapHostMemory field of a cudaDeviceProp struct is set. The performance is different for discrete GPUs and integrated GPUs.
    • Integrated GPUs are graphics processors built into a system’s chipset and usually share regular system memory with the CPU. For integrated GPUs, the use of zero-copy memory is always a performance win because the memory is physically shared with the host anyway.
    • In cases where inputs and outputs are used exactly once, a performance enhancement can be observed when using zero-copy memory with a discrete GPU. But since the zero-copy memory is not cahced on the GPU, in situations where the memory gets read multiple times, a large penalty is paid that could be avoided by simply copying the data to the GPU first.
    • integrated field of a cudaDeviceProp struct is set if the device is an integrated GPU and not otherwise.
    • cudaHostAlloc() has a flag, cudaHostAllocMapped, for this. cudaHostAlloc() returns the CPU pointer and the GPU has a different virtual memory space than the CPU, so a valid GPU pointer for this memory should be obtained by calling cudaHostGetDevicePointer().
  • Pages can appear pinned to a single CPU thread only. If the pointer to the pinned memory is shared between threads, the other threads will see the buffer as standard, pageable data. As a remedy to this problem, this pinned memory can be allocated as portable.
    • cudaHostAlloc() has a flag, cudaHostAllocPortable, for this.

References

[1] Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming (1st. ed.). Addison-Wesley Professional.

[2] Tomas Akenine-Mller, Eric Haines, and Naty Hoffman. 2018. Real-Time Rendering, Fourth Edition (4th. ed.). A. K. Peters, Ltd., USA.

[3] How to Optimize Data Transfers in CUDA C/C++

[4] How to Overlap Data Transfers in CUDA C/C++

[5] [GPGPU Series 1] Index of Articles


© 2025. All rights reserved.