K-d Tree on GPU

Result

A K-d Tree is one of the essential structures for a realistic rendering such as photon mapping. Since most rendering logic works on the GPU, building a K-d tree on the GPU is more efficient and convenient, so searching with a radius or finding K-nn can be simple and fast. I found a multi-threaded version working on the CPU[1, 2] and the other version using CUDA[3], and rewrote these to work with OpenGL compute shaders.

First, the target vertices, which are called coordinates in [1, 2] should be sorted by axis. In this merge sorting, the process runs by block, especially thread block in CUDA. [3] uses 1,0241,024 unit as a shared size for most kernels and the coordinates is sorted by 1,0241,024 using 512512 threads per block. The algorithm behind this sorting is based on [4] introducing the concept of rankrank. This rankrank generated is used to merge the sorted adjacent two blocks, and this merge process is repeated until all elements are sored. However, some duplicates in coordinates might be included and these should be removed before going further with reference to [5]. __ballot_sync and __popc functions in CUDA and subgroupBallot and bitCount functions in OpenGL shading language with GL_KHR_shader_subgroup_ballot enabled are useful to do this. Partitioning is the next step for building a K-d tree. [3] explains two compaction algorithms that each running warp produces the same amount of output regardless of the median selected, which leads to the load being balanced over all warps. The total number of threads is pre-defined, so if they are not enough to partition the current depth level, another version of partitioning is executed. While constructing a K-d tree, the parent index is also stored other than the children indices to use when searching queries.

In terms of searching, there are two searching ways: searching with radius and finding K-nearest neighbor. These algorithms are intuitive when implemented on the CPU, but it is a little bit complicated to work with GPU kernels. Parallel tree traversal is a hassle because the recursion causes the stack overflow and the data structures like stack or queue require a lot of memory whether DFS or BFS is adopted as well and these structures are not built-in supported on the GPU. [5] suggests a stack-free searching algorithm that requires the parent, previous, and current indices only. This algorithm can be extended to a lot of similar situations since it explains the base logic to traverse trees.

References

[1] Building a Balanced k-d Tree in O(kn log n) Time

[2] https://github.com/chezruss/kd-tree

[3] https://github.com/johnarobinson77/KdTreeGPU

[4] N. Satish, M. Harris and M. Garland, “Designing efficient sorting algorithms for manycore GPUs,” 2009 IEEE International Symposium on Parallel & Distributed Processing, Rome, Italy, 2009, pp. 1-10

[5] Billeter, Markus & Olsson, Ola & Assarsson, Ulf. (2009). Efficient stream compaction on wide SIMD many-core architectures. Proceedings of the Conference on High Performance Graphics. 159-166.

[6] https://github.com/ingowald/cudaKDTree

Code

https://github.com/emoy-kim/KdTreeOnGPU


© 2025. All rights reserved.