Fragment Shader Thread Group: Quads

Quads


Before a fragment shader executes, the fragments produced by rasterizing a primitive are arranged into thread groups called quads. Because a quad is a block of $2 \times 2$ threads, fragments that the primitive does not cover can also be executed. These fragments are masked out after execution, meaning their results are discarded without updating the framebuffer. OpenGL provides the built-in variable gl_HelperInvocation to check whether the running fragment is masked out: it is true if the fragment is masked out and false otherwise. In addition, most GPUs execute threads in batches, such as a warp on NVIDIA GPUs, which consists of 32 threads. In that case, at most 8 quads effectively run at the same time, as shown in the figure above.
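To make the grouping concrete, here is a minimal Python sketch of a toy model, not of how any real rasterizer is implemented. The function name `quads_for` and the three-pixel coverage set are made up for illustration. It groups covered pixels into $2 \times 2$ quads, marks uncovered lanes as helper invocations (the lanes for which gl_HelperInvocation would be true), and computes how many quads fit into a 32-thread warp.

```python
# Toy model: group covered pixels into 2x2 quads and mark helper invocations.
QUAD = 2          # a quad is 2x2 fragments
WARP_SIZE = 32    # threads per NVIDIA warp, as stated in the text

def quads_for(covered):
    """Map each touched quad origin to its four (x, y, is_helper) lanes."""
    quads = {}
    for (x, y) in covered:
        # Snap the pixel to the top-left corner of its quad.
        origin = (x - x % QUAD, y - y % QUAD)
        if origin not in quads:
            ox, oy = origin
            quads[origin] = [
                (ox + dx, oy + dy, (ox + dx, oy + dy) not in covered)
                for dy in range(QUAD)
                for dx in range(QUAD)
            ]
    return quads

# Hypothetical coverage: three pixels in a row.
covered = {(0, 0), (1, 0), (2, 0)}
quads = quads_for(covered)

# Count the lanes that run but are not covered by the primitive.
helpers = sum(is_helper for lanes in quads.values() for (_, _, is_helper) in lanes)
launched = sum(len(lanes) for lanes in quads.values())
quads_per_warp = WARP_SIZE // (QUAD * QUAD)

print(len(quads), launched, helpers, quads_per_warp)
```

In this toy case three covered pixels touch two quads, so eight lanes are launched and five of them are helpers whose results would be discarded.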

This arrangement is useful when GLSL functions such as dFdx or dFdy are executed. The current fragment can peek at the registers of an adjacent fragment thread in the same quad and subtract to compute partial derivatives. A caveat is that calling dFdx or dFdy inside a conditional branch may give unexpected results: if a neighboring thread in the quad did not take the branch, the active thread reads that inactive thread's register, which still holds an old value.
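The branch caveat can be illustrated with a small Python simulation of one quad row. This is only a sketch under assumptions: the register contents, the function `f`, and the helper `dfdx_row` are hypothetical, and real hardware may leave any value (or none that is well defined) in the inactive lane.

```python
def dfdx_row(regs):
    """Horizontal difference across one quad row: right lane minus left lane."""
    return regs[1] - regs[0]

def f(x):
    """Quantity being differentiated; its true slope is 3 per pixel."""
    return 3.0 * x

xs = [10, 11]  # x coordinates of the two lanes in one quad row

# Uniform control flow: both lanes write f(x) before the difference is taken.
regs = [f(x) for x in xs]
uniform = dfdx_row(regs)

# Divergent branch: only lane 0 takes the branch and writes f(x).
# Lane 1 is inactive, so its register keeps an unrelated old value.
regs = [0.0, 7.0]          # hypothetical stale register contents
active = [True, False]
for lane, x in enumerate(xs):
    if active[lane]:
        regs[lane] = f(x)
divergent = dfdx_row(regs)

print(uniform, divergent)
```

With both lanes active the difference matches the true slope, while in the divergent case lane 0 subtracts its fresh value from lane 1's stale one and gets a meaningless result. A common way to avoid this is to compute the derivative before the branch and use the stored value inside it.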

Meanwhile, dFdx and dFdy are technically based on numerical differentiation. How is a fragment at the boundary of a quad handled if only forward or only backward differencing is used? Both schemes are applied. For dFdx, the left fragment uses forward differencing while the right fragment uses backward differencing, which avoids having to access the next quad. Both differences use the same pair of values, so the two fragments get the same result.
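A short Python check shows why the two lanes agree. It is a sketch with a hypothetical function `f` and hypothetical pixel coordinates, using a unit pixel spacing: the forward difference at the left lane and the backward difference at the right lane read exactly the same two samples.

```python
def f(x):
    """Hypothetical per-fragment quantity; its exact derivative is 2 * x."""
    return x * x

def forward_diff(f, x):
    """f(x + 1) - f(x), used by the left lane of a quad row."""
    return f(x + 1) - f(x)

def backward_diff(f, x):
    """f(x) - f(x - 1), used by the right lane of a quad row."""
    return f(x) - f(x - 1)

x_left, x_right = 4, 5  # adjacent lanes in the same quad row

d_left = forward_diff(f, x_left)     # reads f(5) and f(4)
d_right = backward_diff(f, x_right)  # reads f(5) and f(4) as well

print(d_left, d_right)
```

Both lanes report 9, even though the exact derivatives at the two positions are 8 and 10. The quad-based result is therefore an approximation shared by the pair rather than a per-fragment derivative.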

Reference

[1] https://on-demand.gputechconf.com/gtc/2016/presentation/s6138-christoph-kubisch-pierre-boudier-gpu-driven-rendering.pdf

[2] https://stackoverflow.com/questions/16365385/explanation-of-dfdx

[3] https://stackoverflow.com/questions/39579150/difference-between-dfdxfine-and-dfdxcoarse


© 2023. All rights reserved.