NVIDIA Programming Tips

General

  • Prefer the lowest pixel shader version that does the job.
  • Prefer the lowest data precision that does the job.
    • The 12-bit fixed-point format fixed is the fastest and should be used for low-precision calculations, such as color computation.
    • The 16-bit floating-point format half delivers higher performance than the 32-bit floating-point format float.
    • Use float only when the highest possible accuracy is needed.
  • Save computations by using algebra.
    • For example, dot( normalize( N ), normalize( L )) computes (N/|N|) dot (L/|L|), which requires two expensive reciprocal square root computations.
    • Rewriting it as (N/|N|) dot (L/|L|) = (N dot L)/(|N|*|L|) = (N dot L)/sqrt( (N dot N)*(L dot L) ) leaves a single reciprocal square root.
  • Do not write overly generic library functions; the generality often brings along computation that a particular call does not need.
  • Precompute uniforms on the CPU before running a shader whenever possible.
    • If the shader computes A * 3.0 and A is a uniform, premultiply A by 3.0 on the CPU instead.
    • Inverse and transpose matrices can be precomputed on the CPU because they do not need to be recalculated per vertex or per fragment.
  • Do not use uniforms for constants like 0, 1, and 255. Doing so makes it harder for the compiler to distinguish between constants and shader parameters.
  • Balance the vertex and pixel shaders. Look for opportunities to move calculations to the vertex shader if the pixel shader seems to be the bottleneck.
  • Use mipmapping to achieve better image quality, improved texture cache behavior, and higher performance.
  • Replace complex functions, such as log and exp, with texture lookups. Textures are a great way to encode complex functions, so think of them as multidimensional arrays.
  • Rendering only depth or stencil values runs at double speed. Enabling this special rendering mode requires all of the following.
    • color writes are disabled.
    • the active depth-stencil surface is not multisampled.
    • texkill has not been applied to any fragments.
      • texkill (Direct X) - cancels rendering of the current pixel if any of the first three components (uvw) of the texture coordinates is less than zero.
    • depth replace has not been applied to any fragments.
    • alpha test is disabled.
    • no color key is used in any of the active textures.
      • color key (Direct X) - is used for alpha blending.
    • user clip planes are disabled.
  • Early-Z (a.k.a z-cull) optimization improves performance by avoiding the rendering of occluded surfaces. Follow the rules below; violating them can invalidate the data the GPU uses for this early optimization and can disable z-cull until the depth buffer is cleared again.
    • do not create triangles with holes. that is, avoid alpha test or texkill.
    • do not modify depth. that is, allow the GPU to use the interpolated depth value.
  • Lay down depth first: use double-speed depth rendering as a first pass, and z-cull will then automatically cull the fragments that are not visible during the full shading pass.
  • Multi-GPU programming tips with OpenGL.
    • limit rendering to a single window.
    • ask for a pixel format with the PFD_SWAP_EXCHANGE flag set instead of PFD_SWAP_COPY. the swap exchange flag implies that the application does not rely on the back buffer content after a SwapBuffers() is performed.
    • rendering to the front buffer requires heavy synchronization and should be avoided at all costs.
    • limit pbuffer usage. rendering to a pbuffer requires the driver to broadcast the rendering to both GPUs because the rendered result may be used by either GPU later on.
    • render directly into textures instead of using glCopyTexSubImage(), which forces the textures to be updated on both GPUs.
    • Vertex Buffer Objects (VBO) tips:
      • load VBO working set first, or textures may block faster memory until the working set stabilizes.
      • avoid huge batches when drawing geometry.
      • use unsigned short for indices.
      • use glDrawRangeElements() instead of glDrawElements().
      • use the correct VBO usage hints for the type of data.
    • limit the number of textures in use at a given time and do not stream in new textures too often.
    • avoid rendering to only a section of the frame using methods such as glViewport() or glScissor().
    • avoid reading back the color or depth buffers using glReadPixels() and never use glCopyPixels() because these cause pipeline stalls and inhibit parallelism.
    • never call glFinish() because this does not return until all pending OpenGL commands are complete.
  • Depth Bounds Test (DBT) lets the programmer enable an additional criterion for discarding a pixel after the scissor test and before alpha testing. Unlike the depth test, DBT has no dependency on the fragment’s window-space depth value; it compares the depth value already stored in the depth buffer at that pixel against a min/max range. The min/max values are clamped to [0..1].
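
The algebraic rewrite of dot( normalize( N ), normalize( L )) from the list above can be checked numerically. The following is a minimal CPU-side sketch in Python, not shader code; the helper names (dot, normalize, cos_angle_naive, cos_angle_fused) and the sample vectors are made up for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length; costs one reciprocal square root."""
    inv_len = 1.0 / math.sqrt(dot(v, v))
    return [x * inv_len for x in v]

def cos_angle_naive(n, l):
    # Two normalize() calls, so two reciprocal square roots.
    return dot(normalize(n), normalize(l))

def cos_angle_fused(n, l):
    # One reciprocal square root over the product of the squared lengths.
    return dot(n, l) / math.sqrt(dot(n, n) * dot(l, l))

N = [0.0, 3.0, 4.0]  # |N| = 5
L = [1.0, 2.0, 2.0]  # |L| = 3
naive = cos_angle_naive(N, L)
fused = cos_angle_fused(N, L)
print(naive, fused)  # both approximate 14/15
```

Both forms agree up to floating-point rounding, so the fused form is a drop-in replacement whenever only the cosine is needed and the normalized vectors themselves are not reused.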

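The tip about replacing functions such as log and exp with texture lookups can be modeled on the CPU as a table with linear interpolation between neighboring entries, which is roughly what a linearly filtered 1D texture does. This is only a sketch: the table size, the input range, and the names EXP_TABLE and exp_lookup are assumptions chosen for illustration.

```python
import math

TABLE_SIZE = 256          # assumed number of texels in the 1D "texture"
X_MIN, X_MAX = 0.0, 4.0   # assumed input range covered by the table

# Precompute exp() once, as a texture would be filled at load time.
EXP_TABLE = [
    math.exp(X_MIN + (X_MAX - X_MIN) * i / (TABLE_SIZE - 1))
    for i in range(TABLE_SIZE)
]

def exp_lookup(x):
    """Approximate exp(x) by a clamped, linearly interpolated table fetch."""
    x = min(max(x, X_MIN), X_MAX)                    # clamp addressing
    t = (x - X_MIN) / (X_MAX - X_MIN) * (TABLE_SIZE - 1)
    i = min(int(t), TABLE_SIZE - 2)                  # left neighbor index
    frac = t - i                                     # blend weight
    return EXP_TABLE[i] * (1.0 - frac) + EXP_TABLE[i + 1] * frac

samples = [0.0, 0.5, 1.3, 2.71, 4.0]
worst = max(abs(exp_lookup(x) - math.exp(x)) / math.exp(x) for x in samples)
print(worst)  # worst relative error over the sample points
```

A larger table or a narrower input range lowers the error; on the GPU the trade-off is texture memory and cache pressure against arithmetic cost.
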
GeForce 6 & 7 Series (Shader Model 3.0)

  • Use dynamic branching to improve performance by skipping unnecessary calculations in loops or conditionals.
  • Instancing lets the programmer submit a single draw call that renders many objects, sharing the same data for the object shape and varying each instance through per-instance streams.
  • Multiple Render Targets (MRTs) allow a pixel shader to write out data to up to four different targets. Note that MRTs restrict other GPU features.
    • hardware-accelerated antialiasing is inapplicable to MRT render targets.
    • all render targets must have the same width, height, and bit depth.
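    • the post-pixel-shader blend operations (alpha blending, alpha testing, fogging, and dithering) are available for MRT only when the hardware reports support for them (in Direct3D 9, through the D3DPMISCCAPS_MRTPOSTPIXELSHADERBLENDING cap).
  • Use write masks and swizzles, which can help the compiler identify vectorization and scheduling opportunities.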
  • Use partial precision whenever possible. GeForce 6 & 7 Series have a special free fp16 normalize unit in the shader, which allows 16-bit floating-point normalization to happen very efficiently in parallel with other computations. Also, partial precision helps to reduce register pressure.
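
The MRT restrictions listed above are easy to check before creating the targets. The sketch below encodes only the rules stated in this section (at most four targets, no antialiasing, identical width, height, and bit depth); RenderTarget and mrt_violations are hypothetical names, not part of Direct3D or OpenGL.

```python
from dataclasses import dataclass

MAX_MRT_TARGETS = 4  # "up to four different targets" per the list above

@dataclass(frozen=True)
class RenderTarget:
    """Hypothetical render target descriptor, not a real API type."""
    width: int
    height: int
    bit_depth: int
    antialiased: bool = False

def mrt_violations(targets):
    """Return the list of MRT rules that the given targets break."""
    problems = []
    if not 1 <= len(targets) <= MAX_MRT_TARGETS:
        problems.append("target count must be between 1 and 4")
    if any(t.antialiased for t in targets):
        problems.append("antialiasing is not applicable to MRT render targets")
    shapes = {(t.width, t.height, t.bit_depth) for t in targets}
    if len(shapes) > 1:
        problems.append("all targets must share width, height, and bit depth")
    return problems

matching = [RenderTarget(1024, 768, 64), RenderTarget(1024, 768, 64)]
mixed = [RenderTarget(1024, 768, 64), RenderTarget(512, 384, 32)]
print(mrt_violations(matching))  # no violations
print(mrt_violations(mixed))     # size and bit depth mismatch
```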

GeForce FX

  • The GeForce FX can natively handle 512 pixel instructions per pass in Direct3D and 1,024 in OpenGL. Quadro FX cards can handle 2,048 pixel instructions per pass.
  • The ARB_fragment_program extension requires 24-bit floating-point precision at a minimum, by default. Various flags can be put at the top of the ARB_fragment_program source code.
    • NV_fragment_program allows the half and fixed formats.
    • ARB_precision_hint_fastest makes the Unified Compiler determine the appropriate precision for each shader operation at run-time.
    • ARB_precision_hint_nicest forces the entire program to run in float precision.
  • half can exactly represent every integer only from -2,048 to 2,048, and at that magnitude no fractional bits are left over.
    • for example, 4,096 and 4,097 are represented by the same 16-bit floating-point number, so subtracting one from the other yields zero.
    • the workaround is to move matrix and vector subtraction operations into the vertex shader.
    • vertex shaders are required at a minimum to support float, so they can easily handle large world and view spaces.
    • in general, perform constant calculations on the CPU, linear calculations in the vertex shader, and nonlinear calculations in the pixel shader.
  • On GeForce FX hardware, floating-point render targets do not support blending, and floating-point textures larger than 32 bits per texel do not support mipmapping or filtering.
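
The integer limits of half described above can be reproduced on the CPU. This sketch assumes that half behaves like the IEEE 754 16-bit format, which Python's struct module exposes through the "e" format character; to_half is a helper name made up for this example.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest 16-bit float and convert back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Every integer in [-2048, 2048] survives the round trip unchanged.
all_exact = all(to_half(float(i)) == float(i) for i in range(-2048, 2049))

# Above that range neighboring integers collapse onto the same value.
a = to_half(4096.0)
b = to_half(4097.0)
difference = b - a

print(all_exact, a, b, difference)
```

The subtraction loses the entire difference between the two inputs, which is why the list above recommends moving such subtractions into the vertex shader, where float is available.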

GeForce 8 & 9 Series (Shader Model 4.0)

  • Before the vertex shader can operate on a vertex, that vertex needs to be assembled into a single data chunk; this step is called setup. During setup, each piece of vertex data, called an attribute, is fetched from the appropriate location in video memory. On the GeForce 8 & 9 series cards, only a fixed number of attributes can be fetched per clock cycle, so if a vertex becomes extremely large, the vertex setup stage of the graphics pipeline becomes the bottleneck.
    • to detect attribute bottlenecks, add some dummy data to the vertex declaration and check whether performance suffers.
    • check for and remove unused attributes.
    • try to group attributes logically, that is, combine a number of separate attributes into a single attribute (up to float4).
      • if two sets of texture coordinates are needed, it is better to pack them into a single float4 than to use two separate float2 attributes.
  • Starting with the GeForce 8 series, NVIDIA graphics chips use a unified shader architecture. This means that shader code for vertex, geometry, and pixel shaders is executed on the same hardware and benefits from the same caching and speed.
  • The performance of a geometry shader is inversely proportional to the declared output size (the product of the vertex size and the number of vertices). However, this performance degradation is not smooth; it occurs at particular output sizes. The geometry shader is not intended for output-heavy algorithms such as tessellation.
  • Because a geometry shader runs on primitives, per-vertex operations are duplicated for every primitive that shares a vertex, which potentially wastes processing power. The geometry shader is most useful for operations on small vertices or primitive data that output only small amounts of new data.
  • A good use of the geometry shader is point sprites.
  • Stream out is a new feature that lets a programmer bypass the rasterization and later stages of the graphics pipeline and write the output of the vertex/geometry shader directly into video memory.
  • Coarse Z/stencil culling (a.k.a z-cull) will not be able to cull any pixels in the following cases.
    • If clear functions are not used to clear the depth-stencil buffer.
    • If the pixel shader writes depth.
    • If the direction of the depth test is changed while writing depth.
    • If stencil writes are enabled while doing stencil testing.
  • Coarse Z/stencil culling (a.k.a z-cull) will perform less efficiently in the following circumstances.
    • If the depth buffer was written using a different depth test direction than that used for testing.
    • If the depth of the scene contains a lot of high-frequency information, that is, the depth varies a lot within a few pixels.
    • If too many large depth buffers are allocated.
  • Fine-grained Z/stencil culling (a.k.a early-z) is disabled in the following cases.
    • If the pixel shader outputs depth.
    • If the pixel shader uses the .z component of an input attribute.
    • If depth or stencil writes are enabled, or occlusion queries are enabled, and one of the following is true.
      • alpha-test is enabled
      • pixel shader kills pixels (clip(), texkill, discard)
      • alpha to coverage is enabled
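
The conditions that disable fine-grained Z/stencil culling can be written as a single predicate. The sketch below only restates the rules listed above; PipelineState and its field names are hypothetical and do not correspond to any driver or API structure.

```python
from dataclasses import dataclass

@dataclass
class PipelineState:
    """Hypothetical summary of the state that matters for early-z."""
    shader_outputs_depth: bool = False
    shader_reads_input_z: bool = False
    depth_writes: bool = False
    stencil_writes: bool = False
    occlusion_query: bool = False
    alpha_test: bool = False
    shader_kills_pixels: bool = False   # clip(), texkill, discard
    alpha_to_coverage: bool = False

def early_z_disabled(s):
    """True when, per the rules above, early-z cannot be used."""
    if s.shader_outputs_depth or s.shader_reads_input_z:
        return True
    writes_or_queries = s.depth_writes or s.stencil_writes or s.occlusion_query
    may_drop_pixels = s.alpha_test or s.shader_kills_pixels or s.alpha_to_coverage
    # The third rule needs both halves: something is written or counted,
    # and the pixel may still be dropped after the depth test.
    return writes_or_queries and may_drop_pixels

opaque_pass = PipelineState(depth_writes=True)
cutout_pass = PipelineState(depth_writes=True, alpha_test=True)
print(early_z_disabled(opaque_pass), early_z_disabled(cutout_pass))
```

Under these rules an alpha-tested pass keeps early-z only if depth writes, stencil writes, and occlusion queries are all off, for example when depth was laid down in an earlier pass.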

Reference

[1] NVIDIA GPU Programming Guide Version 2.5.0.

[2] GPU Programming Guide Version for GeForce 8 and later GPUs.

[3] https://blog.hybrid3d.dev/2020-12-21-reason-for-slow-of-if-statement-in-shader#fn:flatten

