Things to Know When Using NVIDIA GPU
- General
- Shader Programming
- GeForce FX
- GeForce 6 & 7 Series (Shader Model 3.0)
- GeForce 8 & 9 Series (Shader Model 4.0)
- GeForce 600 Series
- Turing Architecture
- References
General
- Use mipmapping to achieve better image quality, improved texture cache behavior, and higher performance.
- Double-speed rendering when writing only depth or stencil values. To enable this special rendering mode, the following conditions are required.
- color writes are disabled.
- the active depth-stencil surface is not multisampled.
- texkill has not been applied to any fragments.
- texkill (DirectX) - cancels rendering of the current pixel if any of the first three components (u, v, w) of the texture coordinates is less than zero.
- depth replace has not been applied to any fragments.
- alpha test is disabled.
- no color key is used in any of the active textures.
- color key (DirectX) - is used for alpha blending.
- user clip planes are disabled.
- Early-Z (or Z-Cull) optimization improves performance by avoiding the rendering of occluded surfaces. The following rules are required; violating them can invalidate the data the GPU uses for the early optimization and can disable Z-cull until the depth buffer is cleared again.
- do not create triangles with holes. that is, avoid alpha test or texkill.
- do not modify depth. that is, allow the GPU to use the interpolated depth value.
- Lay down depth first: double-speed depth rendering can be used as a first pass, and Z-cull will then automatically cull out fragments that are not visible during the subsequent full-shading pass.
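The payoff of laying down depth first can be illustrated with a small Python sketch (a plain software simulation, not GPU code; the function names are made up for illustration): without a prepass, back-to-front submission runs the full shader for every fragment, while a depth-only first pass lets only the visible fragment shade.

```python
def shade_count_without_prepass(depths):
    """Shade every fragment that passes the depth test when it is drawn
    (depth test is 'less'; fragments arrive in submission order)."""
    zbuf, shaded = float("inf"), 0
    for z in depths:
        if z < zbuf:         # passes the depth test -> runs the full shader
            zbuf = z
            shaded += 1
    return shaded

def shade_count_with_prepass(depths):
    """First pass writes depth only (double speed); the shading pass then
    runs the full shader only where a fragment matches the final depth."""
    zbuf = min(depths)       # depth buffer settled by the depth-only pass
    return sum(1 for z in depths if z == zbuf)

overdraw = [0.9, 0.7, 0.5, 0.3, 0.1]   # back-to-front: worst case
print(shade_count_without_prepass(overdraw))  # 5 full-shader runs
print(shade_count_with_prepass(overdraw))     # 1 full-shader run
```

Rough front-to-back sorting reduces overdraw even without a prepass, but the depth-only pass makes the benefit independent of submission order.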
- Multi-GPU programming tips with OpenGL.
- limit rendering to a single window.
- ask for a pixel format with the PFD_SWAP_EXCHANGE flag set instead of PFD_SWAP_COPY. the swap exchange flag implies that the application does not rely on the back buffer content after a `SwapBuffers()` is performed.
- rendering to the front buffer requires heavy synchronization and should be avoided at all costs.
- limit pbuffer usage. rendering to a pbuffer requires the driver to broadcast the rendering to both GPUs because the rendered result may be used by either GPU later on.
- render directly into textures instead of using `glCopyTexSubImage()`, as it causes the textures to be updated on both GPUs.
- Vertex Buffer Objects (VBOs) tips:
- load VBO working set first, or textures may block faster memory until the working set stabilizes.
- avoid huge batches when drawing geometry.
- use `unsigned short` for indices.
- use `glDrawRangeElements()` instead of `glDrawElements()`.
- use the correct VBO usage hints for the type of data.
- limit the number of textures in use at a given time and do not stream in new textures too often.
- avoid rendering to only a section of the frame using methods such as `glViewport()` or `glScissor()`.
- avoid reading back the color or depth buffers using `glReadPixels()`, and never use `glCopyPixels()`, because these cause pipeline stalls and inhibit parallelism.
- never call `glFinish()` because it does not return until all pending OpenGL commands are complete.
- Depth Bounds Test (DBT) allows the programmer to enable an additional criterion that discards a pixel after the scissor test and before the alpha test. Unlike the depth test, DBT has no dependency on the fragment’s window-space depth value. The min/max values are clamped to [0..1].
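The decision DBT makes is simple enough to sketch in a few lines of Python (the function is illustrative, not a real API; on hardware this is exposed through extensions such as EXT_depth_bounds_test). The key point: the test reads the depth already stored at the fragment's position, not the fragment's own depth.

```python
def depth_bounds_test(stored_depth, zmin, zmax):
    """Return True if the fragment survives the depth bounds test.

    DBT compares the depth value already in the depth buffer at the
    fragment's position (not the fragment's own depth) against the
    [zmin, zmax] range; the bounds are clamped to [0, 1]."""
    zmin = min(max(zmin, 0.0), 1.0)
    zmax = min(max(zmax, 0.0), 1.0)
    return zmin <= stored_depth <= zmax

print(depth_bounds_test(0.5, 0.2, 0.8))    # True: stored depth inside bounds
print(depth_bounds_test(0.9, 0.2, 0.8))    # False: fragment discarded
print(depth_bounds_test(0.95, 0.2, 1.7))   # True: zmax clamps to 1.0
```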
- To be able to see the potential effects of pipeline optimization, it is important to measure the total rendering time per frame with double buffering disabled, that is, in single-buffer mode by turning vertical synchronization off. This is because with double buffering turned on, swapping of the buffers occurs only in synchronization with the frequency of the monitor.
- even with double buffering, screen tearing occurs when the back buffer is swapped with the front buffer while the screen has only been partially drawn by the video hardware. V-Sync is the technology to avoid this problem. `glfwSwapInterval(1)` acts like V-Sync as well. In general, the argument is 1; do not use a value greater than 1.
- Changing the state can be expensive, and state change costs are mostly on the CPU side, in the driver. The cost order is as follows from the most expensive to least, as of 2014. One even more expensive change is switching between the GPU’s rendering mode and its compute shader mode.
- Render target (framebuffer object), ∼60k/sec.
- Shader program, ∼300k/sec.
- Blend mode (ROP), such as for transparency.
- Texture bindings, ∼1.5M/sec.
- Vertex format.
- Uniform buffer object (UBO) bindings.
- Vertex bindings.
- Uniform updates, ∼10M/sec.
- A common way to minimize texture binding changes is to put several texture images into one large texture or, better yet, a texture array. If the API supports it, bindless textures are another option to avoid state changes.
- Often several uniforms can be defined and set as a group, so binding a single uniform buffer object is considerably more efficient. In DirectX these are called constant buffers. Using these properly saves both time per function and time spent error-checking inside each individual API call.
- Modern drivers often defer setting state until the first draw call encountered. If redundant API calls are made before then, the driver will filter these out, thus avoiding the need to perform a state change. Often a dirty flag is used to note that a state change is needed, so going back to a base state after each draw call may become costly.
- Consider the routine `Enable(X); Draw(M1); Disable(X);` followed by `Enable(X); Draw(M2); Disable(X);` for a state X. In this case, it is also likely that significant time is wasted setting the state again between the two draw calls, even though no actual state change occurs between them.
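A toy Python model of the driver's redundant-state filtering shows why the enable/disable-around-every-draw habit hurts (the class and counters are made up for illustration; real drivers track far more state):

```python
class StateCache:
    """Driver-style redundant state filter: a 'set' call only counts as a
    real state change if the value differs from what is already set."""
    def __init__(self):
        self.state = {}
        self.real_changes = 0

    def set(self, name, value):
        if self.state.get(name) != value:   # dirty check
            self.state[name] = value
            self.real_changes += 1          # only genuine changes reach the GPU

gpu = StateCache()
gpu.set("X", True); gpu.set("X", False)   # Enable(X); Draw(M1); Disable(X)
gpu.set("X", True); gpu.set("X", False)   # Enable(X); Draw(M2); Disable(X)
print(gpu.real_changes)  # 4: returning to a base state defeats the filter

gpu2 = StateCache()
gpu2.set("X", True); gpu2.set("X", True)  # Enable(X); Draw(M1); Draw(M2)
print(gpu2.real_changes)  # 1: leaving X enabled filters the redundant call
```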
Shader Programming
- Prefer the lowest pixel shader version that suffices.
- Prefer the lowest data precision that suffices.
- 12-bit fixed-point format `fixed` is fastest and should be used for low-precision calculations, such as color computation.
- 16-bit floating-point format `half` delivers higher performance than the 32-bit floating-point format `float`. `float` can be used when the highest possible accuracy is needed.
- Save computations by using algebra.
- For example, `dot(normalize(N), normalize(L))`, that is, (N/|N|) dot (L/|L|), requires two expensive reciprocal square root computations. Using the identity (N/|N|) dot (L/|L|) = (N dot L)/(|N|*|L|) = (N dot L)/sqrt((N dot N)*(L dot L)), a single square root suffices.
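The identity can be checked numerically with a quick Python sketch (helper names are made up):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_normalize(N, L):
    # naive form: one expensive reciprocal square root per normalize
    nlen = math.sqrt(dot(N, N))
    llen = math.sqrt(dot(L, L))
    return dot([x / nlen for x in N], [x / llen for x in L])

def one_rsqrt(N, L):
    # algebraic rewrite: a single square root of the product of squared lengths
    return dot(N, L) / math.sqrt(dot(N, N) * dot(L, L))

N, L = [1.0, 2.0, 3.0], [-2.0, 0.5, 4.0]
assert abs(two_normalize(N, L) - one_rsqrt(N, L)) < 1e-12
```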
- Do not write overly generic library functions because the function may include unnecessary computation for the generality.
- Precompute uniforms on the CPU before a shader runs if possible.
- `A * 3.0` in shader code can be replaced by premultiplying `A` by `3.0` on the CPU.
- inverse and transpose matrices can be precomputed on the CPU because they do not have to be calculated per-vertex or per-fragment.
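For instance, the inverse-transpose (normal) matrix can be built once per frame on the CPU instead of once per vertex. A minimal Python sketch using cofactor expansion for the 3x3 inverse (helper names are illustrative):

```python
def mat3_transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def mat3_inverse(m):
    # cofactor expansion; cheap to do once on the CPU, wasteful per vertex
    c = [[m[(i + 1) % 3][(j + 1) % 3] * m[(i + 2) % 3][(j + 2) % 3]
          - m[(i + 1) % 3][(j + 2) % 3] * m[(i + 2) % 3][(j + 1) % 3]
          for j in range(3)] for i in range(3)]
    det = sum(m[0][j] * c[0][j] for j in range(3))
    return [[c[j][i] / det for j in range(3)] for i in range(3)]  # adj/det

def mat3_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# model matrix; the normal matrix is its inverse-transpose, computed once here
M = [[2.0, 0.0, 1.0], [0.0, 4.0, 0.0], [1.0, 0.0, 5.0]]
normal_matrix = mat3_transpose(mat3_inverse(M))

# sanity check: M * M^-1 is the identity
I = mat3_mul(M, mat3_inverse(M))
for i in range(3):
    for j in range(3):
        assert abs(I[i][j] - (1.0 if i == j else 0.0)) < 1e-12
```

The shader then receives `normal_matrix` as a uniform instead of inverting per vertex.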
- Do not use uniforms for constants like 0, 1, and 255; doing so makes it harder for compilers to distinguish between constants and shader parameters.
- Balance the vertex and pixel shaders. Look for opportunities to move calculations to the vertex shader if the pixel shader seems to be the bottleneck.
- Replace complex functions, such as `log` and `exp`, with texture lookups. Textures are a great way to encode complex functions; think of them as multidimensional arrays.
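A Python sketch of the idea: bake `exp` into a small 1D "texture" and sample it with linear filtering, the way a shader would use a lookup texture (the size and names are arbitrary choices for illustration):

```python
import math

SIZE = 256  # texels; choose the resolution to match the accuracy you need

# bake exp(x) for x in [0, 1] into a 1D lookup table
lut = [math.exp(i / (SIZE - 1)) for i in range(SIZE)]

def sample_exp(x):
    """Linearly filtered lookup of exp(x) for x in [0, 1]."""
    t = x * (SIZE - 1)
    i = min(int(t), SIZE - 2)
    f = t - i
    return lut[i] * (1.0 - f) + lut[i + 1] * f  # lerp between neighbor texels

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(sample_exp(x) - math.exp(x)) < 1e-4
```

On hardware the filtering is free, so the lookup replaces the transcendental instruction entirely.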
GeForce FX
- The GeForce FX can natively handle 512 pixel instructions per pass in Direct3D and 1,024 in OpenGL. Quadro FX cards can handle 2,048 pixel instructions per pass.
- The ARB_fragment_program extension requires 24-bit floating-point precision at a minimum by default. Various flags can be put at the top of the ARB_fragment_program source code.
- NV_fragment_program allows the `half` and `fixed` formats.
- ARB_precision_hint_fastest makes the Unified Compiler determine the appropriate precision for each shader operation at run-time.
- ARB_precision_hint_nicest forces the entire program to run in `float` precision.
- `half` can only exactly represent the integers from -2,048 to 2,048, with no fractional bits left over.
- if two values such as 4,096 and 4,097 are represented by the same 16-bit floating-point number, the subtracted result will be zero.
- the workaround is to move matrix and vector subtraction operations into the vertex shader.
- vertex shaders are required at a minimum to support `float`, so they can easily handle large world and view spaces.
- in general, perform constant calculations on the CPU, linear calculations in the vertex shader, and nonlinear calculations in the pixel shader.
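The 4,096 vs. 4,097 collapse described above can be reproduced by quantizing to the fp16 grid in Python (a simplified model of half-precision rounding for positive normal values; not a full IEEE converter):

```python
import math

def round_to_half(x):
    """Quantize a positive value to the nearest number representable in
    16-bit floating point (10 explicit significand bits, normal range only).
    Simplified model for illustration; not a full fp16 converter."""
    e = math.floor(math.log2(x))
    step = 2.0 ** (e - 10)            # spacing of fp16 values at this magnitude
    return round(x / step) * step     # Python rounds half to even, like IEEE

# integers up to 2,048 survive; beyond that, neighboring integers collapse
assert round_to_half(2048.0) == 2048.0
assert round_to_half(2049.0) == 2048.0
assert round_to_half(4096.0) == 4096.0
assert round_to_half(4097.0) == 4096.0
# so subtracting two nearby large values in half precision can yield zero
assert round_to_half(4097.0) - round_to_half(4096.0) == 0.0
```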
- On GeForce FX hardware, floating-point render targets do not support blending and floating-point textures larger than 32-bits per texel do not support mipmapping or filtering.
GeForce 6 & 7 Series (Shader Model 3.0)
- Dynamic branching saves performance by skipping unnecessary calculations in loops or conditionals.
- Instancing allows the programmer to submit a single draw call that renders many objects, using the same data for the object shape but varying it through per-instance data streams.
- Multiple Render Targets (MRTs) allow a pixel shader to write out data to up to four different targets. Note that MRTs restrict other GPU features.
- hardware-accelerated antialiasing is inapplicable to MRTs.
- all render targets must have the same width, height, and bit depth.
- the post-pixel-shader blend operations (alpha blending, alpha testing, fogging, and dithering) are only available for MRTs.
- Use write masks and swizzles, which can help the compiler identify scheduling opportunities for vector operations.
- Use partial precision whenever possible. GeForce 6 & 7 Series have a special free fp16 normalize unit in the shader, which allows 16-bit floating-point normalization to happen very efficiently in parallel with other computations. Also, partial precision helps to reduce register pressure.
GeForce 8 & 9 Series (Shader Model 4.0)
- Before the vertex shader can operate on a vertex, that vertex needs to be assembled into a single data chunk, which is called setup. During setup, each vertex attribute is fetched from the appropriate location in video memory. On GeForce 8 & 9 series cards, only a fixed number of attributes can be fetched per clock cycle, so if a vertex becomes extremely large, the vertex setup stage of the graphics pipeline will become the bottleneck.
- to detect attribute bottlenecks, add some dummy data to the vertex declaration and check whether performance suffers.
- check for and remove unused attributes.
- try to perform logical grouping of attributes, that is, combine a number of separate attributes into a single attribute (up to `float4`).
- if a pair of texture coordinates is needed, it is better to pack them into a single `float4` than to use two separate `float2` attributes.
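A toy Python accounting of attribute slots makes the packing win concrete (the one-slot-per-attribute assumption is a simplification for illustration; the per-clock fetch limit is the hardware detail described above):

```python
def attribute_slots(attribute_sizes):
    """Count float4 fetch slots: each attribute occupies one slot no matter
    how many of its four components it actually uses (sizes are in floats)."""
    assert all(1 <= n <= 4 for n in attribute_sizes)
    return len(attribute_sizes)

unpacked = [3, 3, 2, 2]   # position, normal, uv0, uv1
packed = [3, 3, 4]        # position, normal, uv0 and uv1 packed into one float4

print(attribute_slots(unpacked))  # 4
print(attribute_slots(packed))    # 3
```

Both layouts carry the same 40 bytes per vertex; only the slot count, and thus the setup cost, differs.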
- From the GeForce 8 series onward, NVIDIA graphics chips use a unified shader architecture. This means that shader code for vertex, geometry, and pixel shaders is executed on the same hardware and benefits from the same caching and speed.
- The performance of a geometry shader is inversely proportional to its declared output size (the product of the vertex size and the number of vertices). However, this performance degradation occurs at particular output sizes, and is not smooth. The main use of the geometry shader is therefore not heavy-output algorithms such as tessellation.
- Because a geometry shader runs on primitives, per-vertex operations will be duplicated for all primitives that share a vertex. This is potentially a waste of processing power. The geometry shader is most useful for operations on small vertices or primitive data that require outputting only small amounts of new data.
- A decent use of the geometry shader is point sprites.
- Stream out is a new feature that lets the programmer bypass the rasterization and later stages of the graphics pipeline and write the output of the vertex/geometry shader directly into video memory.
- Coarse Z/Stencil culling (or Z-Cull) will not be able to cull any pixels in the following cases.
- If clear functions are not used to clear the depth-stencil buffer.
- If the pixel shader writes depth.
- If the direction of the depth test is changed while writing depth.
- If stencil writes are enabled while doing stencil testing.
- Coarse Z/Stencil culling (or Z-Cull) will perform less efficiently in the following circumstances.
- If the depth buffer was written using a different depth test direction than that used for testing.
- If the depth of the scene contains a lot of high frequency information, that is, the depth varies a lot within a few pixels.
- If too many large depth buffers are allocated.
- Fine-grained Z/Stencil culling (or Early-Z) is disabled in the following cases.
- If the pixel shader outputs depth.
- If the pixel shader uses the `.z` component of an input attribute.
- If depth or stencil writes are enabled, or occlusion queries are enabled, and one of the following is true.
- alpha test is enabled
- the pixel shader kills pixels (`clip()`, texkill, `discard`)
- alpha-to-coverage is enabled
GeForce 600 Series
- GPU Boost arose in part because some synthetic benchmarks worked many parts of the GPU’s pipeline simultaneously and so pushed power usage to the limit, meaning that NVIDIA had to lower its base clock rate to keep the chip from overheating.
- Many applications do not exercise all parts of the pipeline to such an extent, so can safely be run at a higher clock rate.
- The GPU Boost technology tracks GPU power and temperature characteristics and adjusts the clock rate accordingly.
- This variability can cause the same benchmark to run at different speeds, depending on the initial temperature of the GPU.
Turing Architecture
- The Turing Streaming Multiprocessor (Turing SM) delivers a dramatic boost, achieving a 50% improvement in delivered performance per CUDA core compared to the Pascal generation.
- In previous shader architectures, the floating-point math datapath sat idle whenever a non-FP-math instruction ran, such as integer adds for addressing and fetching data, or floating-point compares or min/max for processing results. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating-point math.
- The SM memory path has been redesigned to unify shared memory, L1, and texture caching into one unit. This translates to 2x more bandwidth and more than 2x more capacity available for L1 cache for common workloads. Combining the L1 data cache with the shared memory reduces latency and provides higher bandwidth than the L1 cache implementation used previously in Pascal GPUs.
- With Texture-Space Shading (TSS), objects are shaded in a private coordinate space (a texture space) that is stored in a texture as texels. TSS remembers which texels have been shaded and only shades those that have been newly requested. Texels shaded and recorded can be reused to service other shade requests in the same frame, in an adjacent scene, or in a subsequent frame.
- With TSS, the two major operations of visibility sampling (rasterization and z-testing) and appearance sampling (shading) can be decoupled and performed at a different rate, on a different sampling grid, or even on a different timeline.
- The geometry is still rasterized to produce screen-space pixels, but each screen-space pixel is mapped into a separate texture space, and the associated texels in texture space are shaded. The mapping to texture space is a standard texture mapping operation. The texture is created on demand based on sample requests, only generating values for texels that are referenced.
- Turing is the first GPU architecture to support GDDR6 memory. GDDR6 is the next big advance in high-bandwidth GDDR DRAM memory design. GDDR6 memory interface circuits in Turing GPUs have been completely redesigned for speed, power efficiency and noise reduction, achieving 14 Gbps transfer rates at 20% improved power efficiency compared to GDDR5X memory used in Pascal GPUs.
References
[1] NVIDIA GPU Programming Guide Version 2.5.0.
[2] GPU Programming Guide Version for GeForce 8 and later GPUs.
[4] GPU Boost
[5] Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. 2018. Real-Time Rendering, Fourth Edition (4th. ed.). A. K. Peters, Ltd., USA.
[6] J. Gregory, Game Engine Architecture, Third Edition, CRC Press