White Paper Continued
Many interesting algorithms and techniques will really only be possible or practical to implement on upcoming DirectX 11-enabled GPUs that support Shader Model 5.0. Here is a summary of some of the key advantages Shader Model 5.0 offers over Shader Model 4.0:
These features are described in more detail below.
1. Improved Parallelism – The following features of DirectX 11-enabled GPUs greatly enhance a programmer’s ability to exploit parallelism:
- Increased Thread Group Size and 3D Thread Dispatch: A Thread Group is a set of threads that work together to efficiently implement a partitioned data parallel algorithm. DirectX 11-enabled GPUs improve the efficiency of memory accesses by allowing the coherent exchange of data between threads within a group, thus enabling parallel algorithms to execute in fewer passes. This is designed not only to increase processing speed but also to improve power efficiency, by allowing higher throughput with fewer accesses to off-chip memory. Shader Model 5.0 supports larger and more flexible thread groups with 3D indexing, giving programmers improved control over the domain in which they define their algorithms, and enabling additional throughput due to increased multi-threading in the GPU.
- Atomic Operation Support: This is a key feature of CPUs that programmers have been demanding for GPUs as well. Atomic operations enable the more efficient and accurate combination of operations that try to modify the same memory addresses. GPUs are capable of running thousands of threads or thread groups in parallel, and if two or more of these threads try to manipulate the same variable or access the same memory location, it could result in data corruption. Without atomic operations, programmers either had to modify their algorithms to avoid these situations, or otherwise serialize updates to shared variables or memory locations (effectively eliminating much of the performance benefit from parallel processing). Atomic operations allow these situations to be handled gracefully regardless of the number of parallel threads being executed, which helps maximize performance and simplify porting of algorithms from the CPU to the GPU.
- Gather4: Modern GPUs use dedicated hardware blocks known as texture units to fetch data rapidly into their processing cores. These texture units have historically been optimized for rendering graphics, where techniques such as bilinear filtering are typically used to improve image quality. Compute Shaders often make use of these same units to fetch data as well, but they generally have no use for their filtering capabilities, leaving them underutilized.
GPUs with Shader Model 5.0 support can put this excess fetch capability to use with the Gather4 operation, which can fetch up to 4 values simultaneously and provide up to a 4x increase in data bandwidth.
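The 3D thread dispatch and atomic update ideas from the bullets above can be sketched on the CPU. The following Python is only an emulation under stated assumptions, not shader code: the function names are hypothetical, each thread group is run as an ordinary Python thread, and a lock stands in for a hardware atomic add (the role InterlockedAdd plays in HLSL). The ID arithmetic mirrors the relationship between the HLSL system values SV_GroupID, SV_GroupThreadID, SV_DispatchThreadID and SV_GroupIndex.

```python
import threading
from itertools import product


def dispatch_thread_id(group_id, group_thread_id, group_size):
    """Global 3D thread ID: group_id * group_size + group_thread_id, per axis."""
    return tuple(g * s + t for g, s, t in zip(group_id, group_size, group_thread_id))


def group_index(group_thread_id, group_size):
    """Flattened index of a thread inside its group (x varies fastest)."""
    x, y, z = group_thread_id
    size_x, size_y, _ = group_size
    return z * size_x * size_y + y * size_x + x


def run_histogram(group_count, group_size, num_bins):
    """Emulate a 3D dispatch in which every thread increments one shared bin."""
    bins = [0] * num_bins
    lock = threading.Lock()  # assumption: stands in for a hardware atomic add

    def run_group(group_id):
        # Visit every (x, y, z) thread position inside this group.
        for tid in product(*(range(s) for s in group_size)):
            gx, gy, gz = dispatch_thread_id(group_id, tid, group_size)
            target = (gx + gy + gz) % num_bins
            with lock:  # the read-modify-write is one indivisible step
                bins[target] += 1

    # One Python thread per thread group, covering the whole 3D grid of groups.
    workers = [
        threading.Thread(target=run_group, args=(gid,))
        for gid in product(*(range(c) for c in group_count))
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return bins
```

Because every increment is performed as a single indivisible step, the bin totals always add up to the number of threads launched, regardless of how the groups interleave. Without that guarantee, two threads could read the same old value and one update would be lost.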
2. Improved Precision and Integer Processing: DirectX 11 enables support for double precision (64-bit) floating point operations on the GPU, according to the IEEE-754 standard. Until recently, this level of precision was only supported on CPUs, with GPUs being limited to single precision (32-bit) operations. While single precision is sufficient for most graphics applications, it can be insufficient for some simulation or number-crunching tasks that require large numbers of iterations on a single data value, or work with very large or very small values. Shader Model 5.0 also adds new integer and bit manipulation operations, such as count bits set, find first bit, insert/extract bit fields, reverse bits, and new bit shift operations. Applications such as video processing and cryptography use operations like these extensively, and can therefore benefit from improved performance on DirectX 11 GPUs.
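To make the precision and bit manipulation points above concrete, here is a small Python sketch. It is an illustration under assumptions rather than GPU code: all helper names are hypothetical, single precision is emulated by rounding each intermediate result to 32 bits with the struct module, and the bit helpers only mimic the kind of operation the paper lists (HLSL exposes some of them as countbits, firstbithigh, firstbitlow and reversebits).

```python
import struct

MASK32 = 0xFFFFFFFF  # keep values inside a 32-bit unsigned range


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single):
    """Add `step` to a running total n times, optionally in single precision."""
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)  # emulate a 32-bit accumulator
    return total


def count_bits(v):
    """Number of bits set in a 32-bit value."""
    return bin(v & MASK32).count("1")


def first_bit_high(v):
    """Position of the most significant set bit, or -1 if no bit is set."""
    return (v & MASK32).bit_length() - 1


def first_bit_low(v):
    """Position of the least significant set bit, or -1 if no bit is set."""
    v &= MASK32
    return (v & -v).bit_length() - 1


def reverse_bits(v):
    """Reverse the order of the 32 bits."""
    return int(format(v & MASK32, "032b")[::-1], 2)


def extract_bits(v, offset, width):
    """Read `width` bits starting at bit `offset`."""
    return (v >> offset) & ((1 << width) - 1)


def insert_bits(base, value, offset, width):
    """Overwrite `width` bits of `base` at `offset` with the low bits of `value`."""
    mask = ((1 << width) - 1) << offset
    return (base & ~mask & MASK32) | ((value << offset) & mask)
```

Calling accumulate(0.1, 10000, single=True) and accumulate(0.1, 10000, single=False) and comparing each result with 1000 shows how rounding error builds up much faster in the 32-bit accumulator, which is the situation the paper describes for long-running iterative calculations.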
3. Tight Integration between Compute Shaders and Rendering Pipeline: Although Compute Shaders are primarily intended to handle non-graphics tasks, they can often be used to enhance or interface with a rendering pipeline to influence what is sent to a display. Examples include simulation tasks, like game physics or artificial intelligence, that can control the movement or behavior of objects and characters that are drawn on-screen; sorting techniques, like order-independent transparency, that optimize the rendering of large numbers of objects; and post-processing effects, like tone mapping and depth of field, which can apply various filters to modify and enhance an image after it has finished rendering. DirectX 11 Compute Shaders share a common instruction set with other DirectX 11 shader types used for rendering (including Vertex, Hull, Domain, Geometry, and Pixel Shaders), and can share data structures to implement these techniques in a much more practical and efficient manner.
4. Improved Ease of Programming and More Efficient Memory Usage: Powerful hardware is useless without software that can take advantage of the hardware’s capabilities. As a compute language, Shader Model 5.0 enables significant improvements that can enhance a programmer’s ability to model programs and algorithms for the GPU that were once impractical or impossible. By liberating development time from working around the restrictions of earlier GPU compute languages, the programmer’s imagination and energies can be focused instead on the actual compute problem. Shader Model 5.0 adds some key features that make it easier to model and solve compute problems on the GPU, including:
- Increased Shared Memory with Improved Access: A key feature of DirectX 11 Compute Shaders is support for shared memory, which allows communication between threads. Shader Model 5.0 doubles the amount of shared memory available to a thread group, from 16 to 32 kilobytes. In addition to more shared memory, DirectX 11 class GPUs allow indexed reads and writes to this memory, whereas older DirectX 10 / 10.1 class GPUs limited access to private writes with shared reads. Allowing threads to directly read and write shared memory increases data parallelism within thread groups and simplifies porting of CPU code to run on the GPU. The combination of larger thread groups and more shared memory can also greatly reduce the number of non-local memory accesses required by some algorithms, which would reduce memory bandwidth requirements and improve performance.
- Append/Consume Buffers: Shader Model 5.0 supports a new type of data buffer that behaves like a stack or a list, instead of a fixed array of values. New data values are written to the end of the list as they are generated, or read from the end of the list as they are required. This is useful for implementing irregular data structures that require a different number of data values for each element, or for adaptive techniques like stream compaction that do a variable amount of processing for each element. Append buffers allow these processes to be performed in a single pass over the data, instead of requiring multiple passes which consume more memory bandwidth and compute cycles.
- Unordered Access Views (UAV): A UAV is a buffer that allows data to be written to or read from arbitrary locations, instead of a pre-defined order. Also known as “scatter/gather” operations, these add a great deal of flexibility that was not available in older GPUs. DirectX 11 extends this flexibility beyond what was possible with DirectX 10 class GPUs by allowing Compute Shaders to access up to 8 different UAVs at a time instead of just one. The DirectX 11 programming interface allows these UAVs to be accessed by Pixel Shaders as well, which facilitates data sharing between Compute Shaders and the rendering pipeline. These enhancements allow a variety of pre- and post-processing algorithms to be implemented more efficiently with DirectX 11 class GPUs.
- Indirect Compute Dispatch: This feature enables the generation of new workloads created by previous rendering or compute shading without CPU intervention. This further reduces CPU overhead and frees up more processing time to be used on other tasks.
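The append/consume and indirect dispatch bullets above fit together naturally: one pass appends a variable number of results, and the resulting element count decides how much work the next pass launches. The Python below is a minimal CPU-side sketch of that flow under stated assumptions. The class and function names are hypothetical, the buffer is a plain list with a counter, and the "dispatch arguments" are just a tuple of group counts; on real hardware the counter is updated atomically and stays in GPU memory.

```python
class AppendConsumeBuffer:
    """Fixed-capacity buffer with a hidden element counter (illustrative only)."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        self.count = 0  # assumption: updated atomically on real hardware

    def append(self, value):
        """Write a value at the current end of the list."""
        if self.count >= len(self.data):
            raise OverflowError("append buffer is full")
        self.data[self.count] = value
        self.count += 1

    def consume(self):
        """Remove and return the value at the current end of the list."""
        if self.count == 0:
            raise IndexError("consume from an empty buffer")
        self.count -= 1
        return self.data[self.count]


def compact(values, keep):
    """Single-pass stream compaction: append only the elements that pass `keep`."""
    out = AppendConsumeBuffer(len(values))
    for v in values:
        if keep(v):
            out.append(v)
    return out


def indirect_dispatch_args(element_count, group_size):
    """Thread group counts (x, y, z) needed to cover element_count items."""
    groups_x = (element_count + group_size - 1) // group_size  # round up
    return (groups_x, 1, 1)
```

For example, compacting [5, -2, 7, 0, -9, 3] with a "greater than zero" test leaves three elements, and indirect_dispatch_args(3, 64) then asks for a single thread group. In this sequential sketch the surviving elements keep their input order; with many GPU threads appending at once, the order in which elements land in the buffer is generally not guaranteed.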
Now that you have read the AMD White Paper, you can see what AMD considers important and why DX 11 matters. This will likely be a hot topic for many months to come.