AMD High Bandwidth Memory Detailed With Joe Macri

AMD High Bandwidth Memory

AMD recently hosted a conference call on High Bandwidth Memory with Joe Macri that gave more information about HBM, and we wanted to share that information and the entire slide deck with our readers today. AMD believes that the High Bandwidth Memory (HBM) interface is the right solution for its high-end Radeon R9 300 series video cards and will likely be rolling it out on APUs in the near future as well. This is new technology that you’ll want to know about, and it won’t just be used by AMD. AMD and SK Hynix worked together with a number of companies to develop and set the HBM standard, and HBM now has a JEDEC specification!

AMD High Bandwidth Memory

There were many challenges in bringing HBM to market, and the chart in the slide above is the image that AMD drew on the whiteboard seven years ago to lay out the coming challenges for GPU memory. AMD saw that we are living in a power-limited world where, as one subsystem of a chip goes up in power, another has to come down. The problem is that GDDR5 memory power was ever increasing and there are only so many watts to go around. Looking at memory bandwidth projections, AMD could see that down the road it wouldn’t be able to get the bandwidth it needed if it kept heading down the GDDR5 path. AMD saw that bandwidth per watt was going to be a big challenge.

AMD_High_Bandwidth_Memory_Page_03

The other problem with GDDR5 memory is form factor: devices are getting smaller, but the PCB area needed for the ASIC and memory on a graphics card isn’t shrinking with them. AMD saw this as an issue and started looking at bandwidth per mm² as a metric to improve in order to bring ‘incredible’ new video card form factors to market.

AMD_High_Bandwidth_Memory_Page_05

DRAM is not size or cost effective to integrate into an SoC or GPU, as the DRAM process is optimized for bit cells that hold a charge like a capacitor. You can implement logic on it, but the transistors behave like they are many generations older, and merging a DRAM process with a logic process is tough to do and adds extra cost. AMD believes this is the reason embedded DRAM has never taken off and that keeping the DRAM separate from your logic process is almost always the best choice.

AMD_High_Bandwidth_Memory_Page_06

AMD looked at scaling GDDR5 faster, but while that might have worked for a period of time, it still had big issues: they couldn’t reduce the footprint, the asymmetrical memory interface design was increasing latencies, and it was using more and more power.

AMD_High_Bandwidth_Memory_Page_07

AMD decided to go in another direction and looked at going wide and slow, since they were primarily after bandwidth. In this particular case they also got improved latencies, which wasn’t the primary goal. AMD created a passive interposer that has no active transistors, which allows for incredible interconnect densities because everything sits so close together. With this design AMD was able to go from a 32-bit bus per GDDR5 device to a 1024-bit-wide bus per HBM stack. The interposer is built by UMC on an older process node, as it doesn’t need to be made using the latest fab technology.

AMD_High_Bandwidth_Memory_Page_08

This slide shows the general HBM layout. On the bottom you have a standard organic package substrate that is now totally separated from the high-speed memory subsystem. Above that sits the silicon interposer, which attaches to the package substrate on one side and to the HBM logic die and GPU on the other. HBM is a true 3D design made up of four DRAM dies stacked on one logic die. The little green columns in the image above are Through-Silicon Via (TSV) interconnections. These are basically holes punched through the silicon of the HBM DRAM and logic dies as well as the interposer; all five of these dies have TSVs on them, and each piece of silicon is just 100 microns thick! The interconnect between the DRAM stack and the discrete GPU runs through the PHY blocks shown on the edges of the dies.

The PHYs are laid out on the edges of the dies to keep latency to a minimum. HBM does decrease latency, as data is for the most part no longer moved horizontally to a central part of the die as it was with GDDR5; with HBM it is pushed down vertically, and that really helps. HBM also has more channels and banks, which improves pseudo-random access, something that is critical to reducing latency for the HPC market.

AMD_High_Bandwidth_Memory_Page_09

This slide compares GDDR5 to HBM, and you need to compare things differently since we have gone to 3D and are no longer 2D planar. As mentioned a few slides ago, the bus width has gone from 32-bit to 1024-bit and the clock speed has gone from 1750MHz (7Gbps per pin) to 500MHz (1Gbps per pin). This is much slower, but they can get away with it by having such a wide bus. Running at only 500MHz also means AMD can use much simpler clocking, with no terminators needed for this interface. Simple clocking also makes this a low-power solution, as it needs just 1.3V versus 1.5V. Bandwidth has gone from 28GB/s per GDDR5 chip to 128GB/s per HBM stack, so there is a massive bandwidth improvement to be had by going from GDDR5 to HBM. Each Gen 1 HBM stack is capable of 128GB/s and there are four stacks in each package. Each stack offers 1GB of storage space, so the first video cards using this technology will have 4GB of HBM memory. Moving to Gen 2 HBM will basically double performance, so things will only be getting better. AMD isn’t discussing how much the latency was reduced, as it wants to keep the memory subsystem under wraps so competitors have to figure it out themselves.
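
To make the math behind those figures concrete, here is a quick back-of-the-envelope check (our own arithmetic, not AMD’s): peak per-device bandwidth is just bus width times per-pin data rate, which reproduces the 28GB/s and 128GB/s numbers, and multiplying the HBM figure by four stacks gives the 512GB/s aggregate a four-stack card would offer.

```python
# Back-of-the-envelope check of the quoted per-device bandwidth figures (illustrative only).
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) * per-pin data rate (Gb/s) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps_per_pin / 8

gddr5_chip = peak_bandwidth_gbs(32, 7.0)     # 1750MHz base clock, 7Gb/s per pin -> 28.0 GB/s
hbm1_stack = peak_bandwidth_gbs(1024, 1.0)   # 500MHz, 1Gb/s per pin             -> 128.0 GB/s

print(gddr5_chip, hbm1_stack)   # 28.0 128.0
print(4 * hbm1_stack)           # 512.0 GB/s across four HBM1 stacks
```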

AMD_High_Bandwidth_Memory_Page_10

When it comes to bandwidth per watt, GDDR5 delivers 10.66GB/s per watt while HBM delivers over 35GB/s per watt, and that is with HBM1, not the newer HBM2 technology.
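
Working backwards from those efficiency figures (again our own rough arithmetic, not a number AMD provided), you can estimate the power each memory device draws:

```python
# Implied power per device from the quoted bandwidth-per-watt figures (rough estimate).
gddr5_gbs_per_watt = 10.66
hbm1_gbs_per_watt = 35.0    # quoted as "over 35", so treat this as a lower bound

gddr5_watts_per_chip = 28.0 / gddr5_gbs_per_watt     # ~2.6 W per GDDR5 chip
hbm1_watts_per_stack = 128.0 / hbm1_gbs_per_watt     # ~3.7 W (or less) per HBM1 stack

print(round(gddr5_watts_per_chip, 2), round(hbm1_watts_per_stack, 2))
```

In other words, an HBM1 stack delivers more than 4.5x the bandwidth of a single GDDR5 chip while drawing only around one watt more.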

AMD_High_Bandwidth_Memory_Page_11

AMD gets a massive area reduction with HBM, going from 672mm² for a 1GB GDDR5 solution to just 35mm² for a 1GB HBM stack. That is roughly a 19x reduction in surface area for the same amount of DRAM.

AMD_High_Bandwidth_Memory_Page_12

The overall PCB footprint for the AMD Radeon R9 290X video card is 9,900mm² when you measure the area occupied by both the ASIC and the memory. AMD says that footprint can be reduced to less than 4,900mm² with an HBM-based ASIC. That is greater than a 50% reduction in the overall PCB footprint and opens up a ton of board real estate, which means smaller form factors will be possible. AMD told Legit Reviews that the savings are actually greater than shown in this illustration, as there is space to be saved in the power subsystem too, since a lower-power memory subsystem needs less power delivery circuitry.
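
Putting the two area slides together, the quoted figures work out as follows (a quick calculation of our own based on the numbers above):

```python
# Area comparison derived from AMD's quoted figures (our own arithmetic).
gddr5_area_per_gb = 672.0    # mm^2 for a 1GB GDDR5 solution
hbm1_area_per_gb = 35.0      # mm^2 for a 1GB HBM1 stack
print(round(gddr5_area_per_gb / hbm1_area_per_gb, 1))    # ~19.2x less area per GB of DRAM

r9_290x_footprint = 9900.0   # mm^2 for ASIC plus GDDR5 on the Radeon R9 290X
hbm_asic_footprint = 4900.0  # mm^2, quoted as "less than" for an HBM-based ASIC
print(round(100 * (1 - hbm_asic_footprint / r9_290x_footprint), 1))   # >50.5% smaller
```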

AMD_High_Bandwidth_Memory_Page_13

AMD believes that HBM will allow it to bring performance to places it has never been before. AMD pulled this off with a large ecosystem, and it took a large number of test chips over the past seven years. SK Hynix was the major contributor on the DRAM side.

HBM isn’t just for graphics cards, though, as AMD sees benefits from HBM in HPC, APUs, switches, printers, high-speed professional cameras and so on. From what we gather, AMD will be introducing HBM memory on a couple of high-end desktop graphics cards like the AMD Radeon R9 390X and R9 390. It won’t be used across a top-to-bottom product lineup, which might cause some to wonder how cost effective this will be, since these obviously won’t be super high volume cards. AMD said that is nothing to be concerned about, as things were done very similarly when GDDR5 was first introduced and everything turned out okay. AMD’s RV770 GPU (Radeon HD 4850 and Radeon HD 4870) was the first in the world to feature a 256-bit GDDR5 memory controller, and the Radeon HD 4870 was unique in the sense that it was the first GPU to support GDDR5 memory. The memory on the Radeon HD 4870 back in June 2008 was running at 900MHz, giving an effective transfer rate of 3.6GHz and a memory bandwidth of up to 115.2GB/s. AMD sees HBM moving into the discrete graphics card line and well beyond just GPUs if you look far enough out. AMD will use HBM where it makes sense, feels it will be the most aggressive with this technology, and believes it is a year ahead of its competition (NVIDIA) right now.
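
The same bus-width-times-data-rate arithmetic used for HBM above reproduces that Radeon HD 4870 bandwidth figure as well (just another sanity check of our own):

```python
# Radeon HD 4870: 256-bit bus, 900MHz GDDR5 (quad data rate -> 3.6Gb/s per pin).
hd4870_bw = 256 * (0.9 * 4) / 8
print(hd4870_bw)    # 115.2 GB/s
```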

One of the final comments made by Joe Macri was that the end user places value on performance, form factor and power. Those are areas where AMD is arguably losing to NVIDIA right now on the high end, so we can’t wait to see how the new cards perform and whether AMD can take the lead in new areas.

  • (>_<)

    If AMD can put 4-8 gigs of HBM on a GPU then they can do the same with CPUs as well as APUs. In fact, one interesting quote from the patents listed below is this:

    “Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of “empty” silicon that is available in an interposer.”

    AMD has plans to fill that empty silicon with much more memory.

    The point: REPLACE SYSTEM DYNAMIC RAM WITH ON-DIE HBM 2 OR 3! Cutting the electrical path distance from 4-8 centimeters down to a few millimeters would be worth a couple of clocks of latency. If AMD is building HBM and HBM 2 then they are also building HBM 3 or more!

    Imagine what 64GB of HBM could do for a massive server die such as Zen. The energy savings alone would be worth it, never mind the hugely reduced motherboard size from eliminating sockets and RAM packaging. The increased number of CPUs per blade or motherboard also reduces costs, as servers can become much more dense.

    Most folks now only run 4-8 gigs in their laptops or desktops. Eliminating DRAM and replacing it with HBM is a huge energy and mechanical savings as well as a staggering performance jump, and it destroys DDR5. That process will be very mature in a year and costs will drop. Right now the retail cost of DRAM per GB is about $10; subtract packaging and channel costs and that drops to $5 or less. Adding 4-8GB of HBM has a very cheap material cost; the main expense is likely the process, testing and yields. Balance that against the energy savings and motherboard real estate savings, and HBM replacing system DRAM becomes even more likely, even before counting the massive leap in performance as an added benefit.

    The physical cost savings are quite likely equivalent to the added process cost, since Fiji will likely be released at a very competitive price point.

    AMD is planning on replacing system DRAM with stacked HBM. Here are the patents. They were all published last year and this year with the same inventor, Gabriel H. Loh, and the assignee is of course AMD.

    Stacked memory device with metadata management
    WO 2014025676 A1

    “Memory bandwidth and latency are significant performance bottlenecks in many processing systems. These performance factors may be improved to a degree through the use of stacked, or three-dimensional (3D), memory, which provides increased bandwidth and reduced intra-device latency through the use of through-silicon vias (TSVs) to interconnect multiple stacked layers of memory. However, system memory and other large-scale memory typically are implemented as separate from the other components of the system. A system implementing 3D stacked memory therefore can continue to be bandwidth-limited due to the bandwidth of the interconnect connecting the 3D stacked memory to the other components and latency-limited due to the propagation delay of the signaling traversing the relatively-long interconnect and the handshaking process needed to conduct such signaling. The inter-device bandwidth and inter-device latency have a particular impact on processing efficiency and power consumption of the system when a performed task requires multiple accesses to the 3D stacked memory as each access requires a back-and-forth communication between the 3D stacked memory and thus the inter-device bandwidth and latency penalties are incurred twice for each access.”

    Interposer having embedded memory controller circuitry
    US 20140089609 A1

    “For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of “empty” silicon that is available in an interposer.”

    Die-stacked memory device with reconfigurable logic
    US 8922243 B2

    “Memory system performance enhancements conventionally are implemented in hard-coded silicon in system components separate from the memory, such as in processor dies and chipset dies. This hard-coded approach limits system flexibility as the implementation of additional or different memory performance features requires redesigning the logic, which increases design costs and production costs, as well as limits the broad mass-market appeal of the resulting component. Some system designers attempt to introduce flexibility into processing systems by incorporating a separate reconfigurable chip (e.g., a commercially-available FPGA) in the system design. However, this approach increases the cost, complexity, and size of the system as the system-level design must accommodate for the additional chip. Moreover, this approach relies on the board-level or system-level links to the memory, and thus the separate reconfigurable chip’s access to the memory may be limited by the bandwidth available on these links.”

    Hybrid cache
    US 20140181387 A1

    “Die-stacking technology enables multiple layers of Dynamic Random Access Memory (DRAM) to be integrated with single or multicore processors. Die-stacking technologies provide a way to tightly integrate multiple disparate silicon die with high-bandwidth, low-latency interconnects. The implementation could involve vertical stacking as illustrated in FIG. 1A, in which a plurality of DRAM layers 100 are stacked above a multicore processor 102. Alternately, as illustrated in FIG. 1B, a horizontal stacking of the DRAM 100 and the processor 102 can be achieved on an interposer 104. In either case the processor 102 (or each core thereof) is provided with a high bandwidth, low-latency path to the stacked memory 100.

    Computer systems typically include a processing unit, a main memory and one or more cache memories. A cache memory is a high-speed memory that acts as a buffer between the processor and the main memory. Although smaller than the main memory, the cache memory typically has appreciably faster access time than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memories.”

    Partitionable data bus
    US 20150026511 A1

    “Die-stacked memory devices can be combined with one or more processing units (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Accelerated Processing Units (APUs)) in the same electronics package. A characteristic of this type of package is that it can include, for example, over 1000 data connections (e.g., pins) between the one or more processing units and the die-stacked memory device. This high number of data connections is significantly greater than data connections associated with off-chip memory devices, which typically have 32 or 64 data connections.”

    Non-uniform memory-aware cache management
    US 20120311269 A1

    “Computer systems may include different instances and/or kinds of main memory storage with different performance characteristics. For example, a given microprocessor may be able to access memory that is integrated directly on top of the processor (e.g., 3D stacked memory integration), interposer-based integrated memory, multi-chip module (MCM) memory, conventional main memory on a motherboard, and/or other types of memory. In different systems, such system memories may be connected directly to a processing chip, associated with other chips in a multi-socket system, and/or coupled to the processor in other configurations.

    Because different memories may be implemented with different technologies and/or in different places in the system, a given processor may experience different performance characteristics (e.g., latency, bandwidth, power consumption, etc.) when accessing different memories. For example, a processor may be able to access a portion of memory that is integrated onto that processor using stacked dynamic random access memory (DRAM) technology with less latency and/or more bandwidth than it may a different portion of memory that is located off-chip (e.g., on the motherboard). As used herein, a performance characteristic refers to any observable performance measure of executing a memory access operation.”

    All of this adds up to HBM being placed on-die as a replacement for, or maybe a supplement to, system memory. But why have system DRAM at all if you can build much wider-bandwidth memory closer to the CPU on-die? Unless of course you build socketed HBM DRAM and a completely new system memory bus to feed it.

    Replacing system DRAM with on-die HBM has the same benefits for the performance and energy demand of the system as it has for GPUs. It also makes for smaller motherboards, no memory sockets and no memory packaging.

    Of course this is all speculation. But it also makes sense.