Improving system performance by utilizing heterogeneous memories
As device technologies scale into the nanometer era, off-chip DRAM is approaching physical limits beyond which it cannot scale. To continue memory scaling, vendors are turning to emerging memory technologies such as die-stacked DRAM. Each of these technologies has its own advantages and disadvantages, but none matches the characteristics of conventional DRAM. For example, High Bandwidth Memory (HBM) offers higher bandwidth but lower capacity than DRAM, whereas non-volatile memories offer more capacity but are much slower. With the emergence of such disparate technologies, future memory systems are certain to be heterogeneous, with main memory composed of two or more types of memory.

Meanwhile, the number of cores per processor has risen steadily, while off-chip DRAM bandwidth has not scaled at the same rate, limited mainly by pin count. As a result, the bandwidth available to each core is lower than in past systems, creating the "Bandwidth Wall," where bandwidth per core no longer scales. Memory vendors are therefore introducing HBM to restore bandwidth scaling. The adoption of HBM and other emerging memory technologies opens rich opportunities to explore how heterogeneous main memory systems can be used effectively.

This dissertation presents several ways to use such heterogeneous memory systems effectively, focusing on systems that combine off-chip DRAM with die-stacked DRAM such as HBM. First, hardware-driven and software-driven data management schemes are presented, in which hardware or software explicitly migrates data between the two memories.
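As a rough illustration of such a migration scheme, the snippet below sketches one way a policy could pick a migration granularity from recent spatial-locality measurements. All names, thresholds, and sizes here are hypothetical placeholders, not the dissertation's actual mechanism.

```python
# Hypothetical sketch: choose how much contiguous data to migrate into the
# capacity-constrained die-stacked DRAM based on observed spatial locality.
# BLOCK and the thresholds below are illustrative values, not from the source.

BLOCK = 64  # bytes per cache-block-sized unit of migration


def migration_size(touched_blocks, region_blocks):
    """Return a migration size (bytes) for a region of `region_blocks` blocks,
    given how many of its blocks were recently touched."""
    locality = touched_blocks / region_blocks
    if locality > 0.75:
        # High spatial locality: migrating the whole region amortizes the
        # off-chip traffic and raises the hit rate in the faster memory.
        return region_blocks * BLOCK
    elif locality > 0.25:
        # Moderate locality: migrate only part of the region.
        return (region_blocks // 2) * BLOCK
    # Poor locality: migrating more than one block would waste off-chip
    # bandwidth on data that is unlikely to be reused.
    return BLOCK
```

The point of varying the granularity, as the schemes described here do, is to balance two competing costs: migrating too little lowers the hit rate in the desirable memory, while migrating too much wastes off-chip bandwidth.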
The hardware-managed scheme does not migrate data at a fixed granularity; instead, it migrates a variable amount of data between the two memories depending on the observed memory access characteristics. This approach keeps off-chip memory bandwidth usage low while maintaining a high hit rate in the more desirable memory. Similarly, the software-driven scheme varies the migration granularity without any additional hardware support by monitoring the spatial locality of the running applications. In both solutions, the goal is to migrate just the right amount of data into the capacity-constrained memory, achieving low off-chip bandwidth usage while keeping the hit rate high.

Although the capacity of die-stacked DRAM is not sufficient to meet the demands of a server-class system, it is still non-trivial, ranging from 8 to 16 GB in typical configurations, so a fraction of that storage can be devoted to non-conventional uses such as storing address translations. With the growing deployment of virtual machines for cloud services and server applications, memory address translation has become a major source of performance overhead in virtualized environments. An application executing on a guest OS generates guest virtual addresses that must be translated to host physical addresses. In x86 architectures, both the guest and host page tables use a 4-level radix-tree organization. Translating a virtual address to a physical address takes four memory references on bare metal, whereas in the virtualized case the walk becomes a full two-dimensional translation requiring up to 24 memory accesses. This dissertation presents a method that uses a fraction of the die-stacked DRAM as a very large TLB, eliminating almost all page table walks. This yields a substantial performance improvement in virtualized systems, where address translation accounts for a considerable fraction of execution time.
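The 4 and 24 figures above follow directly from the table depths: in a two-dimensional nested walk, each guest page-table reference is a guest-physical address that itself requires a host walk, and the final guest-physical address requires one more host walk. A small back-of-the-envelope calculation makes this concrete (the function names are ours, for illustration):

```python
# Count page-table memory references for the standard x86-64 4-level
# radix-tree page table, matching the figures stated in the text.

def walk_refs_native(levels=4):
    # Bare metal: one memory reference per page-table level.
    return levels


def walk_refs_virtualized(guest_levels=4, host_levels=4):
    # 2D nested walk: each of the guest levels needs a full host walk to
    # resolve its guest-physical pointer, plus the guest reference itself,
    # plus a final host walk for the translated guest-physical address.
    return guest_levels * host_levels + guest_levels + host_levels
```

With 4-level guest and host tables this gives 4 references natively and 4*4 + 4 + 4 = 24 in the virtualized case, which is why a very large TLB that skips the walk entirely is so valuable.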