Browsing by Subject "Memory system"
Now showing 1 - 5 of 5

Item: Designing systems for emerging memory technologies (2018-08-01)
Kwon, Youngjin, Ph.D.; Witchel, Emmett; Peter, Simon, Ph.D.; Anderson, Thomas; Rossbach, Christopher J.

Emerging memory technologies pose two new challenges for system software: diversity and large capacity. Non-volatile memory (NVM) technologies will offer excellent performance, byte-addressability, and large capacity, blurring the line between traditional volatile DRAM and non-volatile storage. At the same time, NVM diverges from DRAM in significant ways, such as limited write bandwidth. The future storage market will likely be diverse, spanning DRAM, NVM, SSDs, and hard disks. Unfortunately, current file systems, built on old design ideas, cannot take advantage of these different storage media efficiently. Strata is a cross-media file system that fundamentally redesigns the file system to leverage the strengths of each storage technology while compensating for its weaknesses.

Modern applications such as large-scale machine learning and graph analytics load huge datasets into memory for fast computation. For these workloads, merely adding more RAM to a machine reaches a point of diminishing returns because their poor spatial locality makes them suffer high virtual-to-physical address translation costs. NVM will make this problem worse because it offers a lower cost per unit of capacity than DRAM. Ingens, an efficient memory management system, addresses the shortcomings in modern operating systems and hypervisors that underlie these excessive address translation overheads, redesigning huge page memory management so that huge pages can be widely used in practice.
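Ingens itself lives inside the OS and hypervisor, but the mechanism it manages, huge pages, is visible from user space on Linux. The following is a minimal sketch of the application-side knob only, not of Ingens' policy; it assumes Linux and CPython 3.8+, where mmap.madvise and the mmap.MADV_HUGEPAGE constant are available:

```python
# Minimal sketch (assumes Linux, CPython >= 3.8): ask the kernel to back a
# large anonymous mapping with transparent huge pages. This shows only the
# application-visible knob whose kernel-side management Ingens redesigns.
import mmap

SIZE = 1 << 28  # 256 MiB working set with poor spatial locality

buf = mmap.mmap(-1, SIZE)          # private anonymous mapping
buf.madvise(mmap.MADV_HUGEPAGE)    # hint: use 2 MiB pages for this range

# Touch one byte per 4 KiB base page. With 2 MiB huge pages, each TLB
# entry covers 512x more memory, so sparse accesses miss far less often.
for off in range(0, SIZE, 4096):
    buf[off] = 1
```

That 512x coverage multiplier per TLB entry is the translation-cost headroom the abstract describes; Ingens' contribution is making the OS deliver it reliably rather than requiring applications to ask.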

Item: DRAM-aware prefetching and cache management (2010-12)
Lee, Chang Joo, 1975-; Patt, Yale N.; Touba, Nur A.; Chiou, Derek; Namazi, Hossein; Mutlu, Onur

Main memory system performance is crucial for high-performance microprocessors. Even though the peak bandwidth of main memory systems has increased through improvements in the microarchitecture of Dynamic Random Access Memory (DRAM) chips, conventional on-chip memory systems do not fully take advantage of it. The result is underutilization of the DRAM system: many idle cycles on the DRAM data bus. The main reason is that conventional on-chip memory system designs do not take important DRAM characteristics into account, so the high bandwidth of DRAM-based main memory cannot be realized and exploited by the processor. This dissertation identifies three major characteristics that can significantly affect DRAM performance and makes a case for DRAM characteristic-aware on-chip memory system design. We show that on-chip memory resource management policies (such as prefetching, buffer, and cache policies) that are aware of these DRAM characteristics can significantly enhance overall system performance. The key idea of the proposed mechanisms is to send the DRAM system useful memory requests that can be serviced with low latency or in parallel with other requests, rather than requests that are serviced with high latency or serially. Our evaluations demonstrate that each of the proposed DRAM-aware mechanisms significantly improves performance by increasing DRAM utilization for useful data. We also show that, when employed together, the mechanisms' benefits are additive: they work synergistically and significantly improve the overall system performance of both single-core and Chip Multiprocessor (CMP) systems.
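A toy model makes the "low latency or in parallel" idea concrete. The sketch below is our own illustration, not the dissertation's mechanism; the latencies are made up, and the policy is a standard FR-FCFS-style ordering (row-buffer hits first, then oldest first) compared against plain first-come-first-served:

```python
# Toy model of why DRAM-aware request ordering matters: a row-buffer hit
# costs far less than an access that must open a new row. Latencies are
# illustrative; any non-hit is simplified to pay the full conflict cost.
from collections import namedtuple

Request = namedtuple("Request", "arrival bank row")

T_HIT, T_CONFLICT = 15, 45  # illustrative DRAM cycles

def fr_fcfs(queue, open_rows):
    """Prefer a row-buffer hit; otherwise serve the oldest request."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    return min(hits or queue, key=lambda r: r.arrival)

def fcfs(queue, open_rows):
    return min(queue, key=lambda r: r.arrival)

def total_latency(requests, policy):
    queue, open_rows, t = list(requests), {}, 0
    while queue:
        r = policy(queue, open_rows)
        queue.remove(r)
        t += T_HIT if open_rows.get(r.bank) == r.row else T_CONFLICT
        open_rows[r.bank] = r.row  # the accessed row stays open
    return t

reqs = [Request(0, 0, 1), Request(1, 0, 2), Request(2, 0, 1)]
print(total_latency(reqs, fcfs))     # 135: every access opens a new row
print(total_latency(reqs, fr_fcfs))  # 105: the two row-1 accesses pair up
```

Serving the two accesses to row 1 back to back while the row is open is exactly the "serviced with low latency" case the abstract names; the dissertation extends this awareness from the memory controller into the on-chip prefetch, buffer, and cache policies that generate the requests.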

Item: Exploiting long-term behavior for improved memory system performance (2016-08)
Jain, Akanksha; Lin, Yun Calvin; Burger, Doug; Fussell, Donald S.; Patt, Yale N.; Pingali, Keshav

Memory latency is a key bottleneck for many programs. Caching and prefetching are two popular hardware mechanisms for alleviating the impact of long memory latencies, but despite decades of research, significant headroom remains. In this thesis, we show how caching and prefetching can be significantly improved by exploiting a long history of the program's behavior. Towards this end, we define new learning goals that fully exploit long-term information, and we propose history representations that make it feasible to track and manipulate long histories. Based on these insights, we advance the state of the art for three important memory system optimizations. For cache replacement, where existing solutions have relied on simplistic heuristics, our solution pursues the new goal of learning from the optimal solution for past references to predict caching decisions for future references. For irregular prefetching, where previous solutions are limited in scope by their inefficient management of long histories, our goal is to realize the previously unattainable combination of two popular learning techniques, namely address correlation and PC-localization. Finally, for regular prefetching, where recent solutions learn increasingly complex patterns, we leverage long histories to simplify the learning goal and to produce more timely and accurate prefetches. Our results are significant. For cache replacement, our solution reduces misses for memory-intensive SPEC 2006 benchmarks by 17.4%, compared to 11.4% for the previous best. For irregular prefetching, our prefetcher obtains a 23.1% speedup (vs. 14.1% for the previous best) with 93.7% accuracy, and it comes close to the performance of an idealized prefetcher with no resource constraints. Finally, for regular prefetching, our prefetcher improves performance by 102.3% over a baseline with no prefetching, compared to a 90% speedup for the previous state-of-the-art prefetcher; our solution also incurs 10% less traffic than the previous best regular prefetcher.
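The cache-replacement learning goal, learning from what the optimal policy would have done on past references, can be made concrete with Belady's MIN algorithm, which evicts the block whose next use is farthest in the future. The sketch below is a simplified offline oracle over a toy trace; the function and structures are ours for illustration, whereas the thesis's actual mechanism must reconstruct such decisions for past references in hardware:

```python
# Sketch of the learning target: simulate Belady's optimal replacement
# (on a miss, evict the cached block reused farthest in the future) over
# a past access trace, yielding the hit/miss outcomes a predictor can
# then be trained to imitate.
def belady_outcomes(trace, capacity):
    # next_use[i] = index of the next access to the block at trace[i]
    last, next_use = {}, [float("inf")] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last.get(trace[i], float("inf"))
        last[trace[i]] = i

    cache, outcomes = {}, []  # block -> index of its next use
    for i, block in enumerate(trace):
        if block in cache:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
        cache[block] = next_use[i]
    return outcomes

trace = ["A", "B", "C", "A", "B", "D", "A", "B"]
print(belady_outcomes(trace, capacity=2))
# ['miss', 'miss', 'miss', 'hit', 'miss', 'miss', 'hit', 'miss']
```

These outcomes define, per reference, whether the optimal policy would have cached the block, which is a far sharper training signal than the recency and frequency heuristics that prior replacement policies relied on.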

Item: Fair and high performance shared memory resource management (2011-12)
Ebrahimi, Eiman; Patt, Yale N.; Touba, Nur A.; Pingali, Keshav; Chiou, Derek; Mutlu, Onur

Chip multiprocessors (CMPs) commonly share a large portion of memory system resources among cores. Since memory requests from threads executing on different cores interfere significantly with one another in these shared resources, the design of the shared memory subsystem is crucial for achieving high performance and fairness. Inter-thread memory system interference has different implications depending on the type of workload running on a CMP. In multi-programmed workloads, different applications can experience significantly different slowdowns. Left uncontrolled, large disparities in slowdowns result in low system performance and make the system software's priority-based thread scheduling policies ineffective. In a single multi-threaded application, memory system interference between threads of the same application can slow each thread down significantly. Most importantly, the critical path of execution can also be slowed down significantly, increasing application execution time. This dissertation proposes three mechanisms that address different shortcomings of current shared resource management techniques targeted at multi-programmed workloads, and one mechanism that speeds up a single multi-threaded application by managing main-memory-related interference between its threads. For multi-programmed workloads, the key idea is that both demand- and prefetch-caused inter-application interference should be taken into account in shared resource management techniques across the entire shared memory system. Our evaluations demonstrate that doing so significantly improves both system performance and fairness compared to the state of the art. When executing a single multi-threaded application on a CMP, the key idea is to take the inter-dependence of threads into account in memory scheduling decisions. Our evaluation shows that doing so significantly reduces the application's execution time compared to state-of-the-art memory schedulers designed for multi-programmed workloads. This dissertation concludes that the performance and fairness of CMPs can be significantly improved by better management of inter-thread interference in the shared memory resources, for both multi-programmed workloads and multi-threaded applications.

Item: Practical irregular prefetching (2020-08-13)
Wu, Hao (Ph.D. in computer science); Lin, Yun Calvin; Jain, Akanksha; Fussell, Don; Rossbach, Christopher J.; Sunwoo, Dam

Memory accesses continue to be a performance bottleneck for many programs, and prefetching is an effective and widely used method for alleviating the memory bottleneck. However, prefetching is difficult for irregular workloads, whose accesses exhibit no clear patterns, such as sequential or strided streams, that the hardware can detect. For irregular workloads, one promising approach is temporal prefetching, which memorizes temporal correlations observed among past accesses and uses them to predict future memory accesses. Storing these correlations requires megabytes of metadata, which cannot feasibly be kept on-chip. As a result, previous temporal prefetchers store metadata off-chip in DRAM, which complicates hardware implementation, increases DRAM latency, and adds DRAM traffic overhead. For example, the STMS prefetcher proposed by Wenisch et al. incurs 3.42x DRAM traffic overhead for irregular SPEC2006 workloads. These problems have made previous temporal prefetchers impractical to implement in commercial hardware. In this thesis, we propose three methods that alleviate temporal prefetching's metadata storage problem and make it practical in hardware. First, we propose MISB, a new scheme that uses a metadata prefetcher to manage on-chip metadata. With only one-fifth the traffic overhead of STMS, MISB achieves a 22.7% performance speedup over a baseline with no prefetching, compared to 10.6% for an idealized STMS and 4.5% for a realistic ISB. Second, we present Triage, the first temporal prefetcher to store its entire metadata on chip, which reduces hardware complexity and DRAM traffic by re-purposing part of the last-level cache to store metadata. Triage reduces traffic by 60% compared to MISB and achieves a 13.9% performance speedup over a baseline with no prefetching. In a bandwidth-constrained 8-core environment, Triage achieves an 11.4% speedup compared to 8.0% for MISB. Third, we present a new resource management scheme for Triage's on-chip metadata. This scheme integrates ISB's compressed metadata representation and makes several further improvements. For irregular benchmarks, it reduces the on-chip metadata storage requirement by 38% and achieves a 29.6% speedup compared to Triage's 25.3%.
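For readers new to temporal prefetching, the toy sketch below shows the bare address-correlation idea that ISB-style prefetchers, and hence MISB and Triage, build on: remember which miss followed which, and replay the recorded successors when an address misses again. The class and its structures are invented for illustration, and the sketch deliberately ignores the megabytes-of-metadata problem that is the thesis's actual subject:

```python
# Toy address-correlation prefetcher: record the successor of each miss
# address, and on a repeated miss replay the recorded successors. Real
# designs (STMS, MISB, Triage) differ in where and how this metadata is
# stored and in using PC-localization; none of that is modeled here.
from collections import defaultdict

class ToyTemporalPrefetcher:
    def __init__(self, degree=2):
        self.successors = defaultdict(list)  # addr -> miss addrs seen next
        self.prev_miss = None
        self.degree = degree                 # max prefetches per miss

    def on_miss(self, addr):
        # Learn: the current miss is the successor of the previous one.
        if self.prev_miss is not None:
            self.successors[self.prev_miss].append(addr)
        self.prev_miss = addr
        # Predict: replay the most recently recorded successors of addr.
        return self.successors[addr][-self.degree:]

pf = ToyTemporalPrefetcher()
for a in [1, 7, 3, 9, 1, 7, 3, 9]:    # irregular but recurring miss stream
    print(a, "->", pf.on_miss(a))     # second pass predicts each successor
```

On the second pass over the recurring miss sequence, every miss predicts its recorded successor even though the addresses follow no arithmetic pattern; scaling that table to realistic programs is what forces megabytes of metadata, and managing it cheaply is the contribution of MISB, Triage, and the final resource management scheme above.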