Browsing by Subject "Cache"
Now showing 1 - 8 of 8
Item Accelerating deep learning training : a storage perspective (2021-12-01) Mohan, Jayashree; Chidambaram, Vijay; Phanishayee, Amar; Witchel, Emmett; Rossbach, Christopher J.; Krahenbuhl, Philipp

Deep Learning, specifically Deep Neural Networks (DNNs), is stressing storage systems in new ways, moving the training bottleneck to the data pipeline (fetching and pre-processing data, and writing checkpoints) rather than to computation at the GPUs; this leaves the expensive accelerator devices stalled for data. While prior research has explored different ways of accelerating DNN training time, the impact of storage systems, specifically the data pipeline, on ML training has been relatively unexplored. In this dissertation, we study the role of the data pipeline in various training scenarios and, based on the insights from our study, we present the design and evaluation of systems that accelerate training. We first present a comprehensive analysis of how the storage subsystem affects the training of widely used DNN models by building a tool, DS-Analyzer. Our study reveals that in many cases, DNN training time is dominated by data stalls: time spent waiting for data to be fetched from (or written to) storage and pre-processed. We then describe CoorDL, a user-space data loading library that addresses data stalls in dedicated single-user servers with fixed resource capacities. Next, we design and evaluate Synergy, a workload-aware scheduler for shared GPU clusters that mitigates data stalls by allocating auxiliary resources like CPU and memory cognizant of workload requirements. Finally, we present CheckFreq, a framework that frequently writes model state to storage (checkpoints) for fault tolerance, thereby reducing wasted GPU work on job interruptions while also minimizing stalls due to checkpointing. Our dissertation shows that data stalls squander the improved performance of faster GPUs. It further demonstrates that an efficient data pipeline is critical to speeding up end-to-end training, by building and evaluating systems that mitigate data stalls in several training scenarios.
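To make the data-stall idea concrete, here is a minimal, illustrative Python sketch (not taken from the dissertation) that times the data-loading and compute portions of each training iteration to estimate the fraction of wall-clock time the accelerator spends stalled on the data pipeline. The loader and step functions are hypothetical stand-ins.

    import itertools
    import time

    def measure_data_stalls(batches, train_step, num_iters=100):
        """Estimate the fraction of each iteration spent waiting on the data
        pipeline (fetch + pre-process) versus computing on the accelerator."""
        fetch_time, compute_time = 0.0, 0.0
        it = iter(batches)
        for _ in range(num_iters):
            t0 = time.perf_counter()
            batch = next(it)          # fetch + pre-process (data pipeline)
            t1 = time.perf_counter()
            train_step(batch)         # forward/backward pass (compute)
            t2 = time.perf_counter()
            fetch_time += t1 - t0
            compute_time += t2 - t1
        return fetch_time / (fetch_time + compute_time)

    # Hypothetical stand-ins: a "slow" loader and a "fast" compute step.
    def slow_loader():
        for i in itertools.count():
            time.sleep(0.004)         # pretend fetch + decode takes 4 ms
            yield [i] * 32

    def fast_step(batch):
        time.sleep(0.006)             # pretend the GPU step takes 6 ms
        return sum(batch)

    if __name__ == "__main__":
        frac = measure_data_stalls(slow_loader(), fast_step, num_iters=50)
        print(f"estimated data-stall fraction: {frac:.0%}")

In this toy setup the loader contributes roughly 40% of each iteration, which would show up as GPU idle time in a real training run.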
Item Efficient fine-grained virtual memory (2018-05) Zheng, Tianhao, Ph. D.; Erez, Mattan; Reddi, Vijay Janapa; Tiwari, Mohit; Lin, Calvin; Peter, Simon

Virtual memory in modern computer systems provides a single abstraction of the memory hierarchy. By hiding fragmentation and overlays of physical memory, virtual memory frees applications from managing physical memory and improves programmability. However, virtual memory often introduces noticeable overhead. State-of-the-art systems use paged virtual memory that maps virtual addresses to physical addresses at page granularity (typically 4 KiB). This mapping is stored as a page table. Before accessing physically addressed memory, the page table is accessed to translate virtual addresses to physical addresses. Research shows that the overhead of accessing the page table can even exceed the execution time for some important applications. In addition, this fine-grained mapping changes the access patterns between virtual and physical address spaces, introducing difficulties for many architectural techniques, such as caches and prefetchers. In this dissertation, I propose architectural mechanisms to reduce the overhead of accessing and managing fine-grained virtual memory without compromising existing benefits. There are three main contributions in this dissertation. First, I investigate the impact of address translation on caches. I examine the restriction that fine-grained paging places on virtually indexed, physically tagged (VIPT) caches and conclude that this restriction may lead to sub-optimal cache designs. I introduce a novel cache strategy, speculatively indexed, physically tagged (SIPT), to enable flexible cache indexing under fine-grained page mapping. SIPT speculates on the value of a few additional index bits (1-3 in our experiments) to access the cache before translation completes, and then verifies that the physical tag matches after translation. Exploiting the fact that a simple relation generally exists between virtual and physical addresses, because memory allocators often exhibit contiguity, I also propose low-cost mechanisms to predict and correct potential mis-speculations. Next, I focus on reducing the overhead of address translation for fine-grained virtual memory. I propose a novel architectural mechanism, Embedded Page Translation Information (EMPTI), to provide general fine-grained page translation information on top of coarse-grained virtual memory. EMPTI does so by speculating that a virtual address is mapped to a pre-determined physical location and then verifying the translation with a very low-cost access to metadata embedded with the data. Coarse-grained virtual memory mechanisms (e.g., segmentation) are used to suggest the pre-determined physical location for each virtual page. Overall, EMPTI achieves the benefits of low-overhead translation while keeping the flexibility and programmability of fine-grained paging. Finally, I improve the efficiency of metadata caching based on the fact that memory mapping contiguity generally extends beyond a page boundary. In state-of-the-art architectures, caches treat PTEs (page table entries) as regular data. Although this is simple and straightforward, it fails to maximize the storage efficiency of metadata. Each page in a contiguously mapped region costs a full 8-byte PTE, even though the delta between virtual and physical addresses remains the same and most metadata are identical. I propose a novel microarchitectural mechanism that expands the effective PTE storage in the last-level cache (LLC) and reduces the number of page-walk accesses that miss the LLC.
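As a rough illustration of the SIPT idea (a simplified software model, not the dissertation's hardware design), the sketch below speculatively uses the virtual-address bits just above the page offset as extra cache index bits and checks the guess once the translation is known. The parameters (4 KiB pages, 2 extra index bits) and the translation function are assumptions for the example.

    PAGE_BITS = 12          # 4 KiB pages: bits [11:0] are untranslated
    EXTRA_INDEX_BITS = 2    # index bits taken from just above the page offset

    def extra_index(addr):
        """Extract the cache index bits that lie above the page offset."""
        return (addr >> PAGE_BITS) & ((1 << EXTRA_INDEX_BITS) - 1)

    def sipt_lookup(vaddr, translate):
        """Speculate the extra index bits from the virtual address, then
        verify them once the physical address is available."""
        guess = extra_index(vaddr)     # speculative index (before translation)
        paddr = translate(vaddr)       # in hardware this overlaps the cache access
        actual = extra_index(paddr)    # correct index (after translation)
        if guess == actual:
            return paddr, "hit path: speculative set was correct"
        return paddr, "mis-speculation: re-access the correct set"

    if __name__ == "__main__":
        # Contiguous allocations often keep these bits equal across VA and PA,
        # so the speculation is usually right; this toy translation preserves them.
        identity_low = lambda va: (0x5 << 20) | (va & ((1 << 14) - 1))
        print(sipt_lookup(0x3A40, identity_low))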
Item Hardware transactional memory : a systems perspective (2009-08) Rossbach, Christopher John; Witchel, Emmett

The increasing ubiquity of chip multiprocessor machines has made the need for accessible approaches to parallel programming all the more urgent. The current state of the art, based on threads and locks, requires the programmer to use mutual exclusion to protect shared resources, enforce invariants, and maintain consistency constraints. Despite decades of research effort, this approach remains fraught with difficulty. Lock-based programming is complex and error-prone, largely due to well-known problems such as deadlock, priority inversion, and poor composability. Tradeoffs between performance and complexity for locks remain unattractive. Coarse-grain locking is simple but introduces artificial sharing and needless serialization, and yields poor performance. Fine-grain locking can address these issues, but at a significant cost in complexity and maintainability. Transactional memory has emerged as a technology with the potential to address this need for better parallel programming tools. Transactions provide the abstraction of isolated, atomic execution of critical sections. The programmer specifies regions of code that access shared data, and the system is responsible for executing that code in a way that is isolated and atomic. The programmer need not reason about locks and threads. Transactional memory removes many of the pitfalls of locking: transactions are livelock- and deadlock-free and may be composed freely. Hardware transactional memory, which is the focus of this thesis, provides an efficient implementation of the TM abstraction. This thesis explores several key aspects of supporting hardware transactional memory (HTM): operating system support and integration; architectural, design, and implementation considerations; and programmer-transparent techniques to improve HTM performance in the presence of contention. Using and supporting HTM in an OS requires innovation in both the OS and the architecture, but enables practical approaches and solutions to some long-standing OS problems. Innovations in transactional cache coherence protocols enable HTM support in the presence of multi-level cache hierarchies, enable rich HTM semantics such as suspend/resume and multiple transactions per thread context, and can provide the building blocks for flexible contention management policies without the need to trap to software handlers. We demonstrate a programmer-transparent hardware technique for using dependences between transactions to commit conflicting transactions, and suggest techniques to allow conflicting transactions to avoid performance-sapping restarts without using heuristics such as backoff. Both mechanisms yield better performance for workloads that have significant write-sharing. Finally, in the context of the MetaTM HTM model, this thesis contributes a high-fidelity cross-design comparison of representative proposals from the literature: the result is a comprehensive exploration of the HTM design space that compares the behavior of models of MetaTM (70, 75), LogTM (58, 94), and Sun's Rock (22).
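To illustrate the transactional abstraction the abstract describes (this is a toy software model with read/write-set conflict detection, not the hardware mechanisms from the thesis), the sketch below buffers writes per transaction and detects conflicts at commit time.

    class Transaction:
        """Toy model of atomic, isolated execution: reads are tracked, writes
        are buffered, and conflicts are detected at commit time."""
        def __init__(self, memory):
            self.memory = memory          # shared dict: address -> value
            self.read_set = set()
            self.write_buffer = {}        # buffered (speculative) writes

        def read(self, addr):
            self.read_set.add(addr)
            return self.write_buffer.get(addr, self.memory.get(addr, 0))

        def write(self, addr, value):
            self.write_buffer[addr] = value

        def conflicts_with(self, other):
            """Conflict if the other transaction wrote something we read or wrote."""
            mine = self.read_set | set(self.write_buffer)
            return bool(set(other.write_buffer) & mine)

        def commit(self, concurrent):
            if any(self.conflicts_with(t) for t in concurrent):
                return False              # abort: a conflicting writer exists
            self.memory.update(self.write_buffer)
            return True

    if __name__ == "__main__":
        mem = {"x": 0}
        t1, t2 = Transaction(mem), Transaction(mem)
        t1.write("x", t1.read("x") + 1)   # both increment x concurrently
        t2.write("x", t2.read("x") + 1)
        print(t1.commit(concurrent=[]))      # True: x becomes 1
        print(t2.commit(concurrent=[t1]))    # False: conflict with t1, must retry

The programmer only expresses the atomic region; detecting the conflict and forcing the retry is the system's job, which is exactly what an HTM does in hardware.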
Item Memory-subsystem resource management for the many-core era (2011-05) Kaseridis, Dimitrios; John, Lizy Kurian; Touba, Nur A.; Chiou, Derek; Holt, Jim; Gratz, Paul V.

As semiconductor technology continues to scale lower in the nanometer era, the communication between processor and main memory has been particularly challenged. The well-studied frequency, memory, and power "walls" have redirected architects towards utilizing Chip Multiprocessors (CMPs) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads. The ever-increasing access latency difference between processor and memory, the gradually increasing memory bandwidth demands on main memory, and the decreasing cache capacity available to each core due to multi-core integration have made the need for efficient memory-subsystem resource management more critical than ever before. This dissertation focuses on managing the sharing of Last-Level Cache (LLC) capacity and main memory bandwidth, as the two most important resources that significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all three basic requirements for implementing a resource management scheme, namely: a) profiling mechanisms to capture applications' resource requirements, b) microarchitecture mechanisms to enforce a resource allocation scheme, and c) resource allocation algorithms/policies to manage the available memory resources throughout the whole memory hierarchy of a CMP system. To achieve these targets, the dissertation first describes a set of low-overhead, non-invasive profiling mechanisms that are able to project applications' memory resource requirements and memory sharing behavior. Two memory resource partitioning schemes are then presented. The first, the Bank-aware dynamic partitioning scheme, provides a low-overhead solution for partitioning cache resources of large CMP architectures that are based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with current industry trends. The second, the Bandwidth-aware dynamic scheme, presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems. The scheme seeks optimizations both within and outside single CMP chips, aiming at overall system throughput and efficiency improvements. As cache partitioning schemes with isolated partitions impose a set of restrictions on the use of the last-level cache, which can severely affect the performance of large CMP designs, this dissertation presents a Quasi-partitioning scheme that breaks such restrictions while providing most of the benefits of cache partitioning schemes. The presented solution is able to scale efficiently to a significantly larger number of cores than previously described schemes based on isolated partitions can achieve. Finally, as the memory controller is one of the fundamental components of the memory subsystem, a well-designed memory-subsystem resource management scheme needs to carefully utilize the memory controller resources and coordinate its functionality with the operation of the main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based memory controller request priority scheme. The scheme ranks demand read and prefetch operations based on their latency sensitivity, while coordinating its operation with the DRAM page-mode policy and the memory data prefetcher.
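As an illustration of the kind of allocation policy such schemes rely on (a generic utility-based sketch, not the Bank-aware or Bandwidth-aware algorithms themselves), the code below greedily hands out LLC ways to whichever application gains the most additional hits per extra way, using hypothetical per-application miss curves as the profiling input.

    def partition_ways(miss_curves, total_ways):
        """Greedy marginal-utility allocation of cache ways.
        miss_curves[app][w] = misses of application app when given w ways."""
        alloc = {app: 1 for app in miss_curves}          # at least one way each
        for _ in range(total_ways - len(alloc)):
            def gain(app):
                w = alloc[app]
                return miss_curves[app][w] - miss_curves[app][w + 1]
            best = max(alloc, key=gain)                  # biggest miss reduction
            alloc[best] += 1
        return alloc

    if __name__ == "__main__":
        # Hypothetical miss curves (index = number of ways; index 0 unused).
        curves = {
            "streaming": [0, 100, 99, 98, 97, 96, 95, 94, 93],
            "cache_friendly": [0, 90, 60, 40, 25, 15, 10, 8, 7],
        }
        print(partition_ways(curves, total_ways=8))
        # The cache-friendly workload receives most of the ways; the streaming
        # workload, which barely benefits from extra capacity, gets the minimum.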
Item Mitigating bank conflicts in main memory via selective data duplication and migration (2021-05-07) Lin, Ching-Pei; Patt, Yale N.; Chiou, Derek; Erez, Mattan; Witchel, Emmett; Wilkerson, Chris

Main memory is organized as a hierarchy of banks, rows, and columns. Only data from a single row can be accessed from each bank at any given time. Switching between different rows of the same bank requires serializing long-latency operations to the bank. Consequently, memory performance suffers from bank conflicts when concurrent requests access different rows of the same bank. Many prior solutions to the bank conflict problem required modifications to the memory device and/or the memory access protocol. Such modifications create hurdles for adoption due to the commodity nature of the memory business. Instead, I propose two new runtime solutions that work with unmodified memory devices and access protocols. The first, Duplicon Cache, duplicates select data to multiple banks, allowing duplicated data to be sourced from either the original bank or the alternate bank, whichever is more lightly loaded. The second, Continuous Row Compaction, identifies data that are frequently accessed together, then migrates them to non-conflicting rows across different banks. To limit the data transfer overhead from data duplication and migration, only select data are duplicated/migrated. The key is to identify large working sets of the running applications that remain stable over very long time intervals, and to slowly duplicate/migrate them over time, amortizing the cost of duplication/migration. In effect, the set of duplicated/migrated data forms a cache within main memory that captures the large, stable working sets of the application.
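A minimal sketch of the Duplicon Cache read-steering decision (an illustrative model with assumed structures, not the actual hardware): duplicated addresses carry a second bank mapping, and each read is steered to whichever candidate bank currently has the shorter request queue.

    NUM_BANKS = 8

    def home_bank(addr):
        """Default address-to-bank mapping (simple interleaving)."""
        return addr % NUM_BANKS

    def pick_bank(addr, duplicate_map, bank_queue_len):
        """Source a read from the home bank or, for duplicated data,
        from the alternate bank, whichever is more lightly loaded."""
        home = home_bank(addr)
        alt = duplicate_map.get(addr)        # None if this line is not duplicated
        if alt is None:
            return home
        return home if bank_queue_len[home] <= bank_queue_len[alt] else alt

    if __name__ == "__main__":
        queues = [5, 0, 7, 1, 2, 2, 3, 4]    # hypothetical per-bank queue depths
        dup = {0x1000: 1}                    # address 0x1000 also lives in bank 1
        print(pick_bank(0x1000, dup, queues))   # home bank 0 is busy, so choose 1
        print(pick_bank(0x2002, dup, queues))   # not duplicated: home bank 2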
Item Mitigating DRAM complexities through coordinated scheduling policies (2011-05) Stuecheli, Jeffrey Adam; John, Lizy Kurian; Ambler, Tony; Erez, Mattan; Swartzlander, Earl; Zhang, Lixin

Contemporary DRAM systems have maintained impressive scaling by managing a careful balance between performance, power, and storage density. In achieving these goals, a significant sacrifice has been made in DRAM's operational complexity. To realize good performance, systems must properly manage the significant number of structural and timing restrictions of the DRAM devices. DRAM's efficient use is further complicated in many-core systems, where the memory interface has to be shared among multiple cores/threads competing for memory bandwidth. In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs and are mostly oblivious to the main memory. This work demonstrates that the era of many-core architectures has created new main memory bottlenecks and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this work demonstrates that the performance-limiting effects of highly-threaded architectures combined with complex DRAM operation can be overcome. This work shows that, with awareness of the physical main memory layout and a focus on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. The use of the "Page-Mode" feature of DRAM devices can mitigate many DRAM constraints. Current open-page policies attempt to garner the highest level of page hits. In an effort to achieve this, such greedy schemes map sequential address sequences to a single DRAM resource. This non-uniform resource usage pattern introduces high levels of conflict when multiple workloads in a many-core system map to the same set of resources. This work presents a scheme that provides a careful balance between the benefits (increased performance and decreased power) and the detractors (unfairness) of page-mode accesses. In the proposed Minimalist approach, the system targets "just enough" page-mode accesses to garner page-mode benefits while avoiding system unfairness. This is accomplished with the use of a fair memory hashing scheme to control the maximum number of page-mode hits. High-density memory is becoming ever more important as many execution streams are consolidated onto single-chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM's per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This work shows how currently employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the industry-standard DRAM memory specifications. The proposed mechanisms are shown to mitigate much of the performance penalty seen with dense DRAM devices. In summary, this work presents a significant improvement in the ability to exploit the capabilities of high-density, high-frequency DRAM devices in a many-core environment. This is accomplished through coordination of previously disparate system components, exploiting their integration into highly integrated system designs.
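To illustrate the flavor of address mapping a "just enough page-mode" policy relies on (a generic sketch with assumed parameters, not the dissertation's exact hashing scheme), the mapping below keeps only a short run of consecutive cache lines in one bank before moving a sequential stream to the next bank, capping how long any one workload can monopolize a row buffer.

    LINE_BITS = 6        # 64 B cache lines
    RUN_LINES = 4        # page-mode run length allowed per bank (assumed)
    NUM_BANKS = 8

    def bank_of(addr):
        """Map every RUN_LINES consecutive cache lines to one bank, then move
        the stream to the next bank, limiting page-mode hits per bank."""
        line = addr >> LINE_BITS
        return (line // RUN_LINES) % NUM_BANKS

    if __name__ == "__main__":
        banks = [bank_of(i * 64) for i in range(32)]    # a sequential stream
        print(banks)   # 0,0,0,0, 1,1,1,1, 2,2,2,2, ...: runs of 4 per bank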
Item Nearly free resilient memory architectures that balance resilience, performance, and cost (2017-08-29) Kim, Dong Wan; Erez, Mattan; Touba, Nur A.; Fussell, Donald S.; Reddi, Vijay Janapa; Tsai, Timothy K.

Memory reliability has been a major design constraint for mission-critical and large-scale systems for many years. Continued innovation is still necessary because the rate of faults, and the errors they lead to, grows with system size, and because some faults become more likely as fabrication technology advances. Furthermore, recent field studies have shown that more severe permanent/intermittent and multi-bit faults are roughly as frequent as single-bit and transient ones. Therefore, strong error checking and correcting (ECC) schemes that can correct multi-bit errors have been developed and are in use. However, using ECC to correct the numerous recurring errors from permanent faults forces a trade-off between cost, performance, and reliability. First, a permanent fault is likely to result in numerous erroneous accesses, each requiring possibly high correction overhead. Second, once redundancy is used for correction, further errors may go uncorrected, leading to data loss, which is called a detected uncorrectable error (DUE), or worse, go undetected and result in silent data corruption (SDC). Stronger ECC can be used to tolerate more errors, but at higher overhead. The straightforward solution to this issue of repeated costly corrections and reduced coverage is to replace faulty memory devices; however, doing so is expensive and requires either increased system downtime or increased storage and bandwidth overheads. An economical alternative is to retire and possibly remap just the faulty memory regions. Existing retirement techniques, however, either require sophisticated software support; impact capacity, reliability, and/or performance; or introduce additional storage and hardware structures. Implementing a strong ECC scheme such as Single Device Data Correction (SDDC) ECC (or chipkill-level ECC) is typically expensive in terms of storage and complexity. It is even challenging to implement SDDC-level ECC in emerging high-bandwidth memories such as HBM2. This is because a single ECC codeword is transferred from one memory device in HBM2, for instance, and thus simply adding a redundant device results in high overhead in storage, energy, and bandwidth. Such wide-data-width memories are, however, widely used in graphics processing units (GPUs) and Intel's Xeon Phi processors (e.g., Knights Landing) to exploit high memory bandwidth. As GPUs are popular for building large-scale high-performance systems, improving the resilience of GPU memory systems has been an important issue, but current GPU memory ECC is limited to single-bit error correction, double-bit error detection (SECDED). To further improve GPU memory resilience, therefore, multiple techniques must be coordinated. One interesting addition is a software-driven memory repair technique, which retires affected pages with virtual memory support. This approach reduces the risk of uncorrected or even undetected memory errors in the future. However, retiring pages only after errors are observed may not be enough to avoid a DUE when the underlying ECC code is weak; susceptible memory blocks need to be retired proactively to avoid the potential threat of a system failure in the future. In this dissertation, I develop strong, low-cost memory fault tolerance mechanisms that improve both system resilience and availability without wasting resources, increasing memory access complexity, or compromising the fault tolerance of existing resilience schemes. I first identify two interesting characteristics of DRAM failures in the field. First, permanent faults are as frequent as transient faults. Second, most faults affect small memory regions. Based on this analysis of DRAM failure patterns, I propose and evaluate two novel hardware-only memory repair mechanisms that improve memory system reliability significantly without compromising performance or increasing overhead. I also develop a strong, low-cost GPU memory fault tolerance mechanism based on three insights. First, ECC for GPUs should not interfere with the high-bandwidth DRAM system that uses possibly just one DRAM device for each data access. The second insight is that the way in which GPUs are used as accelerators offers unique opportunities to tolerate even severe memory errors without relying on SDDC ECC. The third insight is that nearly all permanent memory faults can be repaired with low-overhead techniques. Based on these observations, I propose and evaluate a multi-tier, strong GPU memory fault tolerance mechanism in which the techniques at each level work closely together to significantly improve accelerator memory system resilience with low overhead.
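To make the page-retirement idea concrete (a simplified, OS-level bookkeeping sketch under assumed thresholds, not the dissertation's hardware mechanisms), the code below tracks corrected errors per physical page and flags a page for retirement once errors repeat at the same address, the typical signature of a permanent fault.

    from collections import defaultdict

    PAGE_SIZE = 4096
    REPEAT_THRESHOLD = 2      # same-address repeats suggesting a permanent fault

    class PageRetirer:
        """Track corrected errors and proactively retire suspicious pages."""
        def __init__(self):
            self.errors = defaultdict(lambda: defaultdict(int))  # page -> addr -> count
            self.retired = set()

        def report_corrected_error(self, phys_addr):
            page = phys_addr // PAGE_SIZE
            self.errors[page][phys_addr] += 1
            # Repeated errors at one address indicate a permanent fault:
            # retire the page before a second fault causes a DUE or SDC.
            if self.errors[page][phys_addr] >= REPEAT_THRESHOLD:
                self.retired.add(page)
            return page in self.retired

    if __name__ == "__main__":
        r = PageRetirer()
        print(r.report_corrected_error(0x12345008))  # first hit: keep the page
        print(r.report_corrected_error(0x12345008))  # repeat: retire the page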
Item Practical irregular prefetching (2020-08-13) Wu, Hao (Ph. D. in computer science); Lin, Yun Calvin; Jain, Akanksha; Fussell, Don; Rossbach, Christopher J.; Sunwoo, Dam

Memory accesses continue to be a performance bottleneck for many programs, and prefetching is an effective and widely used method for alleviating the memory bottleneck. However, prefetching is difficult for irregular workloads, in which the hardware sees no clear patterns such as sequential or strided accesses. For irregular workloads, one promising approach is temporal prefetching, which memorizes temporal correlations observed in the past and uses them to predict future memory accesses. Storing these correlations requires megabytes of metadata, which cannot feasibly be kept on-chip. As a result, previous temporal prefetchers store metadata off-chip in DRAM, which introduces hardware implementation difficulties, increases DRAM latencies, and increases DRAM traffic overhead. For example, the STMS prefetcher proposed by Wenisch et al. has 3.42x DRAM traffic overhead for irregular SPEC2006 workloads. These problems make previous temporal prefetchers impractical to implement in commercial hardware. In this thesis, we propose three methods that alleviate the metadata storage problems of temporal prefetching and make it practical in hardware. First, we propose MISB, a new scheme that uses a metadata prefetcher to manage on-chip metadata. With only 1/5 the traffic overhead of STMS, MISB achieves a 22.7% performance speedup over a baseline with no prefetching, compared to 10.6% for an idealized STMS and 4.5% for a realistic ISB. Second, we present Triage, the first temporal prefetcher that stores its entire metadata on chip, reducing hardware complexity and DRAM traffic by re-purposing part of the last-level cache to store metadata. Triage reduces traffic by 60% compared to MISB and achieves a 13.9% performance speedup over a baseline with no prefetching. In a bandwidth-constrained 8-core environment, Triage achieves an 11.4% speedup compared to 8.0% for MISB. Third, we present a new resource management scheme for Triage's on-chip metadata. This scheme integrates ISP's compressed metadata representation and makes several improvements. For irregular benchmarks, this scheme reduces the on-chip metadata storage requirement by 38% and achieves a 29.6% speedup compared to Triage's 25.3%.
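A minimal illustration of temporal prefetching (a toy software model, not MISB or Triage themselves): record which address followed each address in the miss stream, and on the next occurrence of an address, prefetch its remembered successor. The metadata table here is an ordinary dictionary; the thesis's point is that in hardware this table grows to megabytes and must be stored and managed carefully.

    class TemporalPrefetcher:
        """Toy temporal prefetcher: remember 'B followed A' and, the next
        time A is seen, predict (prefetch) B."""
        def __init__(self):
            self.successor = {}   # metadata: last observed successor of each addr
            self.prev = None

        def access(self, addr):
            prediction = self.successor.get(addr)   # candidate prefetch target
            if self.prev is not None:
                self.successor[self.prev] = addr    # learn the correlation
            self.prev = addr
            return prediction

    if __name__ == "__main__":
        pf = TemporalPrefetcher()
        # An irregular (pointer-chasing-like) address sequence that repeats.
        stream = [0x9A0, 0x133, 0x7F8, 0x2C1, 0x9A0, 0x133, 0x7F8, 0x2C1]
        hits = sum(pf.access(a) == nxt
                   for a, nxt in zip(stream, stream[1:] + [None]))
        print(f"correct next-address predictions: {hits} of {len(stream)}")

On the second pass over the sequence the prefetcher has learned the correlations and predicts each next address correctly, which is exactly the behavior temporal prefetchers exploit for irregular workloads.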