Browsing by Subject "Memory management (Computer science)"
Now showing 1 - 10 of 10
Item Adaptive caching for high-performance memory systems (2007)
Qureshi, Moinuddin Khalil Ahmed, 1978-; Patt, Yale N.

One of the major limiters of computer system performance has been access to main memory, which is typically two orders of magnitude slower than the processor. To bridge this gap, modern processors already devote more than half of the on-chip transistors to the last-level cache. However, traditional cache designs, developed for small first-level caches, are inefficient for large caches. As a result, cache misses are common, which leads to frequent memory accesses and reduced processor performance. Cache management has become even more critical because of increasing memory latency, the growing working sets of many emerging applications, and the shrinking cache capacity available to each core as the number of cores on a single chip increases. This dissertation analyzes some of the problems with managing large caches and proposes cost-effective solutions to improve their performance.

Different workloads and program phases have different locality characteristics that make them better suited to different replacement policies. This dissertation proposes a hybrid replacement policy that can select from multiple replacement policies, depending on which policy has the highest performance. To implement hybrid replacement with low overhead, it shows that cache behavior can be approximated by sampling a few sets, and proposes the concept of Dynamic Set Sampling. The commonly used LRU replacement policy causes thrashing for memory-intensive workloads whose working set is bigger than the cache. This dissertation shows that the performance of such workloads can be improved significantly by changing the recency position at which the incoming line is inserted. The proposed mechanism reduces cache misses by 21% over LRU, is robust across a wide variety of workloads, incurs a total storage overhead of less than two bytes, and does not change the existing cache structure.

Modern systems try to service multiple cache misses in parallel. Variation in Memory Level Parallelism (MLP) makes some misses more costly to performance than others. This dissertation presents the first study of MLP-aware cache replacement and proposes to improve performance by eliminating some of the performance-critical isolated misses. Finally, it analyzes cache partitioning policies for shared caches in chip multiprocessors. Traditional partitioning policies either divide the cache equally among all applications or use the LRU policy to perform demand-based partitioning. This dissertation shows that performance can be improved if the shared cache is instead partitioned according to how much each application benefits from the cache, rather than its demand for the cache. It proposes a novel low-overhead circuit that can dynamically monitor the utility of the cache for any application. The proposed partitioning improves weighted speedup by 11%, throughput by 17%, and fairness by 11% on average compared to LRU.
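A minimal sketch of how a sampling-based hybrid policy selector might look, in the spirit of Dynamic Set Sampling and its set-dueling refinement: a few "leader" sets are dedicated to each candidate policy, their misses nudge a saturating counter, and the remaining sets follow whichever policy is winning. All names, sizes, and thresholds here are illustrative assumptions, not the thesis's parameters.

```cpp
// Sketch of sampling-based hybrid replacement selection (illustrative only).
#include <cstdint>

class PolicySelector {
public:
    // Every kLeaderStride-th set is an A-leader; the set right after it is
    // a B-leader; all other sets are followers.
    static constexpr uint32_t kLeaderStride = 64;

    enum class Policy { A, B };

    Policy policyForSet(uint32_t set) const {
        if (set % kLeaderStride == 0) return Policy::A;        // A-leader set
        if (set % kLeaderStride == 1) return Policy::B;        // B-leader set
        return (psel_ >= kThreshold) ? Policy::B : Policy::A;  // follower set
    }

    // A miss in a leader set is evidence against that leader's policy.
    void onMiss(uint32_t set) {
        if (set % kLeaderStride == 0 && psel_ < kMax) ++psel_;   // A missed
        else if (set % kLeaderStride == 1 && psel_ > 0) --psel_; // B missed
    }

private:
    static constexpr uint32_t kMax = 1023;       // 10-bit saturating counter
    static constexpr uint32_t kThreshold = 512;
    uint32_t psel_ = kThreshold;
};
```

The appeal of sampling is the cost model: only one counter and a fixed choice of leader sets are added, consistent with the abstract's claim of very small storage overhead.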
Item Efficient runahead execution processors (2006)
Mutlu, Onur; Patt, Yale N.

High-performance processors tolerate latency using out-of-order execution. Unfortunately, today's processors face memory latencies on the order of hundreds of cycles. To tolerate such long latencies, out-of-order execution requires an instruction window that is unreasonably large in terms of design complexity, hardware cost, and power consumption. Therefore, current processors spend most of their execution time stalled, waiting for long-latency cache misses to return from main memory, and the problem is getting worse because memory latencies are increasing in terms of processor cycles.

The runahead execution paradigm improves the memory latency tolerance of an out-of-order execution processor by performing potentially useful execution while a long-latency cache miss is in progress. Runahead execution unblocks the instruction window blocked by a long-latency cache miss, allowing the processor to execute far ahead on the program path. As a result, other long-latency cache misses are discovered and their data prefetched into caches long before it is needed. This dissertation presents the runahead execution paradigm and its implementation on an out-of-order execution processor that employs state-of-the-art hardware prefetching techniques. It shows that runahead execution on a 128-entry instruction window achieves the performance of a processor with three times the instruction window size for a current, 500-cycle memory latency. For a near-future 1000-cycle memory latency, runahead execution on a 128-entry window achieves the performance of a conventional processor with eight times the instruction window size, without requiring a significant increase in hardware cost and complexity.

This dissertation also examines and provides solutions to two major limitations of runahead execution: its energy inefficiency and its inability to parallelize dependent cache misses. Simple and effective techniques are proposed to increase the efficiency of runahead execution by reducing the extra instructions executed without affecting the performance improvement. An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance. Finally, this dissertation proposes a new technique, called address-value delta (AVD) prediction, which predicts the values of pointer load instructions encountered in runahead execution in order to parallelize dependent cache misses. It shows that a simple 16-entry AVD predictor improves the performance of a baseline runahead execution processor by 14.3% on a set of pointer-intensive applications, while also reducing the executed instructions by 15.5%. An analysis of the high-level programming constructs that result in AVD-predictable load instructions is provided, and based on this analysis, hardware and software optimizations are proposed to increase the benefits of AVD prediction.
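The AVD idea lends itself to a compact sketch: if the difference between a pointer load's value and its own address is stable across executions, the value can be predicted from the address alone, letting a dependent miss issue early. The 16-entry size follows the abstract; the field names, indexing, and confidence thresholds below are illustrative assumptions.

```cpp
// Sketch of an address-value delta (AVD) predictor (illustrative only).
#include <cstdint>

struct AvdEntry {
    uint64_t tag = 0;
    int64_t  delta = 0;      // last observed (loaded value - load address)
    int      confidence = 0; // saturating confidence counter
};

class AvdPredictor {
public:
    static constexpr int kEntries = 16;        // 16 entries, as evaluated
    static constexpr int kConfThreshold = 2;   // assumed threshold

    // Train on a completed load: record value - address as the delta.
    void train(uint64_t pc, uint64_t addr, uint64_t value) {
        AvdEntry& e = table_[pc % kEntries];
        int64_t d = static_cast<int64_t>(value - addr);
        if (e.tag == pc && e.delta == d) {
            if (e.confidence < 3) ++e.confidence;   // delta is stable
        } else {
            e = {pc, d, 0};                         // reallocate, reset
        }
    }

    // Predict a load's value from its address, if the delta is stable.
    bool predict(uint64_t pc, uint64_t addr, uint64_t* value) const {
        const AvdEntry& e = table_[pc % kEntries];
        if (e.tag == pc && e.confidence >= kConfThreshold) {
            *value = addr + static_cast<uint64_t>(e.delta);
            return true;
        }
        return false;
    }

private:
    AvdEntry table_[kEntries];
};
```

A stable value-minus-address delta is exactly the pattern produced by many linked data structures whose nodes are allocated at regular offsets, which is why pointer-intensive applications benefit.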
Item Enhancing memory controllers to improve DRAM power and performance (2006)
Hur, Ibrahim; Lin, Calvin

Technological advances and new architectural techniques have enabled processor performance to double almost every two years. However, these performance improvements have not resulted in comparable speedups for all applications, because memory system performance has not kept pace with processor performance in modern systems. This dissertation concentrates on the interface between the processors and memory, the memory controller, and proposes novel solutions to all three aspects of the memory problem: bandwidth, latency, and power. To increase the available bandwidth between the memory controller and DRAM, we introduce a new scheduling approach. To hide memory latency, we introduce a new hardware prefetching technique that is useful for applications with regular or irregular memory accesses. Finally, we show how memory controllers can be used to reduce DRAM power consumption.

We evaluate our techniques in the context of the memory controller of a highly tuned modern processor, the IBM Power5+. Our evaluation on both technical and commercial benchmarks, in single-threaded and simultaneous multi-threaded environments, shows that our techniques for bandwidth increase, latency hiding, and power reduction achieve significant improvements. For example, for single-threaded applications, our scheduling approach and prefetching method implemented together improve the performance of the SPEC2006fp, NAS, and a set of commercial benchmarks by 14.3%, 13.7%, and 11.2%, respectively. In addition to providing substantial performance and power improvements, our techniques are superior to previously proposed methods in terms of cost as well. For example, a version of our scheduling approach has been implemented in the Power5+, and it increased the transistor count of the chip by only 0.02%. This dissertation shows that, without increasing the complexity of either the processor or the memory organization, all three aspects of memory systems can be significantly improved with low-cost enhancements to the memory controller.

Item Memory management for high-performance applications (2002)
Berger, Emery David; McKinley, Kathryn S.

Memory managers are a source of performance and robustness problems for application software. Current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. Even on uniprocessors, the memory manager is often a performance bottleneck. General-purpose memory managers also do not provide the bulk deletion semantics required by many applications, including web servers and compilers. The approaches taken to date to address these and other memory management problems have been largely ad hoc. Programmers often attempt to work around these problems by writing custom memory managers. This approach leads to further difficulties, including data corruption caused when programmers inadvertently free custom-allocated objects to the general-purpose memory manager.

In this thesis, we develop a framework for analyzing and designing high-quality memory managers. We develop a memory management infrastructure called heap layers that allows programmers to compose efficient memory managers from reusable and independently testable components. We conduct the first comprehensive examination of custom memory managers and show that most of them achieve slight or no performance improvement over a state-of-the-art general-purpose memory manager. Building on the knowledge gained in this study, we develop a hybrid memory management abstraction called reaps that combines the best of both approaches, allowing server applications to manage memory quickly and flexibly while avoiding memory leaks. We identify a number of previously unnoticed problems with concurrent memory management and analyze previous work in light of these discoveries. We then present a concurrent memory manager called Hoard and prove that it avoids these problems.
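Heap layers composes allocators from small, independently testable components using C++ mixins: each layer is a class template that inherits from its "super heap" and refines malloc/free. The specific layers below are a minimal sketch of that style, not the actual Heap Layers classes.

```cpp
// Sketch of mixin-style allocator composition (illustrative layers only).
#include <cstdio>
#include <cstdlib>

// Bottom layer: delegate to the system allocator.
class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void  free(void* p)     { std::free(p); }
};

// A layer that counts allocations, stackable on any super heap.
template <class SuperHeap>
class StatsHeap : public SuperHeap {
public:
    void* malloc(size_t sz) { ++allocs_; return SuperHeap::malloc(sz); }
    size_t allocations() const { return allocs_; }
private:
    size_t allocs_ = 0;
};

// A layer that rounds requests up to a minimum size.
template <class SuperHeap>
class MinSizeHeap : public SuperHeap {
public:
    void* malloc(size_t sz) { return SuperHeap::malloc(sz < 16 ? 16 : sz); }
};

int main() {
    // Layers compose by template instantiation, not runtime indirection.
    StatsHeap<MinSizeHeap<MallocHeap>> heap;
    void* p = heap.malloc(8);
    heap.free(p);
    std::printf("allocations: %zu\n", heap.allocations());
}
```

Because layers compose at compile time, the compiler can inline across them, which is what lets composed managers stay competitive with monolithic ones.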
Item Prefetch mechanisms by application memory access pattern (2007-05)
Agaram, Kartik Kandadai; Keckler, Stephen W.

Modern computer systems spend a substantial fraction of their running time waiting for data from memory. While prefetching has been a promising avenue of research for reducing and tolerating memory latencies, it has also been a challenge to implement, largely because of the growing complexity of memory hierarchies and the wide variety of application behaviors. In this dissertation we propose a new methodology that emphasizes decomposing complex behavior at the application level into regular components that are intelligible at a high level to the architect.

This dissertation is divided into three stages. In the first, we build tools to help decompose application behavior by data structure and phase, and use these tools to create a richer picture of application behavior than conventional simulation tools provide, yielding compressed summaries of dominant access patterns. The variety of access patterns drives the next stage: the design of a prefetch system that improves on the state of the art. Every prefetching system must make low-overhead decisions on what to prefetch, when to prefetch it, and where to store prefetched data. Visualizing application access patterns allows us to articulate the subtleties in making these decisions and the many ways that a mechanism that improves one decision for one set of applications may degrade the quality of another decision for a different set. Our insights lead us to a new system called TwoStep, with a small set of independent but synergistic mechanisms. In the third stage we perform a detailed evaluation of TwoStep. We find that while it outperforms past approaches for the most irregular applications in our benchmark suite, it is unable to improve on the speedups for more regular applications. Understanding why leads to an improved understanding of two general categories of prefetch techniques. Prefetching can either look back at past history or look forward by precomputing an application's future requirements. Applications with a low compute-access ratio can benefit from history-based prefetching if their access pattern is not too irregular. Applications with irregular access patterns may benefit from precomputation-based prefetching, as long as their compute-access ratio is not too low.
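For context on the what/when/where decisions named above, here is a generic history-based (stride) prefetcher sketch. This is not the TwoStep design, whose mechanisms the abstract does not detail, but a minimal example of the "look back at past history" category; table size and confidence thresholds are assumptions.

```cpp
// Sketch of a stride prefetcher in the history-based category (illustrative).
#include <cstdint>

struct StrideEntry {
    uint64_t pc = 0, lastAddr = 0;
    int64_t  stride = 0;
    int      confidence = 0;
};

class StridePrefetcher {
public:
    static constexpr int kEntries = 64;

    // Observe a demand access; return true and set *prefetchAddr when a
    // stable stride answers "what to prefetch".
    bool observe(uint64_t pc, uint64_t addr, uint64_t* prefetchAddr) {
        StrideEntry& e = table_[pc % kEntries];
        if (e.pc == pc) {
            int64_t s = static_cast<int64_t>(addr - e.lastAddr);
            if (s == e.stride && s != 0) {
                if (e.confidence < 3) ++e.confidence;  // stride confirmed
            } else {
                e.stride = s;                           // new candidate stride
                e.confidence = 0;
            }
        } else {
            e = {pc, addr, 0, 0};                       // new load PC
        }
        e.lastAddr = addr;
        if (e.confidence >= 2) {
            // "When": issue now. "Where": e.g., into a lower cache level.
            *prefetchAddr = addr + static_cast<uint64_t>(e.stride);
            return true;
        }
        return false;
    }

private:
    StrideEntry table_[kEntries];
};
```

A scheme like this does well on regular access patterns but, as the abstract observes for history-based prefetching generally, degrades as patterns become irregular.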
Item Scalable hardware memory disambiguation (2007-12)
Sethumadhavan, Lakshminarasimhan, 1978-; Burger, Douglas C., Ph. D.

This dissertation deals with one of the long-standing problems in computer architecture: memory disambiguation. Microprocessors typically reorder memory instructions during execution to improve concurrency. Such microprocessors use hardware memory structures known as Load-Store Queues (LSQs) for memory disambiguation, to ensure that memory instruction dependences are satisfied even when the memory instructions execute out of order. A typical LSQ implementation (circa 2006) holds all in-flight memory instructions in a physically centralized LSQ and performs a fully associative search on all buffered instructions to ensure that memory dependences are satisfied. These implementations do not scale because they use large, fully associative structures, which are known to be slow and power hungry. The increasing trend toward distributed microarchitectures further exacerbates these problems: as on-chip wire delays increase and high-performance processors become necessarily distributed, centralized structures such as the LSQ can limit scalability.

This dissertation describes techniques to create scalable LSQs in both centralized and distributed microarchitectures. The problems and solutions described in this thesis are motivated and validated by real system designs. The dissertation starts with a description of the partitioned primary memory system of the TRIPS processor, of which the LSQ is an important component, and then, through a series of optimizations, describes how the power, area, and centralization problems of the LSQ can be solved with minor performance losses (if any), even for large numbers of in-flight memory instructions. The four solutions described in this dissertation, namely partitioning, filtering, late binding, and efficient overflow management, enable power- and area-efficient, distributed, and scalable LSQs, which in turn enable aggressive large-window processors capable of simultaneously executing thousands of instructions.

To mitigate the power problem, we replace the power-hungry, fully associative search with a power-efficient hash-table lookup using a simple address-based Bloom filter. Bloom filters are probabilistic data structures used for testing set membership; they can quickly check whether an instruction with the same data address is likely to be found in the LSQ without performing the associative search. Bloom filters typically eliminate more than 80% of the associative searches, and they are highly effective because in most programs it is uncommon for loads and stores to have the same data address and be in execution simultaneously.

To rectify the area problem, we observe that only a small fraction of all memory instructions are dependent, that only such dependent instructions need to be buffered in the LSQ, and that these instructions need to be in the LSQ only for certain parts of the pipelined execution. We propose two mechanisms to exploit these observations. The first, area filtering, is a hardware mechanism that couples Bloom filters and dependence predictors to dynamically identify and buffer only those instructions that are likely to be dependent. The second, late binding, reduces the occupancy, and hence the size, of the LSQ. Together these optimizations allow the number of LSQ slots to be reduced by up to one half compared to a traditional organization, without any performance degradation.

Finally, we describe a new decentralized LSQ design for handling LSQ structural hazards in distributed microarchitectures. Decentralization of LSQs, and to a large extent distributed microarchitectures with memory speculation, has proved impractical because of the high performance penalties associated with the mechanisms for dealing with hazards. To solve this problem, we apply classic flow-control techniques from interconnection networks to handle resource conflicts. The first method, memory-side buffering, buffers the overflowing instructions in a separate buffer near the LSQs. The second, execution-side NACKing, sends the overflowing instruction back to the issue window, from which it is later re-issued. The third, network buffering, uses the buffers in the interconnection network between the execution units and memory to hold instructions when the LSQ is full, and uses virtual-channel flow control to avoid deadlocks. The network buffering scheme is the most robust of the overflow schemes and shows less than 1% performance degradation due to overflows for a subset of the SPEC CPU 2000 and EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS processor.

The techniques proposed in this dissertation are independent and architecture-neutral, and their cumulative benefits result in LSQs that can be partitioned at a fine granularity and have low design complexity. Each of these partitions selectively buffers only memory instructions with true dependences and can be closely coupled with the execution units, minimizing power, area, and latency. Such LSQ designs with near-ideal characteristics are well suited to microarchitectures with thousands of instructions in flight and may enable even more aggressive microarchitectures in the future.
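The Bloom filter optimization can be sketched directly: a hashed bit-vector answers "could any buffered instruction match this address?", and a definite no lets the pipeline skip the fully associative LSQ search. The filter size and hash functions below are illustrative assumptions.

```cpp
// Sketch of an address-based Bloom filter guarding LSQ searches (illustrative).
#include <bitset>
#include <cstddef>
#include <cstdint>

class LsqBloomFilter {
public:
    // Record the block address of a newly buffered memory instruction.
    void insert(uint64_t addr) {
        bits_.set(hash1(addr));
        bits_.set(hash2(addr));
    }

    // If this returns false, no buffered instruction can match the address,
    // so the expensive associative search is skipped. False positives only
    // cost an unnecessary search; correctness is preserved.
    bool mayMatch(uint64_t addr) const {
        return bits_.test(hash1(addr)) && bits_.test(hash2(addr));
    }

    // Bloom filters do not support exact deletion, so the filter is cleared
    // wholesale at convenient points (e.g., on a pipeline flush).
    void clear() { bits_.reset(); }

private:
    static constexpr size_t kBits = 1024;
    static size_t hash1(uint64_t a) { return (a >> 6) % kBits; }  // drop block offset
    static size_t hash2(uint64_t a) { return ((a >> 6) * 0x9E3779B9u) % kBits; }
    std::bitset<kBits> bits_;
};
```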
Item Surface chemistry of GeHx with dielectric surfaces : towards directed nanocrystal growth (2008-08)
Winkenwerder, Wyatt August, 1981-; Ekerdt, John G.

The surface chemistry of GeH[subscript x] with dielectric surfaces is relevant to the application of germanium (Ge) nanocrystals in nanocrystal flash memory devices. GeH[subscript x] surface chemistry was first explored on thermally grown SiO₂, revealing that GeH[subscript x] undergoes two temperature-dependent reactions that remove Ge from the SiO₂ surface as GeH₄ and Ge, respectively. Ge accumulates only through reactions between GeH[subscript x] species that form stable Ge clusters on the SiO₂ surface. Next, a Si-etched SiO₂ surface is probed with GeH[subscript x], revealing that the Si-etching defect activates the surface toward Ge deposition. The activation involves two separate reactions: first, the capture of GeH[subscript x] by the defect, and second, a reaction between the captured Ge and remaining GeH[subscript x] species leading to the formation of Ge clusters. Reacting the defect with diborane deactivates it toward GeH[subscript x] and also deactivates intrinsic hydroxyl groups toward GeH[subscript x] adsorption. A structure is proposed for the Si-etching defect. The surface chemistry of GeH[subscript x] with HfO₂ is studied, showing that the hafnium germanate that forms beneath the Ge nanocrystals exists as islands and not as a continuous film. Annealing the hafnium germanate under a silane atmosphere reduces it to Ge while leading to the deposition of hafnium silicate (HfSiO[subscript x]) and silicon (Si). Treating the HfO₂ with silane prior to Ge nanocrystal growth yields a surface with hafnium silicate islands on which Si also deposits, and Ge deposition on this surface suppresses hafnium germanate formation. Electrical testing of capacitors made from Ge nanocrystals and HfO₂ shows that Ge nanocrystals encapsulated in Si/HfSiO[subscript x] layers have greatly improved retention characteristics.

Item Switch-based Fast Fourier Transform processor (2008-12)
Mohd, Bassam Jamil, 1968-; Swartzlander, Earl E.

The demand for high-performance and power-scalable DSP processors for telecommunications and portable devices has increased significantly in recent years. The Fast Fourier Transform (FFT) computation is essential to such designs. This work presents a switch-based architecture for designing radix-2 FFT processors. The processor employs M processing elements, 2M memory arrays, and M Read-Only Memories (ROMs). Each processing element performs one radix-2 butterfly operation. The memory arrays are designed as single-port memories, each of size N/(2M), where N is the number of FFT points. Compared with a single processing element, this approach provides a speedup of M. If not addressed, memory collisions degrade processor performance. A novel algorithm to detect and resolve the collisions is presented. When a collision is detected, a memory management operation is executed. The performance of the switch architecture can be further enhanced by pipelining the design, where each pipeline stage employs a switch component; the result is a speedup of M log₂ N compared with a single processing element. The use of single-port memory reduces design complexity and area. Furthermore, memory arrays significantly reduce power compared with the delay elements used in some FFT processors. The switch-based architecture facilitates deactivating processing elements for power scalability, and it also facilitates implementing different FFT sizes. The VLSI implementation of a non-pipelined switch-based processor is presented. MATLAB simulations are conducted to analyze performance. The timing, power, and area results from RTL, synthesis, and layout simulations are discussed and compared with other processors.
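A sketch of the two ingredients the abstract names: the radix-2 butterfly each processing element performs, and a mapping of sample indices onto 2M single-port memory banks. The interleaved mapping and the collision test below are illustrative assumptions, not the thesis's actual addressing or collision-resolution scheme.

```cpp
// Sketch of a radix-2 butterfly and single-port bank mapping (illustrative).
#include <complex>
#include <cstddef>

using cplx = std::complex<double>;

// One radix-2 decimation-in-time butterfly: the operation each PE performs.
inline void butterfly(cplx& a, cplx& b, cplx w) {
    cplx t = w * b;   // twiddle factor w comes from the PE's ROM
    b = a - t;
    a = a + t;
}

// With 2M single-port banks each holding N/(2M) samples, a simple
// interleaved mapping from sample index to (bank, offset).
struct BankAddr { size_t bank, offset; };

inline BankAddr mapToBank(size_t index, size_t M) {
    return { index % (2 * M), index / (2 * M) };
}

// A "collision" is two same-cycle accesses to the same single-port bank;
// detecting one triggers the memory management operation described above.
inline bool collides(size_t i, size_t j, size_t M) {
    return mapToBank(i, M).bank == mapToBank(j, M).bank;
}
```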
Item A technology-scalable composable architecture (2007)
Kim, Changkyu; Burger, Douglas C., Ph. D.

Clock rate scaling can no longer sustain computer system performance growth, due to power and thermal constraints and the diminishing returns of deep pipelining. Future performance improvements must therefore come from mining concurrency from applications. However, increasing global on-chip wire delays will limit the amount of state reachable in a single cycle, hampering the ability to mine concurrency with conventional approaches. To address these technology challenges, the processor industry has migrated to chip multiprocessors (CMPs). The disadvantage of conventional CMP architectures, however, is their relative inflexibility in meeting the wide range of application demands and operating targets that now exist. The granularity (e.g., issue width), the number of processors on a chip, and the memory hierarchy are fixed at design time based on the target workload mix, which results in suboptimal operation as the workload mix and operating targets change over time.

In this dissertation, we explore the concept of composability to address both the increasing wire delay problem and the inflexibility of conventional CMP architectures. Composability is the ability to adapt dynamically to diverse applications and operating targets, in both granularity and functionality, by aggregating fine-grained processing units or memory units. First, we propose a composable on-chip memory substrate, called Non-Uniform Access Cache Architecture (NUCA), to address increasing on-chip wire delay for future large caches. The NUCA substrate breaks large on-chip memories into many fine-grained, independently accessible memory banks, with a switched network embedded in the cache. Lines can be mapped into this array of banks with fixed mappings or dynamic mappings, in which cache lines move around within the cache to further reduce the average hit latency. Second, we evaluate a range of strategies for building a composable processor. Composable processors can adapt processor granularity to various application demands and operating targets, choosing the hardware configuration best suited to any given point. A composable processor consists of a large number of low-power, fine-grained processor cores that can be aggregated dynamically to form more powerful logical processors. We present architectural innovations to support composability in a power- and area-efficient manner.
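A minimal sketch of the fixed-mapping flavor of NUCA described above: the line address selects a bank in a grid, and hit latency grows with that bank's network distance from the cache controller. Grid dimensions and latency parameters are illustrative assumptions; the dynamic-mapping variant would additionally migrate hot lines toward closer banks.

```cpp
// Sketch of static NUCA-style bank mapping and non-uniform latency
// (illustrative parameters).
#include <cstddef>
#include <cstdint>

struct NucaConfig {
    size_t rows = 4, cols = 4;   // 16 independently accessible banks
    size_t lineBytes = 64;
    size_t baseLatency = 3;      // cycles to access a bank
    size_t hopLatency  = 2;      // cycles per network hop
};

// Fixed mapping: the line address picks the bank; lines never migrate.
inline size_t bankOf(uint64_t addr, const NucaConfig& c) {
    return (addr / c.lineBytes) % (c.rows * c.cols);
}

// Non-uniform hit latency: Manhattan distance from a controller at (0,0).
inline size_t hitLatency(uint64_t addr, const NucaConfig& c) {
    size_t b = bankOf(addr, c);
    size_t hops = b / c.cols + b % c.cols;
    return c.baseLatency + c.hopLatency * hops;
}
```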
Item The use of memory state knowledge to improve computer memory system organization (2011-05)
Isen, Ciji; John, Lizy Kurian; McKinley, Kathryn S.; Erez, Mattan; Aziz, Adnan; Bhargava, Ravi; Gratz, Paul V.

Trends in virtualization and multi-core, multiprocessor environments have translated into a massive increase in the amount of main memory each individual system must be fitted with in order to effectively utilize this growing compute capacity. The increasing demand on main memory implies that main memory devices and their issues are as important a part of system design as the central processors. The primary issues for modern memory are power, energy, and the scaling of capacity: nearly a third of system power and energy can come from the memory subsystem, while modern main memory devices are limited by technology in their ability to keep scaling with program demands, requiring the exploration of alternative main memory storage technologies.

This dissertation exploits dynamic knowledge of memory state and memory data values to improve memory performance and reduce memory energy consumption. It proposes a cross-boundary approach that communicates dynamic memory management state (allocated and deallocated memory) between software and the hardware memory subsystem through a combination of ISA support and hardware structures. These mechanisms identify memory operations to regions of memory that have no impact on the correct execution of the program because the regions were either freshly allocated or deallocated: data in deallocated regions is no longer useful to the program, and data in freshly allocated memory is also not useful because it has not yet been defined by the program. Being cognizant of this, such memory operations can be avoided, saving energy and improving the usefulness of main memory. Furthermore, when stores write zeros to memory, the number of stores reaching memory is reduced by capturing them as compressed information stored alongside the memory management state. Using these methods, this dissertation harnesses memory management state and data value information to achieve significant savings in energy consumption while extending the endurance limit of memory technologies.
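One way to picture the bookkeeping the abstract describes: track per-block allocation state so that accesses to freshly allocated or freed regions, and stores of all zeros, need not touch the memory device. The structures below are an illustrative software model of such hardware, not the dissertation's design; a real implementation would use compact hardware tables fed by ISA-level allocation hints.

```cpp
// Sketch of allocation-state-aware memory access filtering (illustrative).
#include <cstdint>
#include <unordered_map>

enum class BlockState { Unallocated, FreshlyAllocated, Written, Zero };

class MemStateTracker {
public:
    // Allocation events communicated from software (e.g., malloc/free)
    // via assumed ISA support.
    void onAlloc(uint64_t block) { state_[block] = BlockState::FreshlyAllocated; }
    void onFree(uint64_t block)  { state_[block] = BlockState::Unallocated; }

    // An all-zero store is captured as compressed state, not a data write.
    void onStore(uint64_t block, bool allZero) {
        state_[block] = allZero ? BlockState::Zero : BlockState::Written;
    }

    // A load must touch the memory device only if real data was written:
    // fresh or freed blocks hold no program-defined data, and zero blocks
    // are reconstructed from the flag, saving device accesses and energy.
    bool loadNeedsDevice(uint64_t block) const {
        auto it = state_.find(block);
        if (it == state_.end()) return true;       // untracked: go to memory
        return it->second == BlockState::Written;
    }

private:
    std::unordered_map<uint64_t, BlockState> state_;
};
```

Avoided writes matter doubly for technologies with limited write endurance, which is how the same state information extends device lifetime as well as saving energy.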