Browsing by Subject "Cache memory"
Now showing 1 - 10 of 10
Item: Adaptive caching for high-performance memory systems (2007). Qureshi, Moinuddin Khalil Ahmed, 1978-; Patt, Yale N.

One of the major limiters to computer system performance has been the access to main memory, which is typically two orders of magnitude slower than the processor. To bridge this gap, modern processors already devote more than half of the on-chip transistors to the last-level cache. However, traditional cache designs, developed for small first-level caches, are inefficient for large caches. Therefore, cache misses are common, which results in frequent memory accesses and reduced processor performance. The importance of cache management has become even more critical because of increasing memory latency, the increasing working sets of many emerging applications, and the decreasing cache capacity available to each core as the number of cores on a single chip grows. This dissertation focuses on analyzing some of the problems with managing large caches and proposing cost-effective solutions to improve their performance. Different workloads and program phases have different locality characteristics that make them better suited to different replacement policies. This dissertation proposes a hybrid replacement policy that can select from multiple replacement policies depending on which policy has the highest performance. To implement hybrid replacement with low overhead, it shows that cache behavior can be approximated by sampling a few sets and proposes the concept of Dynamic Set Sampling. The commonly used LRU replacement policy results in thrashing for memory-intensive workloads that have a working set bigger than the cache size. This dissertation shows that the performance of memory-intensive workloads can be improved significantly by changing the recency position at which the incoming line is inserted. The proposed mechanism reduces cache misses by 21% over LRU, is robust across a wide variety of workloads, incurs a total storage overhead of less than two bytes, and does not change the existing cache structure. Modern systems try to service multiple cache misses in parallel. The variation in Memory Level Parallelism (MLP) causes some misses to have a higher performance cost than others. This dissertation presents the first study on MLP-aware cache replacement and proposes to improve performance by eliminating some of the performance-critical isolated misses. Finally, this dissertation also analyzes cache partitioning policies for shared caches in chip multiprocessors. Traditional partitioning policies either divide the cache equally among all applications or use the LRU policy to do a demand-based cache partitioning. This dissertation shows that performance can be improved if the shared cache is partitioned based on how much each application benefits from the cache, rather than on its demand for the cache. It proposes a novel low-overhead circuit that can dynamically monitor the utility of the cache for any application. The proposed partitioning improves weighted speedup by 11%, throughput by 17%, and fairness by 11% on average compared to LRU.
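Two of the mechanisms named in this abstract, approximating whole-cache behavior by sampling a few sets and changing the position at which an incoming line is inserted, can be illustrated with a small software model. The sketch below is illustrative only and is not the dissertation's hardware design; the class name, the sampling stride, and the counter width are assumptions.

```python
from collections import OrderedDict

class SampledHybridCache:
    """Illustrative sketch: a set-associative cache in which a few sampled
    'leader' sets are dedicated to two competing policies (insert at the MRU
    vs. the LRU position), and a saturating counter steers the remaining
    'follower' sets toward whichever policy is missing less often.
    All parameters here are assumptions for illustration."""

    def __init__(self, num_sets=1024, ways=16, sample_stride=32):
        self.ways = ways
        # Each set is an OrderedDict: leftmost = LRU, rightmost = MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        # Dedicate a small fraction of sets to each candidate policy.
        self.leader_mru = set(range(0, num_sets, sample_stride))
        self.leader_lru = set(range(sample_stride // 2, num_sets, sample_stride))
        self.psel = 0          # saturating counter: > 0 favors MRU insertion
        self.misses = 0

    def access(self, set_idx, tag):
        s = self.sets[set_idx]
        if tag in s:                        # hit: promote to the MRU position
            s.move_to_end(tag)
            return True
        self.misses += 1
        # Misses in leader sets vote against that leader's policy.
        if set_idx in self.leader_mru:
            self.psel = max(self.psel - 1, -512)
        elif set_idx in self.leader_lru:
            self.psel = min(self.psel + 1, 511)
        if len(s) >= self.ways:             # evict the line at the LRU position
            s.popitem(last=False)
        insert_at_mru = (set_idx in self.leader_mru or
                         (set_idx not in self.leader_lru and self.psel > 0))
        s[tag] = None                       # inserted at the MRU end by default
        if not insert_at_mru:
            s.move_to_end(tag, last=False)  # place the incoming line at LRU
        return False
```

A typical use of such a model is to feed an address trace through access() and compare the miss count against a pure-LRU baseline.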
Item: Algorithms and data structures for cache-efficient computation: theory and experimental evaluation (2007-08). Chowdhury, Rezaul Alam; Ramachandran, Vijaya

The ideal-cache model is an abstraction of the memory hierarchy in modern computers which facilitates the design of algorithms that can use the caches (i.e., memory levels) in the hierarchy efficiently without using knowledge of cache parameters. In addition to possibly running faster than traditional flat-memory algorithms due to reduced cache misses, these cache-oblivious algorithms are also system-independent and thus more portable than cache-aware algorithms. These algorithms are useful both in applications that work on massive datasets and in applications that run on small-memory systems such as handheld devices. The major contribution of this dissertation is a number of new cache-efficient and cache-oblivious algorithms and data structures for problems in three different domains: graph algorithms, problems in the Gaussian Elimination Paradigm (GEP), and problems with dynamic programming algorithms. Among graph problems we concentrate on shortest path computation, and for the computation-intensive problems in the latter two domains we also present efficient parallelizations of our cache-oblivious algorithms for distributed and shared caches. We perform an extensive experimental study of most of our algorithms and compare them with the best known existing algorithms and software. In the area of graph algorithms and data structures, we introduce the first efficient cache-oblivious priority queue supporting Decrease-Key operations, and use it to obtain the first non-trivial cache-oblivious single-source shortest path algorithms for both directed and undirected graphs with general non-negative edge weights. Our experimental results show that shortest path computation using a light-weight version of this new priority queue is faster than using highly optimized traditional priority queues even when the computation is in-core. We also present several new cache-efficient exact and approximate all-pairs shortest path algorithms for both weighted and unweighted undirected graphs. The Gaussian Elimination Paradigm (GEP) includes many important practical problems with constructs similar to those in Gaussian elimination without pivoting, e.g., Floyd-Warshall's all-pairs shortest path, LU decomposition without pivoting, matrix multiplication, etc. We present a general cache-oblivious framework for cache-efficient sequential and parallel solution of any problem in GEP. Our experimental results comparing our cache-oblivious algorithms with industrial-strength cache-aware BLAS (Basic Linear Algebra Subprograms) code suggest that our GEP framework offers an attractive trade-off between efficiency and portability. In the domain of dynamic programs, we present efficient cache-oblivious sequential and parallel algorithms for a number of important dynamic programs in bioinformatics, including optimal pairwise sequence alignment, median of three sequences, and RNA secondary structure prediction with and without (simple) pseudoknots. All our algorithms improve significantly over the cache complexity of earlier algorithms, and either match or improve over their space complexity. We empirically compare most of our algorithms with the best publicly available code written by others, and our experimental results indicate that our algorithms run faster than this software.
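To give a concrete flavor of the cache-oblivious style described above, here is a minimal recursive matrix-multiplication sketch (matrix multiplication is one of the GEP problems mentioned). It is illustrative only: the base-case size and the power-of-two assumption are mine, and the dissertation's algorithms are considerably more involved.

```python
def co_matmul(A, B, C, io=0, jo=0, ko=0, n=None, base=32):
    """Cache-oblivious sketch: C[io:io+n, jo:jo+n] += A[io:io+n, ko:ko+n] *
    B[ko:ko+n, jo:jo+n], computed by recursive quadrant decomposition.
    No cache size appears anywhere; at some recursion depth the working
    submatrices fit in each level of the memory hierarchy.
    Assumes square lists-of-lists and n a power of two for simplicity."""
    if n is None:
        n = len(A)
    if n <= base:                          # base case: ordinary triple loop
        for i in range(n):
            for k in range(n):
                aik = A[io + i][ko + k]
                for j in range(n):
                    C[io + i][jo + j] += aik * B[ko + k][jo + j]
        return
    h = n // 2                             # recurse on the eight quadrant products
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                co_matmul(A, B, C, io + di, jo + dj, ko + dk, h, base)
```

The point of the structure is that the same code adapts to every cache level at once, which is what makes such algorithms portable across machines.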
Item: Algorithms for distributed caching and aggregation (2007-12). Tiwari, Mitul; Plaxton, C. Greg

In recent years, there has been an explosion in the amount of distributed data due to the ever decreasing cost of both storage and bandwidth. There is a growing need for automatic distributed data management techniques. The three main areas in dealing with distributed data that we address in this dissertation are (1) cooperative caching, (2) compression caching, and (3) aggregation.

First, we address cooperative caching, in which caches cooperate to locate and cache data objects. The benefits of cooperative caching have been demonstrated by various studies. We address a hierarchical generalization of cooperative caching in which caches are arranged as leaf nodes in a hierarchical tree network, and we call this variant Hierarchical Cooperative Caching. We present a deterministic hierarchical generalization of LRU that is constant-competitive when the capacity blowup is linear in d, the depth of the cache hierarchy. Furthermore, we show that any randomized hierarchical cooperative caching algorithm with capacity blowup b has competitive ratio Ω(log_b d) against an oblivious adversary. Thus we establish that there is no resource-competitive algorithm for the hierarchical cooperative caching problem.

Second, we address a class of compression caching problems in which a file can be cached in multiple formats with varying sizes and encode/decode costs. In this work, we address three problems in this class of compression caching. The first problem assumes that the encode cost and decode cost associated with any format of a file are equal. For this problem we present a resource-competitive online algorithm. To explore the existence of resource-competitive online algorithms for compression caching with arbitrary encode costs and decode costs, we address two other natural problems in the aforementioned class, and for each of these problems, we show that there exists a non-constant lower bound on the competitive ratio of any online algorithm, even if the algorithm is given an arbitrary factor capacity blowup. Thus, we establish that there is no resource-competitive algorithm for compression caching in its full generality.

Third, we address the problem of aggregation over trees with the goal of adapting aggregation aggressiveness. Consider a distributed network with nodes arranged in a tree, each node having a local value. We consider the problem of aggregating values (e.g., summing values) from all nodes to the requesting nodes in the presence of writes. The goal is to minimize the total number of messages exchanged. The key challenges are to define a notion of "acceptable" aggregate values, and to design algorithms with good performance that are guaranteed to produce such values. We formalize the acceptability of aggregate values in terms of certain consistency guarantees. We propose a lease-based aggregation mechanism, and evaluate algorithms based on this mechanism in terms of consistency and performance. With regard to consistency, we adapt the definitions of strict and causal consistency to apply to the aggregation problem. We show that any lease-based aggregation algorithm provides strict consistency in sequential executions, and causal consistency in concurrent executions. With regard to performance, we propose an online lease-based aggregation algorithm, and show that, for sequential executions, the algorithm is constant-competitive against any offline algorithm that provides strict consistency. Our online lease-based aggregation algorithm is presented in the form of a fully distributed protocol, and the aforementioned consistency and performance results are formally established with respect to this protocol. We also present experimental results to show that the algorithm performs well under various workloads.
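The lease idea in the third part above trades pushes on writes against pulls on reads. The sketch below shows one simplified way such a scheme could work for a sum aggregate; the specific lease semantics (a child holding a lease pushes every update to its parent, otherwise the parent pulls on demand) are assumptions for illustration and are simpler than the protocol analyzed in the dissertation.

```python
class Node:
    """Illustrative lease-based aggregation over a tree (sum aggregate).
    With lease_to_parent set, a node proactively pushes its subtree sum
    upward, making parent reads cheap; without it, reads pull fresh values
    from below. Tuning which nodes hold leases is what an adaptive policy
    would do; the rules here are assumptions, not the dissertation's protocol."""

    def __init__(self, value=0, children=()):
        self.value = value
        self.children = list(children)
        self.parent = None
        self.lease_to_parent = False     # does this node push updates upward?
        self.cached_subtree_sum = None   # parent-visible cache when pushing
        self.messages = 0                # rough message count for comparisons
        for c in self.children:
            c.parent = self

    def write(self, new_value):
        self.value = new_value
        # Push the refreshed subtree sum upward only along active leases.
        node = self
        while node.lease_to_parent and node.parent is not None:
            node.cached_subtree_sum = node._local_sum()
            node.messages += 1           # one update message to the parent
            node = node.parent

    def _local_sum(self):
        total = self.value
        for c in self.children:
            if c.lease_to_parent and c.cached_subtree_sum is not None:
                total += c.cached_subtree_sum   # served from the pushed cache
            else:
                total += c.read()               # pull down the tree on demand
        return total

    def read(self):
        self.messages += 1               # one request/response pair
        return self._local_sum()
```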
Item: Competitive cache replacement strategies for a shared cache (2011-05). Katti, Anil Kumar; Ramachandran, Vijaya; Plaxton, Greg

We consider cache replacement algorithms at a shared cache in a multicore system which receives an arbitrary interleaving of requests from processes that have full knowledge about their individual request sequences. We establish tight bounds on the competitive ratio of deterministic and randomized cache replacement strategies when processes share memory blocks. Our main result for this case is a deterministic algorithm called GLOBAL-MAXIMA which is optimal up to a constant factor when processes share memory blocks. Our framework is a generalization of the application-controlled caching framework in which processes access disjoint sets of memory blocks. We also present a deterministic algorithm called RR-PROC-MARK which exactly matches the lower bound on the competitive ratio of deterministic cache replacement algorithms when processes access disjoint sets of memory blocks. We extend our results to multiple levels of caches and prove that an exclusive cache is better than both inclusive and non-inclusive caches; this validates the experimental findings in the literature. Our results could be applied to shared caches in multicore systems in which processes work together on multithreaded computations like the Gaussian elimination paradigm, fast Fourier transform, matrix multiplication, etc. In these computations, processes have full knowledge about their individual request sequences and can share memory blocks.
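The model above grants each process full knowledge of its own future requests, even though the interleaving between processes is adversarial. As a loose illustration of how such per-process knowledge might be used, the sketch below evicts the resident block whose next use by its owning process lies furthest in the future; it is emphatically not the GLOBAL-MAXIMA or RR-PROC-MARK algorithm of the dissertation, whose details are not given here, and the owner map is a simplifying assumption.

```python
def evict_furthest_per_process(cache, futures, owner):
    """Illustrative victim selection for a shared cache where each process
    knows its own future request sequence.

    cache:   set of resident block ids
    futures: dict mapping process id -> list of that process's upcoming requests
    owner:   dict mapping block id -> the process expected to request it next
             (an assumption made for this sketch)
    Returns the block whose next per-process use is furthest away."""
    def next_use(block):
        seq = futures.get(owner.get(block), [])
        try:
            return seq.index(block)      # position of the block's next request
        except ValueError:
            return float("inf")          # never requested again: ideal victim
    return max(cache, key=next_use)
```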
Item: Efficient runahead execution processors (2006). Mutlu, Onur; Patt, Yale N.

High-performance processors tolerate latency using out-of-order execution. Unfortunately, today's processors are facing memory latencies on the order of hundreds of cycles. To tolerate such long latencies, out-of-order execution requires an instruction window that is unreasonably large in terms of design complexity, hardware cost, and power consumption. Therefore, current processors spend most of their execution time stalling and waiting for long-latency cache misses to return from main memory. And the problem is getting worse because memory latencies are increasing in terms of processor cycles. The runahead execution paradigm improves the memory latency tolerance of an out-of-order execution processor by performing potentially useful execution while a long-latency cache miss is in progress. Runahead execution unblocks the instruction window blocked by a long-latency cache miss, allowing the processor to execute far ahead in the program path. As a result, other long-latency cache misses are discovered and their data is prefetched into the caches long before it is needed. This dissertation presents the runahead execution paradigm and its implementation on an out-of-order execution processor that employs state-of-the-art hardware prefetching techniques. It is shown that runahead execution on a 128-entry instruction window achieves the performance of a processor with three times the instruction window size for a current, 500-cycle memory latency. For a near-future 1000-cycle memory latency, it is shown that runahead execution on a 128-entry window achieves the performance of a conventional processor with eight times the instruction window size, without requiring a significant increase in hardware cost and complexity. This dissertation also examines and provides solutions to two major limitations of runahead execution: its energy inefficiency and its inability to parallelize dependent cache misses. Simple and effective techniques are proposed to increase the efficiency of runahead execution by reducing the extra instructions executed without affecting the performance improvement. An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance. Finally, this dissertation proposes a new technique, called address-value delta (AVD) prediction, that predicts the values of pointer load instructions encountered in runahead execution in order to enable the parallelization of dependent cache misses using runahead execution. It is shown that a simple 16-entry AVD predictor improves the performance of a baseline runahead execution processor by 14.3% on a set of pointer-intensive applications, while it also reduces the executed instructions by 15.5%. An analysis of the high-level programming constructs that result in AVD-predictable load instructions is provided. Based on this analysis, hardware and software optimizations are proposed to increase the benefits of AVD prediction.
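The address-value delta idea lends itself to a small illustration: for some pointer loads, the loaded value differs from the load's effective address by a stable delta, so the value can be guessed as address + delta when the load misses during runahead. The sketch below is illustrative; the table size, confidence threshold, and replacement rule are assumptions, not the dissertation's exact design.

```python
class AVDPredictor:
    """Sketch of address-value delta (AVD) prediction: per static load, learn
    a stable difference between the load address and the loaded value, and
    predict value = address + delta once the delta has repeated.
    Sizes and thresholds are assumptions for illustration."""

    def __init__(self, entries=16, confidence_needed=2, max_delta=1 << 16):
        self.entries = entries
        self.conf_needed = confidence_needed
        self.max_delta = max_delta
        self.table = {}                    # load PC -> (delta, confidence)

    def train(self, load_pc, address, value):
        delta = value - address
        if abs(delta) > self.max_delta:    # wild deltas are unlikely to repeat
            self.table.pop(load_pc, None)
            return
        old = self.table.get(load_pc)
        if old and old[0] == delta:
            self.table[load_pc] = (delta, min(old[1] + 1, 3))
        else:
            if load_pc not in self.table and len(self.table) >= self.entries:
                self.table.pop(next(iter(self.table)))   # crude replacement
            self.table[load_pc] = (delta, 1)

    def predict(self, load_pc, address):
        entry = self.table.get(load_pc)
        if entry and entry[1] >= self.conf_needed:
            return address + entry[0]      # predicted loaded value
        return None                        # no confident prediction
```

During runahead, a confident prediction lets dependent loads compute their own addresses and miss in parallel with the original miss, which is the effect the abstract describes.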
Item: Hardware techniques to reduce communication costs in multiprocessors (2006). Huh, Jaehyuk; Burger, Douglas C., Ph. D.

This dissertation explores techniques for reducing the costs of inter-processor communication in shared memory multiprocessors (MPs). We seek to improve MP performance by enhancing three aspects of multiprocessor cache designs: miss reduction, low communication latency, and high coherence bandwidth. In this dissertation, we propose three techniques to enhance these three factors: shared non-uniform cache architecture, coherence decoupling, and subspace snooping. As a miss reduction technique, we investigate shared cache designs for future Chip Multiprocessors (CMPs). Cache sharing can reduce cache misses by eliminating unnecessary data duplication and by reallocating the cache capacity dynamically. We propose a reconfigurable shared non-uniform cache architecture and evaluate the trade-offs of cache sharing with varied sharing degrees. Although shared caches can improve caching efficiency, their most significant disadvantage is the increase in cache hit latencies. To mitigate the effect of the long latencies, we evaluate two latency management techniques, dynamic block migration and L1 prefetching. However, improving caching efficiency does not reduce the cache misses induced by MP communication. For such communication misses, the latencies of cache coherence should be either reduced or hidden, and the coherence bandwidth should scale with the number of processors. To mitigate long communication latencies, coherence decoupling uses speculation for communication data. Coherence decoupling allows processors to run speculatively past communication misses using predicted values. Our prediction mechanism, called the Speculative Cache Lookup (SCL) protocol, uses stale values in the local caches. We show that the SCL read component can hide false sharing and silent store misses effectively. We also investigate the SCL update component to hide the latencies of truly shared misses by updating invalid blocks speculatively. To improve the coherence bandwidth, we propose subspace snooping, which improves the snooping bandwidth of future large-scale shared-memory machines. Even with huge optical bus bandwidth, traditional snooping protocols may not scale to hundreds of processors, since all processors must respond to every bus access. Subspace snooping allows only a subset of processors to be snooped on a bus access, thus increasing the effective snoop tag bandwidth. We evaluate subspace snooping with a large broadcasting bandwidth provided by optical interconnects.

Item: Improving program locality on-the-fly (2006-05). Huang, Xianglong, 1975-; McKinley, Kathryn S.

As increases in processor speed continue to outpace increases in cache and memory speed, programs are losing more performance to poor locality. Object-oriented languages exacerbate this problem by adopting new features such as just-in-time (JIT) compilation, dynamic class loading, and many small methods. However, they provide significant software engineering benefits and have become enormously popular. Solutions that bridge the memory gap will combine good performance with fast software development. We found that although unique features of object-oriented languages, such as automatic memory management, dynamic compilation, and runtime monitoring systems, generate performance overhead, they also provide new opportunities for online optimizations which are not exploited by previous work. In this thesis, we take advantage of these opportunities with new approaches that improve data and instruction locality at runtime with low overhead. To improve data locality, we first implement a new dynamic, low-overhead, online class analysis to find data locality. Our algorithm detects hot fields for hot objects and then reorders the objects according to their heat on-the-fly in a copying generational collector. The overall time variation between static orderings can be up to 25%, and there is no consistent winner. In contrast, our new dynamic class reordering always matches or improves over the best static ordering, since its history-based copying order tunes memory layout to program traversal. To improve instruction locality, we develop two schemes for improving instruction locality in a Java Virtual Machine environment. We first describe a partial code reordering system, which reduces the program instruction working set and cache conflict misses with extremely low overhead. We then present a code management system that uses dynamic profiling to reorder all JIT-compiled code to improve instruction locality with novel efficient algorithms. Both systems show that the VM can dynamically improve instruction locality with little overhead. These results indicate that the VM layer for modern languages is not just a cost to be borne, but instead opens up a new class of optimizations for monitoring and improving data and instruction locality, and code quality. Thus these results point to a future of programming languages that are robust, dependable, and high-performance.
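The data-locality technique in the preceding entry, reordering objects by field "heat" as a copying collector runs, can be sketched roughly as follows. The heat map, the depth-first traversal, and the two-space copier below are simplifications chosen for illustration; this is not the collector's actual algorithm.

```python
def copy_in_heat_order(roots, field_heat):
    """Rough sketch of heat-guided copying: when an object is copied, its
    referents are visited hottest-field-first, so objects reached through
    frequently accessed fields end up adjacent in the new space.
    'field_heat' maps (type name, field name) -> an access count and stands
    in for the online class analysis described above.
    Objects are dicts with a '_type' tag; reference fields hold other dicts."""
    to_space, forwarded = [], {}

    def copy(obj):
        if id(obj) in forwarded:           # already copied (handles cycles)
            return forwarded[id(obj)]
        to_space.append(obj)               # "copy" = append in discovery order
        forwarded[id(obj)] = obj
        ref_fields = [f for f, v in obj.items()
                      if f != "_type" and isinstance(v, dict)]
        # Visit the hottest fields first so their targets are laid out nearby.
        ref_fields.sort(key=lambda f: field_heat.get((obj["_type"], f), 0),
                        reverse=True)
        for f in ref_fields:
            copy(obj[f])
        return obj

    for r in roots:
        copy(r)
    return to_space                        # the new layout order
```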
Item: Prefetch mechanisms by application memory access pattern (2007-05). Agaram, Kartik Kandadai; Keckler, Stephen W.

Modern computer systems spend a substantial fraction of their running time waiting for data from memory. While prefetching has been a promising avenue of research for reducing and tolerating latencies to memory, it has also been a challenge to implement. This challenge exists largely because of the growing complexity of memory hierarchies and the wide variety of application behaviors. In this dissertation we propose a new methodology that emphasizes decomposing complex behavior at the application level into regular components that are intelligible at a high level to the architect. This dissertation is divided into three stages. In the first, we build tools to help decompose application behavior by data structure and phase, and use these tools to create a richer picture of application behavior than with conventional simulation tools, yielding compressed summaries of dominant access patterns. The variety of access patterns drives the next stage: design of a prefetch system that improves on the state of the art. Every prefetching system must make low-overhead decisions on what to prefetch, when to prefetch it, and where to store prefetched data. Visualizing application access patterns allows us to articulate the subtleties in making these decisions and the many ways in which a mechanism that improves one decision for one set of applications may degrade the quality of another decision for a different set. Our insights lead us to a new system called TwoStep with a small set of independent but synergistic mechanisms. In the third stage we perform a detailed evaluation of TwoStep. We find that while it outperforms past approaches for the most irregular applications in our benchmark suite, it is unable to improve on the speedups for more regular applications. Understanding why leads to an improved understanding of two general categories of prefetch techniques. Prefetching can either look back at past history or look forward by precomputing an application's future requirements. Applications with a low compute-access ratio can benefit from history-based prefetching if their access pattern is not too irregular. Applications with irregular access patterns may benefit from precomputation-based prefetching, as long as their compute-access ratio is not too low.
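The closing distinction above separates history-based from precomputation-based prefetching. The sketch below shows the textbook history-based baseline, a per-PC stride prefetcher; it is not the TwoStep system described in the abstract (whose mechanisms are not detailed here), and the prefetch degree and table policy are assumptions.

```python
class StridePrefetcher:
    """Minimal example of the 'look back at past history' style of prefetching:
    per static load, remember the last address and stride, and once the stride
    repeats, prefetch a few addresses ahead. A textbook baseline, not TwoStep."""

    def __init__(self, degree=2):
        self.degree = degree
        self.table = {}        # load PC -> (last address, stride, confident?)

    def observe(self, load_pc, address):
        """Record a demand access; return the list of addresses to prefetch."""
        prefetches = []
        last = self.table.get(load_pc)
        if last is None:
            self.table[load_pc] = (address, 0, False)
            return prefetches
        last_addr, stride, confident = last
        new_stride = address - last_addr
        if new_stride == stride and stride != 0:
            if confident:
                # Stride has repeated: issue prefetches for upcoming addresses.
                prefetches = [address + stride * i
                              for i in range(1, self.degree + 1)]
            self.table[load_pc] = (address, stride, True)
        else:
            self.table[load_pc] = (address, new_stride, False)
        return prefetches
```

As the abstract notes, such history-based schemes help low compute-to-access applications only when the access pattern is fairly regular; irregular pointer-chasing patterns are where precomputation-based approaches become attractive.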
Item: Scalable primary cache memory architectures (2004). Agarwal, Vikas; John, Lizy Kurian

For the past decade, microprocessors have been improving in overall performance at a rate of approximately 50–60% per year by exploiting a rapid increase in clock rate and improving instruction throughput. Part of this trend has been the growth of on-chip caches, which in modern processors can be as large as 2 MB. However, as smaller technologies become prevalent, achieving low average memory access time by simply scaling existing designs becomes more difficult because of process limitations. This research shows that scaling an existing design by either keeping the latency of various structures constant or allowing the latency to vary while keeping the capacity constant leads to degradation in instructions per cycle (IPC). The goal of this research is to improve IPC at small feature sizes, using a combination of circuit and architectural techniques. This research develops technology-based models to estimate cache access times and uses the models for architectural performance estimation. The performance of a microarchitecture with clustered functional units coupled with a partitioned primary data cache is estimated using the cache access time models. This research evaluates both static and dynamic data mapping on the partitioned primary data cache and shows that dynamic mapping in combination with the partitioned cache outperforms both the unified cache and the statically mapped design. In conjunction with the dynamic data mapping, this research proposes and evaluates predictive instruction steering strategies that help improve the performance of clustered processor designs. This research shows that a hybrid predictive instruction steering policy coupled with an aggressive dynamic mapping of data in a partitioned primary data cache can significantly improve the instructions per cycle (IPC) of a clustered processor, relative to dependence-based steering with a unified data cache.

Item: Volume lease: a scalable cache consistency framework (2003). Yin, Jian; Dahlin, Mike; Alvisi, Lorenzo