Browsing by Subject "Computer architecture"

Now showing 1 - 20 of 52

Accurate modeling of core and memory locality for proxy generation targeting emerging applications and architectures
(2017-12) Panda, Reena; John, Lizy Kurian; Swartzlander, Earl E; Khurshid, Sarfraz; Gerstlauer, Andreas; Ganesan, Karthik
Designing optimal computer systems for improved performance and energy efficiency requires architects and designers to have a deep understanding of the end-user workloads. However, many end-users (e.g., large corporations, banks, defense organizations, etc.) are apprehensive to share their applications with designers due to the confidential nature of software code and data. In addition, emerging applications pose significant challenges to early design space exploration due to their long-running nature and the highly complex nature of their software stack that cannot be supported on many early performance models. The above challenges can be overcome by using a proxy benchmark. A miniaturized proxy benchmark can be used as a substitute of the original workload to perform early computer performance evaluation. The process of generating a proxy benchmark consists of extracting a set of key statistics to summarize the behavior of end-user applications through profiling and using the collected statistics to synthesize a representative proxy benchmark. Using such proxy benchmarks can help designers to understand the behavior of end-user’s workloads in a reasonable time without the users having to disclose sensitive information about their workloads. Prior proxy benchmarking schemes leverage micro-architecture independent metrics, derived from detailed simulation tools, to generate proxy benchmarks. However, many emerging workloads do not work reliably with many profiling or simulation tools, in which case it becomes impossible to apply prior proxy generation techniques to generate proxy benchmarks for such complex applications. Furthermore, these techniques model instruction pipeline-level locality in great detail, but abstract out memory locality modeling using simple stride-based models. This results in poor cloning accuracy especially for emerging applications, which have larger memory footprints and complex access patterns. A few detailed cache and memory locality modeling techniques have also been proposed in literature. However, these techniques either model limited locality metrics and suffer from poor cloning accuracy or are fairly accurate, but at the expense of significant metadata overhead. Finally, none of the prior proxy benchmarking techniques model both core and memory locality with high accuracy. As a result, they are not useful for studying system-level performance behavior. Keeping the above key limitations and shortcomings of prior work in mind, this dissertation presents several techniques that expand the frontiers of workload proxy benchmarking, thereby enabling computer designers to gain a better and faster understanding of end-user application behavior without compromising the privileged nature of software or data. This dissertation first presents a core-level proxy benchmark generation methodology that leverages performance metrics derived from hardware performance counter measurements to create miniature proxy benchmarks targeting emerging big-data applications. The presented performance counter based characterization and associated extrapolation into generic parameters for proxy generation enables faster analysis (runs almost at native hardware speeds, unlike prior workload cloning proposals) and proxy generation for emerging applications that do not work with simulators or profiling tools. The generated proxy benchmarks are representative of the performance of the real-world big-data applications, including operating system and run-time effects, and yet converge to results quickly without needing any complex software stack support. Next, to improve upon the accuracy and efficiency of prior memory proxy benchmarking techniques, this dissertation presents a novel memory locality modeling technique that leverages localized pattern detection to create miniature memory proxy benchmarks. The presented technique models memory reference locality by decomposing an application’s memory accesses into a set of independent streams (localized by using address region based localization property), tracking fine-grained patterns within the localized streams and, finally, chaining or interleaving accesses from different localized memory streams to create an ordered proxy memory access sequence. This dissertation further extends the workload cloning approach to Graphics Processing Units (GPUs) and presents a novel proxy generation methodology to model the inherent memory access locality of GPU applications, while also accounting for the GPU’s parallel execution model. The generated memory proxy benchmarks help to enable fast and efficient design space exploration of futuristic memory hierarchies. Finally, this dissertation presents a novel technique to integrate accurate core and memory locality models to create system-level proxy benchmarks targeting emerging applications. This is a new capability that can facilitate efficient overall system (core, cache and memory subsystem) design-space exploration. This dissertation further presents a novel methodology that exploits the synthetic benchmark generation framework to create hypothetical workloads with performance behavior that does not currently exist. Such proxies can be generated to cover anticipated code trends and can represent futuristic workloads before the workloads even exist.
Adaptive predication via compiler-microarchitecture cooperation
(2007) Kim, Hyesoon, 1974-; Patt, Yale N.
Even after decades of research in branch prediction, branch predictors still remain imperfect, which results in significant performance loss in aggressive processors that sup- port large instruction windows and deep pipelines. Predicated execution can reduce the number of branch mispredictions by eliminating hard-to-predict branches. However, the additional instruction overhead and data dependencies due to predicated execution some- times offset the performance benefits of having fewer mispredictions. This dissertation presents two cooperative compiler-microarchitecture mechanisms to reduce the branch mis- prediction penalty by combining predicated execution and branch prediction. The first mechanism is a set of new control flow instructions, called wish branches. With wish branches, the compiler generates code that can be executed either as normal branch code or as predicated code. At run-time, the hardware chooses between normal branch code and predicated code based on the run-time branch behavior and the estimated run-time effectiveness of each solution. The results show that wish branches can signifi- cantly improve both performance and energy efficiency compared to predication or branch prediction. To provide the benefit of predicated code to non-predicated Instruction Set Archi- tectures (ISAs) and to increase the benefit of predicated execution beyond the benefit of wish branches, this dissertation also presents and evaluates the Diverge-Merge Processor (DMP) architecture. In the diverge-merge processor, the compiler analyzes the control-flow graphs of the program and marks branches suitable for dynamic predication –called di- verge branches– and their corresponding control flow merge points. The hardware not only chooses whether to use branch prediction or predication, but also decides “which” instruc- tions after a branch should be predicated based on run-time branch behavior. This solution significantly reduces the overhead of predicated code and allows a very large set of control- flow graphs to be predicated, neither of which was possible previously because predication was performed statically without any run-time information. This dissertation compares DMP with all other major previously-proposed branch processing paradigms available in the literature in terms of performance, power, energy consumption, and complexity. The results show that DMP is the most energy-efficient and high-performance paradigm for branch handling. Code generation algorithms for the DMP architecture and cost-benefit analysis models of dynamic predication are also evaluated.
Advancing value prediction
(2019-05-09) Subramanian, Anjana; Lin, Yun Calvin
Read after write dependencies form a key bottleneck in single thread performance. Value prediction [9][10][18] is a speculative technique that overcomes these dependencies by predicting results of instruction execution, thereby preventing dependent instructions from stalling. Usually, the penalties for value mispredictions are extremely high. As a result, value predictors have evolved to prioritize accuracy over coverage. To improve upon the state-of-the-art, our goals are: (i) to develop more powerful prediction mechanisms that have a better accuracy-coverage tradeoff (ii) to maximize performance gains obtained from correct predictions. We present two independent pieces of work that address each of these. To achieve the first goal, we design a Heterogeneous Context-based Value Predictor (HCVP) that combines the use of branch history with value history to represent program context information. We demonstrate that this combination provides better predictability than using either of them individually and that it allows for the use of relatively short value history lengths that provide more coverage than very long ones. HCVP does not maintain speculative value histories as it more tolerant to the update problem that occurs when back to back instances of the same instruction are predicted. Our predictor performs better than the state-of-the-art value predictors (E VTAGE and DFCM++) to achieve a 29% speedup over a baseline with no value prediction. When combined with the E Stride predictor, it achieves a speedup of 46%, which is 9% higher than that achieved by E VTAGE E Stride (EVES), the winner of the First Championship Value Prediction. To achieve the second goal, we exploit the fact that some instructions are more performance critical than others. We categorize instructions by various parameters to find one or more classes of instructions that provide high performance benefits for correct predictions. We find that loads, address producing instructions, and high fanout instructions are extremely beneficial for value prediction.
Architectural techniques to accelerate multimedia applications on general-purpose processors
(2001-08) Talla, Deependra, 1975-; John, Lizy Kurian
General-purpose processors (GPPs) have been augmented with multimedia extensions to improve performance on multimedia-rich workloads. These extensions operate in a single instruction multiple data (SIMD) fashion to extract data level parallelism in multimedia and digital signal processing (DSP) applications. This dissertation consists of a comprehensive evaluation of the execution characteristics of multimedia applications on SIMD enhanced GPPs, detection of bottlenecks in the execution of multimedia applications on SIMD enhanced GPPs, and the design and implementation of architectural techniques to eliminate and alleviate the impact of the various bottlenecks to accelerate multimedia applications. This dissertation identifies several bottlenecks in the processing of SIMD enhanced multimedia and DSP applications on GPPs. It is found that approximately 75-85% of instructions in the dynamic instruction stream of media workloads are not performing useful computations but merely supporting the useful computations by performing address generation, address transformation/data reorganization, loads/stores, and loop branches. This leads to an underutilization of the SIMD computation units with only 1-12% of the peak SIMD throughput being achieved. This dissertation proposes the use of hardware support to efficiently execute the overhead/supporting instructions by overlapping them with the useful computation instructions. A 2-way GPP with SIMD extensions augmented with the proposed MediaBreeze hardware significantly outperforms a 16-way SIMD GPP without MediaBreeze hardware on multimedia kernels. On multimedia applications, a 2-/4-way SIMD GPP augmented with MediaBreeze hardware is superior to a 4-/8-way SIMD GPP without MediaBreeze hardware. The performance improvements are achieved at an area cost that is less than 0.3% of current GPPs and power consumption that is less than 1% of the total processor power without elongating the critical path of the processor.
Architectures and algorithms for high performance switching
(2004) Prakash, Amit; Aziz, Adnan
Switches are ubiquitous in modern computing, appearing in wide-area networks, multiprocessor servers, and data storage systems. With the the advent of high-speed link technology, switches have become the bottleneck in moving data in the network. Existing switch architectures either require the interconnection network and packet buffers to work at a very high speed or require complex scheduling problems to be solved quickly. In this dissertation we investigate whether there are switch architectures that can support high-speed links that are simultaneously easy to schedule, and can be built out of inexpensive components. The approach we take is using parallelism to solve complex scheduling problems. We choose switching architectures such that the corresponding scheduling problem can be efficiently solved with a reasonable amount of hardware. In particular, we present two switch architectures for which we have developed efficient scheduling algorithms. The first switch achieves optimum throughput and optimum average latency while the second switch guarantees optimum throughput only but uses considerably less hardware.
Atomic block formation for explicit data graph execution architectures
(2010-08) Maher, Bertrand Allen; McKinley, Kathryn S.; Burger, Douglas C., Ph. D.; Keckler, Stephen W.; Mahlke, Scott A.; Pingali, Keshav
Limits on power consumption, complexity, and on-chip latency have focused computer architects on power-efficient designs that exploit parallelism. One approach divides programs into atomic blocks of operations that execute semi-independently, which efficiently creates a large window of potentially concurrent operations. This dissertation studies the intertwined roles of the compiler, architecture, and microarchitecture in achieving efficiency and high performance with a block-atomic architecture. For such an architecture to achieve high performance the compiler must form blocks effectively. The compiler must create large blocks of instructions to amortize the per-block overhead, but control flow and content restrictions limit the compiler's options. Block formation should consider factors such of frequency of execution, block size such as selecting control-flow paths that are frequently executed, and exploiting locality of computations to reduce communication overheads. This dissertation determines what characteristics of programs influence block formation and proposes techniques to generate effective blocks. The first contribution is a method for solving phase-ordering problems inherent to block formation, mitigating the tension between block-enlarging optimizations---if-conversion, tail duplication, loop unrolling, and loop peeling---as well as scalar optimizations. Given these optimizations, analysis shows that the remaining obstacles to creating larger blocks are inherent in the control flow structure of applications, and furthermore that any fixed block size entails a sizable amount of wasted space. To eliminate this overhead, this dissertation proposes an architectural implementation of variable-size blocks that allow the compiler to dramatically improve block efficiency. We use these mechanisms to develop policies for block formation that achieve high performance on a range of applications and processor configurations. We find that the best policies differ significantly depending on the number of participating cores. Using machine learning, we discover generalized policies for particular hardware configurations and find that the best policy varies significantly between applications and based on the number of parallel resources available in the microarchitecture. These results show that effective and efficient block-atomic execution is possible when the compiler and microarchitecture are designed cooperatively.
Braids: out-of-order performance with almost in-order complexity
(2007-12) Tseng, Francis, 1976-; Patt, Yale N.
There is still much performance to be gained by out-of-order processors with wider issue widths. However, traditional methods of increasing issue width do not scale; that is, they drastically increase design complexity and power requirements. This dissertation introduces the braid, a compile-time generated entity that enables the execution core to scale to wider widths by exploiting the small fanout and short lifetime of values produced by the program. A braid captures dataflow and register usage information of the program which are known to the compiler but are not traditionally conveyed to the microarchitecture through the instruction set architecture. Braid processing requires identification by the compiler, minor augmentations to the instruction set architecture, and support by the microarchitecture. The execution core of the braid microarchitecture consists of a number of braid execution units (BEUs). The BEU is tailored to efficiently carry out the execution of a braid in an in-order fashion. Each BEU consists of a FIFO scheduler, a busy-bit vector, two functional units, and a small internal register file. The braid microarchitecture provides a number of opportunities for the reduction of design complexity. It reduces the port requirements of the renaming mechanism, it simplifies the steering process, it reduces the area, size, and port requirements of the register file, and it reduces the paths and port requirements of the bypass network. The complexity savings result in a design characterized by a lower power requirement, a shorter pipeline, and a higher clock frequency. On an 8-wide design, the result from executing braids is performance within 9% of a very aggressive conventional out-of-order microarchitecture with the complexity of an in-order implementation. Three bottlenecks are identified in the braid microarchitecture and a solution is presented to address each. The limitation on braid size is addressed by dynamic merging. The underutilization of braid execution resources caused by long-latency instructions is addressed by context sharing. The poor utilization of braid execution resources caused by single-instruction braids is addressed by heterogeneous execution resources.
Bridging the gap between mobile CPU design and user satisfaction via crowdsourcing
(2016-05) Halpern, Matthew Franklin; Janapa Reddi, Vijay; Tiwari, Mohit
This report aims to provide an understanding of how the mobile CPU designs have evolved and its influence on end-user satisfaction. To that end, a quantitative performance analysis is conducted across ten cutting-edge mobile CPU designs studied within top-selling off-the-shelf smartphones released over the past seven years. This analysis is then used to guide a large-scale user study spanning over 25,000 participants via crowdsourcing on the Amazon Mechanical Turk service. The user study asks participants to assess the responsiveness of interactive application use cases for a set of current-generation applications (e.g. Angry Birds and FaceBook) and next-generation applications (i.e. face recognition and augmented reality) relative to the performance capabilities of the devices studied. This framework allows us to quantitatively link how the mobile CPU designs studied impacted end-user satisfaction. The study results indicate that mobile CPU designs have exhibited signifiant performance improvements through aggressive core scaling techniques prevalent in desktop CPUs. Just as was observed in desktop CPU design, these same techniques have lead to excessive mobile CPU power consumption. However, from an end-user perspective this power consumption was not without success. Mobile CPUs have evolved to provide satisfactory experiences for the studied current- generation applications. The reason is that many of these applications rely heavily on single-threaded performance. Other, more recent applications, actually multi-thread user-critical parts of the applications, which also demonstrates that multi- core mobile CPUs are an important design consideration – contrary to conventional wisdom. However, looking ahead, the same mobile CPUs where not able to provide satisfactory experiences for many of the next-generation applications studied, questioning the sustainability of these power-hungry design techniques in future mobile CPU designs.
Constructing adaptable and scalable synthetic benchmarks for microprocessor performance evaluation
(2007-12) Joshi, Ajay Manohar, 1976-; John, Lizy Kurian
Benchmarks set standards for innovation in computer architecture research and industry product development. Consequently, it is of paramount importance that the benchmarks used in computer architecture research and development are representative of real-world applications. However, composing such representative workloads poses practical challenges to application analysis teams and benchmark developers - (1) Benchmarks that are standardized are open-source whereas applications of interest are typically proprietary, (2) Benchmarks are rigid, measure single-point performance, and only represent a sample of the application behavior space (3) Benchmark suites take several years to develop, but applications evolve at a faster rate, and (4) Benchmarks geared towards temperature and power characterization are difficult to develop and standardize. The objective of this dissertation is to develop an adaptive benchmark generation strategy to construct synthetic benchmarks to address these benchmarking challenges. We propose an approach for automatically distilling key hardware-independent performance attributes of a proprietary workload and capture them into a miniature synthetic benchmark clone. The advantage of the benchmark clone is that it hides the functional meaning of the code, but exhibits similar performance and power characteristics as the target application across a wide range of microarchitecture configurations. Moreover, the dynamic instruction count of the synthetic benchmark clone is substantially shorter than the proprietary application, greatly reducing overall simulation time -- for the SPEC CPU 2000 suite, the simulation time reduction is over five orders of magnitude compared to the entire benchmark execution. We develop an adaptive benchmark generation strategy that trades off accuracy to provide the flexibility to easily alter program characteristics. The parameterization of workload metrics makes it possible to succinctly describe an application's behavior using a limited number of fundamental program characteristics. This provides the ability to alter workload characteristics and construct scalable benchmarks that allows researchers to explore a wider range of the application behavior space, conduct program behavior studies, and model emerging workloads. The parameterized workload model is the foundation for automatically constructing power and temperature oriented synthetic workloads. We show that machine learning algorithms can be effectively used to search the application behavior space to automatically construct benchmarks for evaluating the power and temperature characteristics of a computer architecture design. The need for a scientific approach to construct synthetic benchmarks, to complement application benchmarks, has long been recognized by the computer architecture research community, and this dissertation work is a significant step towards achieving that goal.
CORDIC-based high-speed direct digital frequency synthesis
(2003) Kang, Chang Yong; Swartzlander, Earl E.
The circular-mode CORDIC (coordinate rotation digital computer) algorithm is analyzed in the context of direct digital frequency synthesis (DDFS). It is shown how the CORDIC parameters should be selected to meet given DDFS parameters, and, through simulations, it is demonstrated that jamming outperforms truncation as a datapath quantization scheme, making it an attractive alternative to rounding. It is also shown that the CORDIC output can be made exact to the digits by an additional rounding process, which is especially useful for DDFS applications where the CORDIC output is truncated to the final digital-to-analog converter (DAC) width. Variations of the basic circular-mode CORDIC algorithm are investigated, and previous works for fast DDFS implementation based on those algorithms are discussed. A new DDFS architecture based on the differential CORDIC (DCORDIC) algorithm is proposed. The proposed architecture allows a bit-level pipelining in the angle path by implementing a two-dimensional systolic array. Unlike other fast DDFS architectures, it incorporates the phase accumulator in the bit-level pipelining framework so that a bottleneck-free datapath throughout the whole system is achieved. The sequential implementation and the hybrid implementation of the architecture for area-sensitive and low-latency designs, respectively, are proposed for further research.
Delay-sensitive branch predictors for future technologies
(2002-05) Jiménez, Daniel Angel, 1969-; Lin, Yun Calvin
Design of platforms for computing context with spatio-temporal locality
(2011-05) Ziotopoulos, Agisilaos Georgios; De Veciana, Gustavo; Garg, Vijay; Mok, Al; Julien, Christine; Touba, Nur; Breternitz, Mauricio
This dissertation is in the area of pervasive computing. It focuses on designing platforms for storing, querying, and computing contextual information. More specifically, we are interested in platforms for storing and querying spatio-temporal events where queries exhibit locality. Recent advances in sensor technologies have made possible gathering a variety of information on the status of users, the environment machines, etc. Combining this information with computation we are able to extract context, i.e., a filtered high-level description of the situation. In many cases, the information gathered exhibits locality both in space and time, i.e., an event is likely to be consumed in a location close to the location where the event was produced, at a time whic h is close to the time the event was produced. This dissertation builds on this observation to create better platforms for computing context. We claim three key contributions. We have studied the problem of designing and optimizing spatial organizations for exchanging context. Our thesis has original theoretical work on how to create a platform based on cells of a Voronoi diagram for optimizing the energy and bandwidth required for mobiles to exchange contextual information t hat is tied to specific locations in the platform. Additionally, we applied our results to the problem of optimizing a system for surveilling the locations of entities within a given region. We have designed a platform for storing and querying spatio-temporal events exhibiting locality. Our platform is based on a P2P infrastructure of peers organized based on the Voronoi diagram associated with their locations to store events based on their own associated locations. We have developed theoretical results based on spatial point processes for the delay experienced by a typical query in this system. Additionally, we used simulations to study heuristics to improve the performance of our platform. Finally, we came up with protocols for the replicated storage of events in order to increase the fault-tolerance of our platform. Finally, in this thesis we propose a design for a platform, based on RFID tags, to support context-aware computing for indoor spaces. Our platform exploits the structure found in most indoor spaces to encode contextual information in suitably designed RFID tags. The elements of our platform collaborate based on a set of messages we developed to offer context-aware services to the users of the platform. We validated our research with an example hardware design of the RFID tag and a software emulation of the tag's functionality.
Design of wide-issue high-frequency processors in wire delay dominated technologies
(2004) Murukkathampoondi, Hrishikesh Sathyavasu; Burger, Douglas C., Ph. D.; Chase, Craig M.
Designing on-chip memory systems for throughput architectures
(2015-12) Diamond, Jeffrey Robert; Fussell, Donald S., 1951-; Keckler, Stephen W.; van de Geijn, Robert; Lin, Calvin; Eijkhout, Victor
Driven by the high arithmetic intensity and embarrassingly parallel nature of real time computer graphics, GPUs became the first wide spread throughput architecture. With the end of Dennard scaling and the plateau of single thread performance, nearly all computer chips at all scales have now become explicitly parallel, containing a hierarchy of cores and threads. Initially, these individual cores were imagined to be no different from traditional uniprocessors, and parallel programs no different than traditional parallel programs. Like GPUs, these modern chips share finite on-chip resources between threads. This results in novel performance and optimization issues at any granularity of parallelism, from cell phones to GPUs. Unfortunately, the performance characteristics of these systems tend to be non-linear and counter-intuitive. The programmer’s software stack has been slow in adapting to this paradigm shift. Compilers still focus primarily on optimizing single thread performance at the expense of throughput. Existing parallel applications are not a perfect match for modern multicore, multithreaded processors. And existing methodologies for performance analysis and simulation are not aligned with multicore issues. This dissertation begins with a mathematical analysis of throughput performance in the presence of shared on-chip resources. When cache hit rates begin to fall, there is a steep drop off in throughput performance. An optimistic view of this regime is that even small improvements to cache efficiency offer significant benefits. This motivates the exploration of general throughput optimizations in both hardware and software that apply to both coarse-grained and fine-grained parallel architectures, requiring no programmer intervention or tuning. This dissertation provides two such solutions. The first solution is a compiler optimization called “loop microfission” that can boost throughput performance by up to 50%. In the context of the intrachip scalability of supercomputing applications, we demonstrate the failings of conven- tional performance tuning software and compiler algorithms in the presence of shared resources. We introduce a new approach to throughput optimization, including a memory friendly performance analysis tool, and show that techniques for throughput optimization are similar to traditional optimizations, but require new priorities. The second solution is a hardware optimization called Arbitrary Modulus In- dexing (AMI), a technique that generalizes efficient implementations of the DIV/- MOD operation from Mersenne Primes to all integers. We show that the primary performance bottlenecks in modern GPUs for regular, memory intensive applications are bank and set conflicts in the shared on-chip memory system. AMI completely eliminates conflicts in all facets of the memory system at negligible hardware cost, and has even broader potential for optimizations throughout computer architecture.
Distributed selective re-execution for EDGE architectures
(2005) Desikan, Rajagopalan; Burger, Douglas C., Ph. D.
Speculation is a key technique that modern processors use to achieve high performance. Traditionally, speculation meant control speculation, in which the processor predicts the outcome of control instructions when they are fetched, and validates the prediction when the instructions are executed. More recently, processors have adopted another form of speculation called data speculation to improve performance. Data speculation involves the prediction of the data values produced by instructions, and forwarding the predicted values to consumers in the data-flow graph. For both control and data speculation, mis-speculation recovery is required when the speculation is incorrect. The conventional mechanism for mis-speculation recovery consists of flushing the processor pipeline of all incorrect state and restarting execution from the corrected state. However, pipeline flushes have become increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculation recovery by re-executing only instructions that received incorrect values due to the mis-speculation. Conventional mechanisms to implement selective re-execution have had limited success because of the enormous complexity involved in the implementation. In this dissertation, we introduce a new selective re-execution mechanism that exploits the properties of a data flow-like Explicit Data Graph Execution (EDGE) architecture to support efficient mis-speculation recovery, while scaling to large window sizes. This Distributed Selective Re-Execution (DSRE) mechanism permits multiple speculative waves of computation to traverse a data flow graph simultaneously. The mechanism has no centralized state, and uses simple state bits to determine instructions to re- re on a mis-speculation, thus reducing the complexity of selective re-execution. We evaluate DSRE as a recovery mechanism for load-store dependence mis-speculation on a high-level EDGE architecture simulator, the Grid Processor Architecture (GPA) simulator, and on the more detailed TRIPS prototype processor simulator. DSRE provides 17% and 4.2% speedup, respectively, over dependence prediction, on the two simulators. Our results show that DSRE needs to be used in conjunction with pipeline flushing to achieve high performance. Predictors need to be aware of the the costs associated with each mechanism, and use the appropriate recovery mechanism for each speculation.
Efficient deep neural network model training by reducing memory and compute demands
(2019-12) Lym, Sangkug; Erez, Mattan; Clemons, Jason; Gerstlauer, Andreas M; Sujay Sanghavi; Orshansky, Micheal E
Deep neural network models are commonly used in various real-life applications due to their high prediction accuracy for different tasks. In particular, CNN (convolutional neural network) models have become the de facto choices for most vision applications such as image classification, object segmentation, and object detection. Modern CNN models contain hundreds of million of parameters and training them requires millions of computation- and memory access-heavy iterations. To reduce this expensive CNN model training cost, this dissertation presents computation and memory cost-efficient training mechanisms with a combination of workload scheduling, learning algorithm, and accelerator architecture optimizations. This dissertation also introduces a performance model for data-parallel accelerators as a fast and accurate method to estimate the performance impact of the proposed architectural optimizations and to help fine-grain accelerator design space exploration. The first part of this dissertation discusses reducing the memory bandwidth demand for CNN training. I first analyze data reuse opportunities in CNN training and show that CNN training has high data locality between network layers but that conventional training mechanisms fail to utilize this inter-layer locality. Then, I develop a CNN training scheduling mechanism that modifies the network execution ordering in a way that captures the inter-layer locality while supporting high compute resource utilization. I also introduce a training accelerator that adopts architectural optimizations that hide additional data transfers caused by the proposed scheduling modification and realize effective training speedup. The proposed training accelerator has 45 mixed precision FLOPS and, with the memory bandwidth-efficient network training scheduling, beats a state-of-the-art GPU that has ∼3X higher peak FLOPs. The second part of this dissertation focuses on reducing the computation cost of CNN training. To reduce computations during training, I use neural network model pruning from the beginning of training. The insight is that a fully trained CNN model contains many non-critical parameters and pruning such parameters during training has only a minor impact on the learning quality. I also choose to structurally prune these parameters to provide high data parallelism avoiding complex data indexing, thus maintaining high compute resource utilization. For the practical implementation of pruning while training, I propose three algorithmic optimizations. Theses optimizations are designed to remove the need for the memory accesses caused by tensor reshaping, reduce the number of training runs in finding the desired pruning hyper-parameters, and maintain high data parallelism even for processing a highly pruned CNN model. Overall, the proposed algorithm speeds up the training of commonly used state-of-the-art image classifiers by 39% with only 1.9% accuracy loss. The third part of this dissertation deals with training pruned CNN models on accelerators with large systolic arrays. I first show my observation that processing structurally-pruned CNN models on a large systolic array severely underutilizes its PEs (processing elements) because the reduced number of channels decreases parallelism. Then, I show that naively splitting a large core into multiple small cores improves PE utilization but decreases input reuse and incurs >4% area overhead. To improve PE utilization and maintain high input reuse, I propose a flexible systolic array architecture that can reconfigure its structure to one of several modes, each designed for efficient execution of CNN layers with different dimensions. I also develop compile-time heuristics that optimize mapping the layer workload to the flexible systolic array resources for both high performance and energy efficiency. My new mechanisms increase PE utilization by 36% compared to a single large-core design and improve training energy efficiency by 18% compared to many-small-core designs. The last part of this dissertation is about developing an accelerator performance model for accurate CNN execution time estimation. For accurate performance modeling, I introduce a memory traffic model that predicts the data traffic at different levels of the GPU memory system hierarchy. This involves an in-depth analysis of the memory access patterns of data-parallel convolution kernels and the spatial locality. I demonstrate that the proposed performance model can provide guideline to fine-tune the GPU resources for efficient CNN performance scaling.
Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines
(2008-12) Mizan, Elias, 1976-; Chiou, Derek
Silicon reliability has reemerged as a very important problem in digital system design. As voltage and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single event upsets, transistor and interconnect aging and circuit variability. In particular, computational functional units are very challenging to protect because current redundant execution techniques have a high power and area overhead, cannot guarantee detection of some errors and cause a substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem. To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal redundancy technique suitable for pipelined computational functional units. SITR is able to detect most temporary errors, is area and energy-efficient and can be easily incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to design very aggressive bypass networks capable of achieving high instruction throughput, by tolerating timing violations. To address the performance degradation caused by redundant execution, this dissertation proposes using a tiled-data ow model of computation because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled-data flow architecture, we enhance the reliability of its datapath, including computational logic, interconnection network and storage structures. Computations are performed speculatively using SITR while traditional information redundancy techniques are used to protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions. Upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled-data flow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted for data flow architectures that enables speculation only when it is likely to benefit performance.
Explicit data graph compilation
(2009-12) Smith, Aaron Lee, 1977-; Burger, Douglas C., Ph. D.; John, Lizy K.; Keckler, Stephen W.; Lin, Calvin; McKinley, Kathryn S.
Technology trends such as growing wire delays, power consumption limits, and diminishing clock rate improvements, present conventional instruction set architectures such as RISC, CISC, and VLIW with difficult challenges. To show continued performance growth, future microprocessors must exploit concurrency power efficiently. An important question for any future system is the division of responsibilities between programmer, compiler, and hardware to discover and exploit concurrency. In this research we develop the first compiler for an Explicit Data Graph Execution (EDGE) architecture and show how to solve the new challenge of compiling to a block-based architecture. In EDGE architectures, the compiler is responsible for partitioning the program into a sequence of structured blocks that logically execute atomically. The EDGE ISA defines the structure of, and the restrictions on, these blocks. The TRIPS prototype processor is an EDGE architecture that employs four restrictions on blocks intended to strike a balance between software and hardware complexity. They are: (1) fixed block sizes (maximum of 128 instructions), (2) restricted number of loads and stores (no more than 32 may issue per block), (3) restricted register accesses (no more than eight reads and eight writes to each of four banks per block), and (4) constant number of block outputs (each block must always generate a constant number of register writes and stores, plus exactly one branch). The challenges addressed in this thesis are twofold. First, we develop the algorithms and internal representations necessary to support the new structural constraints imposed by the block-based EDGE execution model. This first step provides correct execution and demonstrates the feasibility of EDGE compilers. Next, we show how to optimize blocks using a dataflow predication model and provide results showing how the compiler is meeting this challenge on the SPEC2000 benchmarks. Using basic blocks as the baseline performance, we show that optimizations utilizing the dataflow predication model achieve up to 64% speedup on SPEC2000 with an average speedup of 31%.
Fine-grained containment domains for throughput processors
(2015-08) Lee, Ikhwan, Ph. D.; Erez, Mattan; Parker, Mike; Pingali, Keshav; Reddi, Vijay J; Touba, Nur A
Continued scaling of semiconductor technology has made modern processors rely on large design margins to guarantee correct operation under worst case conditions. Design margins appear in the form of higher supply voltage or lower clock frequency, leading to inefficiency. In practice, it is rare to observe such worst-case conditions and the processor can run at a reduced voltage or higher frequency experiencing only few infrequent errors. Recent proposals have used hardware error detectors and recovery mechanisms to detect and re- cover from these rare errors, a technique known as timing speculation. While this is effective for out-of-order processors with inherent capability to recover from misspeculation, implementing similar hardware for throughput processors such as the Graphics Processing Units (GPUs) is prohibitively costly due to the massive amount of thread context that needs to be preserved. Further- more, recovery overhead is much higher since the SIMD (Single Instruction Multiple Data) execution model of GPUs require multiple threads to roll back together in case of an error. In this dissertation, I develop a hardware/software co-design approach to enable reduced-margin operation on GPUs that overcomes the limitations of existing techniques. The proposed scheme leverages the hierarchical programming model of GPUs to provide hierarchical and uncoordinated local checkpoint-recovery. By decomposing a program into a hierarchically nested tree of code blocks which I refer to as containment domains (CDs), the pro- gram becomes amenable to automatic analysis and tuning, and an optimum trade-off can be made between preservation and recovery overhead. To aid this optimization process, an analytical model is developed to estimate the performance efficiency of a given application setting at a given error rate. With the analytical model, an exhaustive search can be performed to find the optimal solution. The tunability also allows the proposed scheme to easily adapt to a wide range of error rates making it future proof against emerging uncertainties in semiconductor design. The proposed scheme combines software and hardware components to achieve the highest efficiency in preservation, restoration, and recovery. The software components include: 1) an API and runtime that lets the programmers describe the hierarchy of containment domains within an application and preserve the state required for rollback recovery, and 2) a compiler analysis that automatically inserts preservation routines for register variables. The hardware components include: 1) a stack structure to keep track of recovery program counters (PC), 2) a set of error containment mechanisms to guarantee that no erroneous data is propagated outside of a containment domain and 3) an error reporting architecture that keeps track of affected threads and initiate recovery of them.
Floating-point fused multiply-add architectures
(2007-05) Quinnell, Eric Charles, 1982-; Swartzlander, Earl E.
This dissertation presents the results of the research, design, and implementations of several new architectures for floating-point fused multiplier-adders used in the x87 units of microprocessors. These new architectures have been designed to provide solutions to the implementation problems found in modern-day fused multiply-add units. The new three-path fused multiply-add architecture shows a 12% reduction in latency and a 15% reduction in power as compared to a classic fused multiplier-adder. The new bridge fused multiply-add architecture presents a design capable of full performance floating-point addition and floating-point multiplication instructions while still providing the functionality and performance gain of a classic fused multiplier-adder. Each new architecture presented as well as a collection of modern floating-point arithmetic units that are used for comparison have been designed and implemented using the Advanced Micro Devices (AMD) 65 nanometer silicon on insulator transistor technology and circuit design toolset. All designs use the AMD ‘Barcelona’ native quadcore standard-cell library as an architectural building block to create and contrast the new architectures in a cutting-edge and realistic industrial technology.