Browsing by Subject "Multicore"

Now showing 1 - 6 of 6

Bottleneck identification and acceleration in multithreaded applications
(2014-12) Joao, José Alberto; Patt, Yale N.
When parallel applications do not fully utilize the cores that are available to them they are missing the opportunity to have better performance. Sometimes threads have to wait for other threads. I call the code segments that make other threads wait bottlenecks. Examples of these bottlenecks include contended critical sections, threads arriving late to barriers and the slowest stage of a pipelined program. Other times all threads are running but some of them, which I call lagging threads, are making less progress, setting the stage to become bottlenecks. My thesis proposes identifying the code segments that are more critical for performance and efficiently accelerating them using faster cores, by either migrating execution to large cores of an Asymmetric Chip Multi-Processor (ACMP) or executing locally on DVFS-accelerated cores. The key contribution of this dissertation is a Utility of Acceleration metric that combines a measure of the acceleration for each code segment with a measure of its criticality. This metric enables meaningful comparisons to decide which bottlenecks or lagging threads to accelerate with each of the available acceleration mechanisms. My evaluation shows significant performance improvement for single multithreaded applications and sets of multiple single- and multi-threaded applications, and also reduction in energy-delay product due to the efficient utilization of the available acceleration mechanisms.
Efficient execution of sequential applications on multicore systems
(2011-08) Robatmili, Behnam; McKinley, Kathryn S.; Burger, Douglas C., Ph. D.; Keckler, Stephen W.; Lin, Calvin; Reinhardt, Steve
Conventional CMOS scaling has been the engine of the technology revolution in most application domains. This trend has changed as in each technology generation, transistor densities continue to increase while due to the limits on threshold voltage scaling, per-transistor energy consumption decreases much more slowly than in the past. The power scaling issues will restrict the adaptability of designs to operate in different power and performance regimes. Consequently, future systems must employ more efficient architectures for optimizing every thread in the program across different power and performance regimes, rather than architectures that utilize more transistors. One solution is composable or dynamic multicore architectures that can span a wide range of energy/performance operating points by enabling multiple simple cores to compose to form a larger and more powerful core. Explicit Data Graph Execution (EDGE) architectures represent a highly scalable class of composable processors that exploit predicated dataflow block execution and distributed microarchitectures. However, prior EDGE architectures suffer from several energy and performance bottlenecks including expensive intra-block operand communication due to fine-grain instruction distribution among cores, the compiler-generated fanout trees built for high-fanout operand delivery, poor next-block prediction accuracy, and low speculation rates due to predicates and expensive refills after pipeline flushes. To design an energy-efficient and flexible dynamic multicore, this dissertation employs a systematic methodology that detects inefficiencies and then designs and evaluates solutions that maximize power and performance efficiency across different power and performance regimes. Some innovations and optimization techniques include: (a) Deep Block Mapping extracts more coarse-grained parallelism and reduces cross-core operand network traffic by mapping each block of instructions into the instruction queue of one core instead of distributing blocks across all composed cores as done in previous EDGE designs, (b) Iterative Path Predictor (IPP) reduces branch and predication overheads by unifying multi-exit block target prediction and predicate path prediction while providing improved accuracy for each, (c) Register Bypassing reduces cross-core register communication delays by bypassing register values predicted to be critical directly from producing to consuming cores, (d) Block Reissue reduces pipeline flush penalties by reissuing instructions in previously executed instances of blocks while they are still in the instruction queue, and (e) Exposed Operand Broadcasts (EOBs) reduce wide-fanout instruction overheads by extending the ISA to employ architecturally exposed low-overhead broadcasts combined with dataflow for efficient operand delivery for both high- and low-fanout instructions. These components form the basis for a third-generation EDGE microarchitecture called T3. T3 improves energy efficiency by about 2x and performance by 47% compared to previous EDGE architectures. T3 also performs in a highly power efficient manner across a wide spectrum of energy and performance operating points (low-power to high-performance), extending the domain of power/performance trade-offs beyond what dynamic voltage and frequency scaling offers on state-of-the-art conventional processors. This high level of flexibility and power efficiency makes T3 an attractive candidate for future systems which need to operate on a wide range of workloads under varying power and performance constraints.
Exploiting hardware heterogeneity and parallelism for performance and energy efficiency of managed languages
(2015-12) Jibaja, Ivan; Witchel, Emmett; McKinley, Kathryn S.; Blackburn, Stephen M; Batory, Don; Lin, Calvin
On the software side, managed languages and their workloads are ubiquitous, executing on mobile, desktop, and server hardware. Managed languages boost the productivity of programmers by abstracting away the hardware using virtual machine technology. On the hardware side, modern hardware increasingly exploits parallelism to boost energy efficiency and performance with homogeneous cores, heterogenous cores, graphics processing units (GPUs), and vector instructions. Two major forms of parallelism are: task parallelism on different cores and vector instructions for data parallelism. With task parallelism, the hardware allows simultaneous execution of multiple instruction pipelines through multiple cores. With data parallelism, one core can perform the same instruction on multiple pieces of data. Furthermore, we expect hardware parallelism to continue to evolve and provide more heterogeneity. Existing programming language runtimes must continuously evolve so programmers and their workloads may efficiently utilize this evolving hardware for better performance and energy efficiency. However, efficiently exploiting hardware parallelism is at odds with programmer productivity, which seeks to abstract hardware details. My thesis is that managed language systems should and can abstract hardware parallelism with modest to no burden on developers to achieve high performance, energy efficiency, and portability on ever evolving parallel hardware. In particular, this thesis explores how the runtime can optimize and abstract heterogenous parallel hardware and how the compiler can exploit data parallelism with new high-level languages abstractions with a minimal burden on developers. We explore solutions from multiple levels of abstraction for different types of hardware parallelism. (1) For asymmetric multicore processors (AMP) which have been recently introduced, we design and implement an application scheduler in the Java virtual machine (JVM) that requires no changes to existing Java applications. The scheduler uses feedback from dynamic analyses that automatically identify critical threads and classifies application parallelism. Our scheduler automatically accelerates critical threads, honors thread priorities, considers core availability and thread sensitivity, and load balances scalable parallel threads on big and small cores to improve the average performance by 20% and energy efficiency by 9% on frequency-scaled AMP hardware for scalable, non-scalable, and sequential workloads over prior research and existing schedulers. (2) To exploit vector instructions, we design SIMD.js, a portable single instruction multiple data (SIMD) language extension for JavaScript (JS), and implement its compiler support that together add fine-grain data parallelism to JS. Our design principles seek portability, scalable performance across various SIMD hardware implementations, performance neutral without SIMD hardware, and compiler simplicity to ease vendor adoption on multiple browsers. We introduce type speculation, compiler optimizations, and code generation that convert high-level JS SIMD operations into minimal numbers of SIMD native instructions. Finally, to accomplish wide adoption of our portable SIMD language extension, we explore, analyze, and discuss the trade-offs of four different approaches that provide the functionality of SIMD.js when vector instructions are not supported by the hardware. SIMD.js delivers an average performance improvement of 3.3× on micro benchmarks and key graphic algorithms on various hardware platforms, browsers, and operating systems. These language extension and compiler technologies are in the final approval process to be included in the JavaScript standards. This thesis shows using virtual machine technologies protects programmers from the underlying details of hardware parallelism, achieves portability, and improves performance and energy efficiency.
Galois : a system for parallel execution of irregular algorithms
(2015-05) Nguyen, Donald Do; Pingali, Keshav; Alvisi, Lorenzo; Lethin, Richard; Padua, David; Witchel, Emmett
A programming model which allows users to program with high productivity and which produces high performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that scheduling of a program can be decoupled from the core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling also are described. An evaluation of a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often orders of magnitude more, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.
Performance prediction for dynamic voltage and frequency scaling
(2014-08) Miftakhutdinov, Rustam Raisovich; Patt, Yale N.
This dissertation proves the feasibility of accurate runtime prediction of processor performance under frequency scaling. The performance predictors developed in this dissertation allow processors capable of dynamic voltage and frequency scaling (DVFS) to improve their performance or energy efficiency by dynamically adapting chip or core voltages and frequencies to workload characteristics. The dissertation considers three processor configurations: the uniprocessor capable of chip-level DVFS, the private cache chip multiprocessor capable of per-core DVFS, and the shared cache chip multiprocessor capable of per-core DVFS. Depending on processor configuration, the presented performance predictors help the processor realize 72–85% of average oracle performance or energy efficiency gains.
Scaling managed runtime systems for future multicore hardware
(2009-12) Ha, Jung Woo; McKinley, Kathryn S.; Arnold, Matthew; Blackburn, Stephen M.; Keckler, Stephen W.; Witchel, Emmett
The exponential improvement in single processor performance has recently come to an end, mainly because clock frequency has reached its limit due to power constraints. Thus, processor manufacturers are choosing to enhance computing capabilities by placing multiple cores into a single chip, which can improve performance given parallel software. This paradigm shift to chip multiprocessors (also called multicore) requires scalable parallel applications that execute tasks on each core, otherwise the additional cores are worthless. Making an application scalable requires more than simply parallelizing the application code itself. Modern applications are written in managed languages, which require automatic memory management, type and memory abstractions, dynamic analysis and just-in-time (JIT) compilation. These managed runtime systems monitor and interact frequently with the executing application. Hence, the managed runtime itself must be scalable, and the instrumentation that monitors the application should not perturb its scalability. While multicore hardware forces a redesign of managed runtimes for scalability, it also provides opportunities when applications do not fully utilize all of the cores. Using available cores for concurrent helper threads that enhance the software, with debugging, security, and software support will make the runtime itself more capable and more scalable. This dissertation presents two novel techniques that improve the scalability of managed runtimes by utilizing unused cores. The first technique is a concurrent dynamic analysis framework that provides a low-overhead buffering mechanism called Cache-friendly Asymmetric Buffering (CAB) that quickly offloads data from the application to helper threads that perform specific dynamic analyses. Our framework minimizes application instrumentation overhead, prevents microarchitectural side-effects, and supports a variety of dynamic analysis clients, ranging from call graph and path profiling to cache simulation. The use of this framework ensures that helper threads perturb the performance of application as little as possible. Our second technique is concurrent trace-based just-in-time compilation, which exploits available cores for the JavaScript runtime. The JavaScript language limits applications to a single-thread, so extra cores are worthless unless they are used by the runtime components. We redesigned a production trace-based JIT compiler to run concurrently with the interpreter, and our technique is the first to improve both responsiveness and throughput in a trace-based JIT compiler. This thesis presents the design and implementation of both techniques and shows that they improve scalability and core utilization when running applications in managed runtimes. Industry is already adopting our approaches, which demonstrates the urgency of the scalable runtime problem and the utility of these techniques.