Browsing by Subject "DVFS"

Now showing 1 - 8 of 8

Bottleneck identification and acceleration in multithreaded applications
(2014-12) Joao, José Alberto; Patt, Yale N.
When parallel applications do not fully utilize the cores that are available to them they are missing the opportunity to have better performance. Sometimes threads have to wait for other threads. I call the code segments that make other threads wait bottlenecks. Examples of these bottlenecks include contended critical sections, threads arriving late to barriers and the slowest stage of a pipelined program. Other times all threads are running but some of them, which I call lagging threads, are making less progress, setting the stage to become bottlenecks. My thesis proposes identifying the code segments that are more critical for performance and efficiently accelerating them using faster cores, by either migrating execution to large cores of an Asymmetric Chip Multi-Processor (ACMP) or executing locally on DVFS-accelerated cores. The key contribution of this dissertation is a Utility of Acceleration metric that combines a measure of the acceleration for each code segment with a measure of its criticality. This metric enables meaningful comparisons to decide which bottlenecks or lagging threads to accelerate with each of the available acceleration mechanisms. My evaluation shows significant performance improvement for single multithreaded applications and sets of multiple single- and multi-threaded applications, and also reduction in energy-delay product due to the efficient utilization of the available acceleration mechanisms.
Characterization of smartphone governor strategies and making of a workload aware governor
(2018-05-04) Banerjee, Sarbartha; John, Lizy Kurian
This thesis presents the importance of workload characterization towards governing the operational voltage and frequency of a smartphone processor by running a series of workload on an ARM v8 processor. The idea of finishing a task as fast as possible to return to idle state(race-to-idle) versus the idea of choosing the correct frequency for time deltas(pace-to-idle) is studied in detail. Android governors either statically use a single frequency for the entire active time or determines the voltage and frequency dynamically based on the load average on the processor. Similar load averaging strategies are used for other blocks in SoC (System on Chip) like the GPU or the media processor. However, the different blocks of a SoC draw power from the same current source. Owing to lack of fine-grained workload characterization, the power is redirected to the not-so-important unit providing poor performance and energy efficiency. The behavior of different existing governors is explored by running on a variety of workload and analyze the optimal strategy for energy efficiency satisfying an acceptable user performance. Crucial traits of active user applications are inferred from scheduler to fine tune the optimal voltage and frequency across different blocks under constrained power source to build a system-wide governor.
E³ : energy-efficient EDGE architectures
(2010-08) Govindan, Madhu Sarava; Keckler, Stephen W.; Burger, Douglas C.; McKinley, Kathryn S.; Chiou, Derek; Hunt, Jr., Warren A.; Brooks, David
Increasing power dissipation is one of the most serious challenges facing designers in the microprocessor industry. Power dissipation, increasing wire delays, and increasing design complexity have forced industry to embrace multi-core architectures or chip multiprocessors (CMPs). While CMPs mitigate wire delays and design complexity, they do not directly address single-threaded performance. Additionally, programs must be parallelized, either manually or automatically, to fully exploit the performance of CMPs. Researchers have recently proposed an architecture called Explicit Data Graph Execution (EDGE) as an alternative to conventional CMPs. EDGE architectures are designed to be technology-scalable and to provide good single-threaded performance as well as exploit other types of parallelism including data-level and thread-level parallelism. In this dissertation, we examine the energy efficiency of a specific EDGE architecture called TRIPS Instruction Set Architecture (ISA) and two microarchitectures called TRIPS and TFlex that implement the TRIPS ISA. TRIPS microarchitecture is a first-generation design that proves the feasibility of the TRIPS ISA and distributed tiled microarchitectures. The second-generation TFlex microarchitecture addresses key inefficiencies of the TRIPS microarchitecture by matching the resource needs of applications to a composable hardware substrate. First, we perform a thorough power analysis of the TRIPS microarchitecture. We describe how we develop architectural power models for TRIPS. We then improve power-modeling accuracy using hardware power measurements on the TRIPS prototype combined with detailed Register Transfer Level (RTL) power models from the TRIPS design. Using these refined architectural power models and normalized power modeling methodologies, we perform a detailed performance and power comparison of the TRIPS microarchitecture with two different processors: 1) a low-end processor designed for power efficiency (ARM/XScale) and 2) a high-end superscalar processor designed for high performance (a variant of Power4). This detailed power analysis provides key insights into the advantages and disadvantages of the TRIPS ISA and microarchitecture compared to processors on either end of the performance-power spectrum. Our results indicate that the TRIPS microarchitecture achieves 11.7 times better energy efficiency compared to ARM, and approximately 12% better energy efficiency than Power4, in terms of the Energy-Delay-Squared (ED²) metric. Second, we evaluate the energy efficiency of the TFlex microarchitecture in comparison to TRIPS, ARM, and Power4. TFlex belongs to a class of microarchitectures called Composable Lightweight Processors (CLPs). CLPs are distributed microarchitectures designed with simple cores and are highly configurable at runtime to adapt to resource needs of applications. We develop power models for the TFlex microarchitecture based on the validated TRIPS power models. Our quantitative results indicate that by better matching execution resources to the needs of applications, the composable TFlex system can operate in both regimes of low power (similar to ARM) and high performance (similar to Power4). We also show that the composability feature of TFlex achieves a signification improvement (2 times) in the ED² metric compared to TRIPS. Third, using TFlex as our experimental platform, we examine the efficacy of processor composability as a potential performance-power trade-off mechanism. Most modern processors support a form of dynamic voltage and frequency scaling (DVFS) as a performance-power trade-off mechanism. Since the rate of voltage scaling has slowed significantly in recent process technologies, processor designers are in dire need of alternatives to DVFS. In this dissertation, we explore processor composability as an architectural alternative to DVFS. Through experimental results we show that processor composability achieves almost as good performance-power trade-offs as pure frequency scaling (no changes in supply voltages), and a much better performance-power trade-off compared to voltage and frequency scaling (both supply voltage and frequency change). Next, we explore the effects of additional performance-improving techniques for the TFlex system on its energy efficiency. Researchers have proposed a variety of techniques for improving the performance of the TFlex system. These include: (1) block mapping techniques to trade off intra-block concurrency with communication across the operand network; (2) predicate prediction and (3) operand multi-cast/broadcast mechanism. We examine each of these mechanisms in terms of its effect on the energy efficiency of TFlex, and our experimental results demonstrate the effects of operand communication, and speculation on the energy efficiency of TFlex. Finally, this dissertation evaluates a set of fine-grained power management (FGPM) policies for TFlex: instruction criticality and controlled speculation. These policies rely on a temporally and spatially fine-grained dynamic voltage and frequency scaling (DVFS) mechanism for improving power efficiency. The instruction criticality policy seeks to improve power efficiency by mapping critical computation in a program to higher performance-power levels, and by mapping non-critical computation to lower performance-power levels. Controlled speculation policy, on the other hand, maps blocks that are highly likely to be on correct execution path in a program to higher performance levels, and the other blocks to lower performance levels. Our experimental results indicate that idealized instruction criticality and controlled speculation policies improve the operating range and flexibility of the TFlex system. However, when the actual overheads of fine-grained DVFS, especially energy conversion losses of voltage regulator modules (VRMs), are considered the power efficiency advantages of these idealized policies quickly diminish. Our results also indicate that the current conversion efficiencies of on-chip VRMs need to improve to as high as 95% for the realistic policies to be feasible.
Integrated temperature sensors in deep sub-micron CMOS technologies
(2014-05) Chowdhury, Golam Rasul; Hassibi, Arjang
Integrated temperature sensors play an important role in enhancing the performance of on-chip power and thermal management systems in today's highly-integrated system-on-chip (SoC) platforms, such as microprocessors. Accurate on-chip temperature measurement is essential to maximize the performance and reliability of these SoCs. However, due to non-uniform power consumption by different functional blocks, microprocessors have fairly large thermal gradient (and variation) across their chips. In the case of multi-core microprocessors for example, there are task-specific thermal gradients across different cores on the same die. As a result, multiple temperature sensors are needed to measure the temperature profile at all relevant coordinates of the chip. Subsequently, the results of the temperature measurements are used to take corrective measures to enhance the performance, or save the SoC from catastrophic over-heating situations which can cause permanent damage. Furthermore, in a large multi-core microprocessor, it is also imperative to continuously monitor potential hot-spots that are prone to thermal runaway. The locations of such hot spots depend on the operations and instruction the processor carries out at a given time. Due to practical limitations, it is an overkill to place a big size temperature sensor nearest to all possible hot spots. Thus, an ideal on-chip temperature sensor should have minimal area so that it can be placed non-invasively across the chip without drastically changing the chip floor plan. In addition, the power consumption of the sensors should be very low to reduce the power budget overhead of thermal monitoring system, and to minimize measurement inaccuracies due to self-heating. The objective of this research is to design an ultra-small size and ultra-low power temperature sensor such that it can be placed in the intimate proximity of all possible hot spots across the chip. The general idea is to use the leakage current of a reverse-bias p-n junction diode as an operand for temperature sensing. The tasks within this project are to examine the theoretical aspect of such sensors in both Silicon-On-Insulator (SOI), and bulk Complementary Metal-Oxide Semiconductor (CMOS) technologies, implement them in deep sub-micron technologies, and ultimately evaluate their performances, and compare them to existing solutions.
Microprocessor power management and a stand-alone benchmarking application for Android based platforms
(2011-12) Yeager, Hans L.; Aziz, Adnan; Gerstlauer, Andreas
Components used in mobile hand-held devices (smart phones and tablets) vary greatly in performance and power consumption. The microprocessors used in these devices also have vastly different capabilities and manufacturing limitations leading to significant variation effects. Battery life is a significant concern to the end users of these products. A stand-alone Android application capable of benchmarking a device's performance and power consumption is introduced. The application does not require the end user to have any analytic equipment or to have a technical background. This enables individual end users to better understand their particular device's performance and battery life interaction. They may also use the application to determine if their device's performance or battery life has degraded over time. Data is also uploaded to a central location so that devices can be compared against each other. The benchmarking application is capable of resolving variation effects caused by device, environmental changes and power management actions. This application demonstrates the feasibility of creating a low cost ecosystem where thousands of devices can be quantitatively compared.
Performance prediction for dynamic voltage and frequency scaling
(2014-08) Miftakhutdinov, Rustam Raisovich; Patt, Yale N.
This dissertation proves the feasibility of accurate runtime prediction of processor performance under frequency scaling. The performance predictors developed in this dissertation allow processors capable of dynamic voltage and frequency scaling (DVFS) to improve their performance or energy efficiency by dynamically adapting chip or core voltages and frequencies to workload characteristics. The dissertation considers three processor configurations: the uniprocessor capable of chip-level DVFS, the private cache chip multiprocessor capable of per-core DVFS, and the shared cache chip multiprocessor capable of per-core DVFS. Depending on processor configuration, the presented performance predictors help the processor realize 72–85% of average oracle performance or energy efficiency gains.
Predictive power management for multi-core processors
(2010-12) Bircher, William Lloyd; John, Lizy Kurian; Erez, Mattan; Keckler, Steve; Lefurgy, Charles; Moon, Tess; Pan, David
Energy consumption by computing systems is rapidly increasing due to the growth of data centers and pervasive computing. In 2006 data center energy usage in the United States reached 61 billion kilowatt-hours (KWh) at an annual cost of 4.5 billion USD [Pl08]. It is projected to reach 100 billion KWh by 2011 at a cost of 7.4 billion USD. The nature of energy usage in these systems provides an opportunity to reduce consumption. Specifically, the power and performance demand of computing systems vary widely in time and across workloads. This has led to the design of dynamically adaptive or power managed systems. At runtime, these systems can be reconfigured to provide optimal performance and power capacity to match workload demand. This causes the system to frequently be over or under provisioned. Similarly, the power demand of the system is difficult to account for. The aggregate power consumption of a system is composed of many heterogeneous systems, each with a unique power consumption characteristic. This research addresses the problem of when to apply dynamic power management in multi-core processors by accounting for and predicting power and performance demand at the core-level. By tracking performance events at the processor core or thread-level, power consumption can be accounted for at each of the major components of the computing system through empirical, power models. This also provides accounting for individual components within a shared resource such as a power plane or top-level cache. This view of the system exposes the fundamental performance and power phase behavior, thus making prediction possible. This dissertation also presents an extensive analysis of complete system power accounting for systems and workloads ranging from servers to desktops and laptops. The analysis leads to the development of a simple, effective prediction scheme for controlling power adaptations. The proposed Periodic Power Phase Predictor (PPPP) identifies patterns of activity in multi-core systems and predicts transitions between activity levels. This predictor is shown to increase performance and reduce power consumption compared to reactive, commercial power management schemes by achieving higher average frequency in active phases and lower average frequency in idle phases.
QoS-aware mechanisms for improving cost-efficiency of datacenters
(2018-05) Zhu, Haishan; Erez, Mattan; Pingali, Keshav; Chang, Jichuan; de Veciana, Gustavo; Tiwari, Mohit
Warehouse Scale Computers (WSCs) promise high cost-efficiency by amortizing power, cooling, and management overheads. WSCs today host a large variety of jobs with two broad performance requirements categories: latency-critical (LC) and best-effort (BE). Ideally, to fully utilize all hardware resources, WSC operators can simply fill all the nodes with computing jobs. Unfortunately, because colocated jobs contend for shared resources, systems with high loads often experience performance degradation, which negatively impacts the Quality of Service (QoS) for LC jobs. In fact, service providers usually over-provision resources to avoid any interference with LC jobs, leading to significant resource inefficiencies. In this dissertation, I explore opportunities across different system-abstraction layers to improve the cost-efficiency of dataceters by increasing resource utilization of WSCs with little or no impact on the performance of LC jobs. The dissertation has three main components. First, I explore opportunities to improve the throughput of multicore systems by reducing the performance variation of LC jobs. The main insight is that by reshaping the latency distribution curve, performance headroom of LC jobs can be effectively converted to improved BE throughput. I develop, implement, and evaluate a runtime system that achieves this goal with existing hardware. I leverage the cache partitioning, per-core frequency scaling, and thread masking of server processors. Evaluation results show the proposed solution enables 30% higher system throughput compared to solutions proposed in prior works while maintaining at least as good QoS for LC jobs. Second, I study resource contention in near-future heterogeneous memory architectures (HMA). This study is motivated by recent developments in non-volatile memory (NVM) technologies, which enable higher storage density at the cost of same performance. To understand the performance and QoS impact of HMAs, I design and implement a performance emulator in the Linux kernel that runs unmodified workloads with high accuracy, low overhead, and complete transparency. I further propose and evaluate multiple data and resource management QoS mechanisms, such as locality-aware page admission, occupancy management, and write buffer jailing. Third, I focus on accelerated machine learning (ML) systems. By profiling the performance of production workloads and accelerators, I show that accelerated ML tasks are highly sensitive to main memory interference due to fine-grained interaction between CPU and accelerator tasks. As a result, memory resource contention can significantly decreases the performance and efficiency gains of accelerators. I propose a runtime system that leverages existing hardware capabilities and show 17% higher system efficiency compared to previous approaches. This study further exposes opportunities for future processor architectures