Browsing by Subject "QoS"
Now showing 1 - 6 of 6
Item: Core-characteristic-aware off-chip memory management in a multicore system-on-chip (2012-12). Jeong, Min Kyu; Erez, Mattan; John, Lizy K.; Chiou, Derek; Lin, Calvin; Schulte, Michael J.
Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth, which is shared among those many cores, however, scales more slowly than the transistor (and core) count. As a result, in many future systems, off-chip bandwidth will become the bottleneck under heavy demand from multiple cores. Optimally managing the limited off-chip bandwidth is therefore critical to achieving high performance and efficiency in future systems. In this dissertation, I develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip multiprocessors. I focus on issues that arise from sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirement, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can reduce spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for each thread's reduced bank-level parallelism by employing memory sub-ranking to effectively increase the number of independent banks. For three workload groups consisting of benchmarks with high spatial locality, low spatial locality, and mixes of the two, average system efficiency improves by 10%, 7%, and 9% for 2-rank systems, and by 18%, 25%, and 20% for 1-rank systems, respectively, over the baseline shared-bank system.
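The bank-partitioning idea described above can be sketched in a few lines. This is an illustrative toy model, not the dissertation's implementation: the bank count, sub-rank count, and modulo assignment scheme are all assumptions chosen for clarity.

```python
# Toy sketch of bank partitioning plus memory sub-ranking (assumed parameters,
# not the dissertation's design). Each core gets a disjoint set of DRAM banks
# to isolate its access stream; sub-ranking then multiplies the number of
# independently addressable banks each core can exploit.

NUM_BANKS = 8        # banks per rank (assumed)
NUM_SUBRANKS = 4     # sub-ranks per rank (assumed)

def partition_banks(num_cores, num_banks=NUM_BANKS):
    """Assign each core a disjoint set of banks via a simple modulo mapping."""
    return {core: [b for b in range(num_banks) if b % num_cores == core]
            for core in range(num_cores)}

def effective_banks(banks_per_core, subranks=NUM_SUBRANKS):
    """Sub-ranking splits a rank into independently addressable sub-ranks,
    recovering bank-level parallelism lost to partitioning."""
    return banks_per_core * subranks

partition = partition_banks(num_cores=4)
per_core = len(partition[0])   # 2 banks per core after partitioning
print(per_core, effective_banks(per_core))  # 2 independent banks -> 8 with sub-ranking
```

With four cores sharing eight banks, partitioning leaves each core only two banks, but four-way sub-ranking restores eight effectively independent banks per core.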
Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. I show that statically prioritizing the CPU can severely constrict GPU performance, and I propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of the GPU workload. The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination, whatever its complexity.

Item: Exploring tradeoffs in wireless networks under flow-level traffic: energy, capacity and QoS (2009-12). Kim, Hongseok; De Veciana, Gustavo
Wireless resources are scarce, shared, and time-varying, making resource allocation mechanisms, e.g., scheduling, a key and challenging element of wireless system design. In designing good schedulers, we consider three types of performance metrics: system capacity, the quality of service (QoS) seen by users, and the energy expenditures (battery lifetimes) incurred by mobile terminals. In this dissertation we investigate the impact of scheduling policies on these performance metrics, their interactions and tradeoffs, focusing specifically on flow-level performance under stochastic traffic loads. In the first part of the dissertation we evaluate interactions among flow-level performance metrics when integrating QoS and best-effort flows in a wireless system using opportunistic scheduling. We introduce a simple flow-level model capturing the salient features of bandwidth sharing for an opportunistic scheduler that ensures a mean throughput to each QoS stream on every time slot.
We show that integrating QoS and best-effort flows results in a loss of opportunism, which in turn shrinks the stability region, degrades system capacity, and increases file transfer delay. In the second part of the dissertation we study several ways in which mobile terminals can back off on their uplink transmit power (thus slowing their transmissions) in order to extend battery lifetimes. This is particularly effective when a wireless system is underloaded, so the degradation in users' perceived performance can be negligible. The challenge, however, is developing a mechanism that achieves a good tradeoff among transmit power, idling/circuit power, and the performance customers see. We consider systems with flow-level dynamics supporting either real-time or best-effort (e.g., file transfer) sessions. We show that significant energy savings can be achieved by leveraging dynamic spare capacity. We then extend our study to the case where mobile terminals have multiple transmit antennas. In the third part of the dissertation we develop a framework for user association in infrastructure-based wireless networks, specifically focused on adaptively balancing flow loads given spatially inhomogeneous traffic distributions. Our work encompasses several possible user association objective functions, yielding rate-optimal, throughput-optimal, delay-optimal, and load-equalizing policies, which we collectively denote α-optimal user association. We prove that the optimal load vector minimizing this objective is the fixed point of a certain mapping. Based on this mapping we propose an iterative distributed user association policy and prove that it converges to the globally optimal decision in steady state.
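The iterative fixed-point flavor of such a policy can be sketched as follows. This is a hedged illustration, not the dissertation's algorithm: the utility form rate · (1 − load)^α, the function names, and the convergence loop are all assumptions made for the sake of a concrete example.

```python
# Hedged sketch of an iterative, alpha-optimal-style user association
# (the exact cost form and update rule are assumptions). Each user picks the
# base station (BS) maximizing rate * (1 - load)^alpha; BS loads are then
# recomputed, and the process repeats toward a fixed point of the load vector.

def associate(rates, demands, alpha=1.0, iters=50):
    """rates[u][b]: achievable rate of user u at BS b; demands[u]: traffic load."""
    num_bs = len(rates[0])
    loads = [0.0] * num_bs
    choice = []
    for _ in range(iters):
        new_loads = [0.0] * num_bs
        choice = []
        for u, r in enumerate(rates):
            # utility of joining BS b, given the current load estimates
            b = max(range(num_bs),
                    key=lambda b: r[b] * max(1.0 - loads[b], 1e-9) ** alpha)
            choice.append(b)
            new_loads[b] += demands[u] / r[b]   # fraction of BS time consumed
        loads = new_loads
    return choice, loads

# Two users, each much closer to a different BS: each associates with its
# nearest BS and the load vector settles immediately.
choice, loads = associate(rates=[[10.0, 5.0], [5.0, 10.0]], demands=[1.0, 1.0])
```

Here α trades off the objectives named above: α = 0 is rate-optimal, larger α weighs BS load more heavily, pushing toward delay-optimal and load-equalizing behavior.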
In addition, we address admission control policies for the case where the system cannot be stabilized.

Item: Power-aware processor system design (2020-05). Kalyanam, Vijay Kiran; Abraham, Jacob A.; Orshansky, Michael; Pan, David; Touba, Nur; Tupuri, Raghuram
With everyday advances in technology and low-cost economics, processor systems are moving toward split-grid shared power delivery networks (PDNs) while providing increased functionality and higher performance, resulting in increased power consumption. A split grid divides the power grid resources among various homogeneous and heterogeneous functional modules and processors; when the PDN is common across multiple processors and function blocks, it is called a shared PDN. To keep power under control on a split-grid shared PDN, the processor system must operate correctly as various hardware modules interact with each other while the supply voltage (V_DD) and clock frequency (F_CLK) are scaled. Software- or hardware-assisted power collapse and low-power retention modes can be engaged automatically in the processor system. The processor system should also operate at maximum performance under power constraints while consuming the full thermal design power (TDP). It should violate neither board and card current limits nor the power management integrated circuit (PMIC) limits or its slew-rate requirements for current draw on the shared PDN. It is expected to operate within thermal limits below a set operating temperature, to detect and mitigate current violations within microseconds and temperature violations within milliseconds, and to be robust enough to tolerate voltage droops. These requirements are all the more important because the processor system sits on a shared PDN.
Because the PDN is shared, the voltage droop mitigation scheme must be quick and must suppress V_DD droop propagation at the source while introducing only negligible performance penalties. Without a solution for V_DD droop in place, the entire V_DD of the shared PDN is forced to a higher voltage, increasing overall system power. This can reduce the days of use (DoU) of battery-operated systems and affect the reliability and cooling of wired systems. A multi-threaded processor system is expected to monitor current, power, and voltage violations and react quickly without affecting the performance of its hardware threads, while maintaining quality of service (QoS). Early high-level power estimates are a necessity for projecting how much power a future processor system will consume. These power projections are used to plan software use cases and to reassign the power domains of processors and function blocks belonging to the shared PDN. They also inform board and power-card redesign, PDN re-implementation, PMIC changes, and planning for additional power, current, voltage, and temperature violation mitigation schemes if the existing solutions are insufficient. The split-grid shared PDN implemented in a system-on-chip (SoC) is driven by low-cost electronics and requires multiple voltage rails for better energy efficiency. Supporting this requires incorporating voltage levels and power states into the processor's behavioral register transfer level (RTL) model, and low-power verification is a must in a split-grid PDN. To facilitate this, the RTL is annotated with voltage supplies and isolation circuits that engage and protect during power-collapse scenarios across the various voltage domains. The power-aware RTL design is verified, and low-power circuit and RTL bugs are identified and corrected prior to tape-out.
The mandatory features that limit current, power, voltage, and temperature in these high-performance, power-hungry processor systems create a need for high-level power projections that account for the various split-grid PDNs supplying V_DD to the processor, the interface bus, various function blocks, and co-processors. To solve this problem, a power prediction solution is provided that achieves an average-power prediction error of 8% and tracks instantaneous power with reasonable accuracy for unseen software application traces. Computing power with the generated prediction model is 100,000X faster and uses 100X less compute memory than a commercial electronic design automation (EDA) RTL power tool. This solution is also applied to generate a digital power meter (DPM) in hardware for real-time power estimates while the processor is operational. These high-level power estimates project the potential peak currents in these processor systems, which in turn created a need for new silicon-validated tests that functionally stress the split-grid shared PDN under extreme voltage droop and sustained high-current scenarios. For this reason, functional test sequences are created for high-power and voltage stress testing of multi-threaded processors. The PDN is a complex system and needs different functional test sequences to generate the various kinds of high- and low-power instruction packets that can stress it. These voltage droop stress tests affect V_MIN margins in various voltage and frequency modes of operation in a commercial multi-threaded processor. The results underscore the need for voltage mitigation solutions: a processor system operating on a split-grid shared PDN can have its V_MIN increased by voltage stress tests or by a power-virus software application.
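A DPM-style power prediction model of the kind described above is typically a simple function of activity counters. The sketch below is an assumed structure, not the dissertation's model: it fits linear weights from counter samples against reference power (as an RTL power tool might report) and then predicts power cheaply at runtime.

```python
# Illustrative sketch (assumed structure): fit a linear power model
# power ~ sum(w_i * counter_i) by ordinary least squares, then use the
# weights as a fast software "digital power meter".

def fit_weights(counters, ref_power):
    """Solve (X^T X) w = X^T y for a small feature count via Gaussian elimination.
    counters: list of per-sample activity-counter vectors; ref_power: measured power."""
    n = len(counters[0])
    xtx = [[sum(r[i] * r[j] for r in counters) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * p for r, p in zip(counters, ref_power)) for i in range(n)]
    for col in range(n):                       # forward elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, n):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, n):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    w = [0.0] * n                              # back substitution
    for r in range(n - 1, -1, -1):
        w[r] = (xty[r] - sum(xtx[r][c] * w[c] for c in range(r + 1, n))) / xtx[r][r]
    return w

def predict_power(weights, counter_sample):
    return sum(w * c for w, c in zip(weights, counter_sample))

# Toy training data: two counters whose true weights are 2.0 and 3.0.
w = fit_weights([[1, 0], [0, 1], [1, 1]], [2.0, 3.0, 5.0])
```

Evaluating such a model is a handful of multiply-adds per sample, which is consistent with the large speed and memory advantage over full RTL power analysis reported above.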
The shared PDN imposes requirements to mitigate voltage noise at the source and avoid any increase to the shared PDN V_DD. This necessitates a proactive system that can mitigate voltage droop before it occurs while lowering the processor's minimum operating voltage (V_MIN) to help reduce system power. To mitigate voltage droops, a proactive clock gating system (PCGS) is implemented with a voltage clock gate (VCG) circuit that uses a digital power meter (DPM) and a model of the PDN to predict voltage droop before it occurs. Silicon results show that PCGS achieves 10% higher clock frequency (F_CLK) and 5% lower supply voltage (V_DD) in a 7nm processor. Questions arise about the effectiveness of PCGS relative to a reactive voltage droop mitigation scheme in the context of a shared PDN, motivating an analysis comparing the two. This work shows the importance of voltage droop mitigation reaction time for a split-grid shared PDN and highlights the ability of PCGS to provide a better V_MIN for the entire split-grid shared PDN. Silicon results from power-stress tests show that a high-power processor system can exceed board or power-supply card current capacity and violate thermal limits. This requires a limiting system that can adapt processor performance: it must meet a stringent 1 µs system latency for sustained peak-current violations, react on the order of milliseconds for thermal mitigation, and maintain the desired quality of service (QoS) of the multi-threaded processor. This results in the implementation of a current and temperature limiting response circuit in a 7nm commercial processor.
The randomized pulse modulation (RPM) circuit adapts processor performance, reduces current violations in the system within 1 µs, and maintains thread fairness with a 0.4% performance resolution across a wide range of operation, from 100% down to 0.4%. Hard requirements from SoC software and hardware demand that the processor systems sharing the split-grid PDN stay within the TDP and power budgets. New threads (processors) with added functionality can consume much more power than their previous-generation counterparts. The threads (processors) operate cohesively in a multi-threaded processor system, and although there is a large difference in the magnitude of power profiles across threads (processors), the overall performance of the multi-threaded processor is not expected to be compromised. This drives the need for a power limiting system that can specifically slow down high-power threads (processors) to meet power budgets without affecting the performance of low-power threads. For this reason, a thread-specific multi-thread power limiting (MTPL) mechanism is designed that monitors processor power consumption using per-thread DPMs (PTDPMs). Implemented in a 7nm commercial processor, silicon results demonstrate that the thread-specific MTPL does not affect the performance of low-power threads during power limiting until the current (power) is limited to very low values. For high-power threads and during more aggressive current (power) limiting, the thread-specific MTPL performs similarly to a conventional global limiting mechanism. Thus, the thread-specific MTPL enables the multi-threaded processor system to operate at higher overall performance than a conventional global mechanism across most of the power budget range.
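The thread-specific limiting policy described above can be sketched at a high level. This is a hedged toy model, not the MTPL hardware: function names, the duty-cycle abstraction, and the highest-power-first throttling order are assumptions for illustration.

```python
# Hedged sketch of thread-specific power limiting: per-thread power-meter
# readings feed a limiter that throttles only the threads responsible for
# exceeding the budget, leaving low-power threads at full duty cycle.
# (Names, units, and the throttling order are assumptions.)

def mtpl_duty_cycles(thread_power, budget):
    """Return per-thread duty cycles in [0, 1]. Threads are throttled in
    descending power order until the estimated total fits the budget."""
    duty = {t: 1.0 for t in thread_power}
    total = sum(thread_power.values())
    if total <= budget:
        return duty                       # under budget: nobody is slowed
    excess = total - budget
    for t in sorted(thread_power, key=thread_power.get, reverse=True):
        cut = min(excess, thread_power[t])  # power removable from this thread
        duty[t] = (thread_power[t] - cut) / thread_power[t]
        excess -= cut
        if excess <= 0:
            break
    return duty

# A 5 W thread and a 1 W thread under a 4 W budget: only the hot thread
# is throttled, and the cool thread keeps full performance.
duty = mtpl_duty_cycles({"t0": 5.0, "t1": 1.0}, budget=4.0)
```

Contrast this with a global limiter, which would scale both threads down uniformly and so penalize the low-power thread for the high-power thread's consumption.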
For the same power budget, processor performance can be up to 25% higher using the thread-specific MTPL compared to a global power limiting scheme. In summary, this dissertation presents design-for-power concepts for a processor system on a split-grid shared PDN through solutions that address the challenges of high-power processors and help alleviate potential problems. These solutions range from embedding power intent, to incorporating voltage droop prediction intelligence through power usage estimation, to maintaining quality of service within a stringent system latency, to slowing down specific high-power threads of a multi-threaded processor. All these methods can work cohesively to incorporate power awareness into processor systems, making them energy efficient and able to operate reliably within the TDP.

Item: QoS and efficiency for FaaS platforms (2019-05-10). Kumar, Pranav; Tiwari, Mohit
Serverless computing, or function-as-a-service (FaaS), provides a way to write applications composed of scalable, manageable, independent tasks that communicate seamlessly without developer involvement. Strict performance guarantees, or service-level agreements (SLAs), provided by cloud vendors demand predictable performance from serverless applications. Performance predictability in a datacenter environment suffers due to contention for hardware resources. In this study, we evaluate the effects of contention on two FaaS platforms: AWS Lambda, an industry leader in serverless, and the open-source OpenFaaS serverless stack. We develop a set of microbenchmarks as well as end-to-end applications composed of multiple functions as a benchmark suite to facilitate our study. We quantify the baseline system costs of these applications across both stacks under traditional orchestration mechanisms in an isolated system, and we quantify the same with co-located workloads in a datacenter-like setting with Kubernetes orchestration.
We show, via experiments, that significant performance slack exists at low to moderate loads and that we can intelligently colocate workloads to maximize hardware utilization while still meeting QoS target latencies. Finally, we present a contention-aware static scheduling solution for FaaS platforms with predictable performance and compare it to static versions of baseline related works. We find that an intelligent FaaS orchestrator can be built along similar lines (similar hardware-level features) as a microservices one.

Item: QoS-aware mechanisms for improving cost-efficiency of datacenters (2018-05). Zhu, Haishan; Erez, Mattan; Pingali, Keshav; Chang, Jichuan; de Veciana, Gustavo; Tiwari, Mohit
Warehouse-scale computers (WSCs) promise high cost-efficiency by amortizing power, cooling, and management overheads. WSCs today host a large variety of jobs with two broad categories of performance requirements: latency-critical (LC) and best-effort (BE). Ideally, to fully utilize all hardware resources, WSC operators could simply fill all nodes with computing jobs. Unfortunately, because colocated jobs contend for shared resources, systems under high load often experience performance degradation, which negatively impacts the quality of service (QoS) of LC jobs. In fact, service providers usually over-provision resources to avoid any interference with LC jobs, leading to significant resource inefficiency. In this dissertation, I explore opportunities across different system-abstraction layers to improve the cost-efficiency of datacenters by increasing the resource utilization of WSCs with little or no impact on the performance of LC jobs. The dissertation has three main components. First, I explore opportunities to improve the throughput of multicore systems by reducing the performance variation of LC jobs. The main insight is that by reshaping the latency distribution curve, the performance headroom of LC jobs can be effectively converted into improved BE throughput.
I develop, implement, and evaluate a runtime system that achieves this goal with existing hardware, leveraging the cache partitioning, per-core frequency scaling, and thread masking of server processors. Evaluation results show the proposed solution enables 30% higher system throughput than solutions proposed in prior work while maintaining at least as good QoS for LC jobs. Second, I study resource contention in near-future heterogeneous memory architectures (HMAs). This study is motivated by recent developments in non-volatile memory (NVM) technologies, which enable higher storage density at the cost of performance. To understand the performance and QoS impact of HMAs, I design and implement a performance emulator in the Linux kernel that runs unmodified workloads with high accuracy, low overhead, and complete transparency. I further propose and evaluate multiple data and resource management QoS mechanisms, such as locality-aware page admission, occupancy management, and write buffer jailing. Third, I focus on accelerated machine learning (ML) systems. By profiling the performance of production workloads and accelerators, I show that accelerated ML tasks are highly sensitive to main-memory interference due to fine-grained interaction between CPU and accelerator tasks. As a result, memory resource contention can significantly decrease the performance and efficiency gains of accelerators. I propose a runtime system that leverages existing hardware capabilities and show 17% higher system efficiency compared to previous approaches. This study further exposes opportunities for future processor architectures.

Item: Retrospect on contemporary Internet organization and its challenges in the future (2011-05). Gutierrez De Lara, Felipe; Bard, William Carl, 1944-; Julien, Christine, D.Sc.
The intent of this report is to expose the audience to the contemporary organization of the Internet, to highlight the challenges it must deal with in the future, and to describe the current efforts being made to overcome such threats. This report aims to build a frame of reference for how the Internet is currently structured and how the different layers interact to make it possible for the Internet to exist as we know it. Additionally, the report explores the challenges the current Internet architecture is facing, the reasons these challenges are arising, and the multiple efforts taking place to keep the Internet working. To reach these objectives I visited multiple sites of organizations whose sole reason for existence is to support the Internet and keep it functioning. The approach used in writing this report was to research the topic through multiple technical papers from the IEEE database and network conference reviews, and to analyze and expose their findings. This report uses this information to elaborate on how network engineers are handling the challenges of keeping the Internet functional while supporting dynamic requirements. The report covers the challenges the Internet faces with scalability, debugging tools, security, mobility, reliability, and quality of service, briefly explaining how each of these challenges affects the Internet and the strategies in place to vanquish them. The final objectives are to inform the reader of how the Internet works with a set of ever-changing and growing requirements, to give an overview of the multiple institutions dedicated to reinforcing the Internet, and to provide a list of current challenges and the actions being taken to overcome them.