Browsing by Subject "SoC"
Now showing 1 - 7 of 7
Item Analysis of high performance interconnect in SoC with distributed switches and multiple issue bus protocols (2011-05) Narayanasetty, Bhargavi; John, Lizy Kurian; Korson, Steve

In a System on a Chip (SoC), interconnect is the factor limiting Performance, Power, Area and Schedule (PPAS). Distributed crossbar switches, also called Switching Central Resources (SCRs), are often used to implement high performance interconnect in an SoC, making it a Network on a Chip (NoC). Multiple issue bus protocols such as AXI (from ARM) and VBUSM (from TI) are used in paths critical to the performance of the whole chip. An experimental analysis of the effects of architectural modifications to the SCRs on PPAS is carried out using synthesis tools and Texas Instruments (TI) in-house power estimation tools. The effects of scaling SCR sizes are discussed in this report. These results provide a quick means of estimation for architectural changes in the early design phase. Apart from SCR design, the other major concern is deadlocks: situations where network resources are suspended, each waiting on another. In this report, various kinds of deadlocks are classified and their respective mitigations in such networks are provided. These analyses are necessary to qualify distributed SCR interconnect, which uses multiple issue protocols, across all transaction scenarios. The entire analysis in this report is carried out using a flagship product of Texas Instruments: an ASIC SoC for a complex wireless base station developed in 2010-2011, having 20 major cores. Since crossbar switches with multiple issue bus protocols are commonly used in SoCs across the semiconductor industry, this report provides a strong basis for architectural/design selection and validation of all such high performance device interconnects. This report can be used as a seed for the development of an interface tool for architects.
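The deadlock analysis described above amounts, at its core, to finding cycles in a wait-for graph of interconnect resources. A minimal sketch of such a check, with hypothetical resource names (the report's actual classification and tool algorithm are more involved):

```python
# Sketch: flag a potential deadlock by finding a cycle in a wait-for
# graph of interconnect resources (nodes = SCR ports/buffers, edges =
# "waits for"). Resource names are hypothetical, not from the report.

def find_cycle(wait_for):
    """Return a list of resources forming a cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in wait_for}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in wait_for.get(node, []):
            if color.get(nxt, WHITE) == GRAY:       # back edge -> cycle
                return stack[stack.index(nxt):]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None

# Two masters each holding one SCR port while waiting for the other's:
graph = {"scr0_port": ["scr1_port"], "scr1_port": ["scr0_port"]}
```

Running `find_cycle(graph)` returns the cyclic wait chain, while an acyclic wait-for graph yields `None`.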
For a given architecture, the tool suggests architectural modifications and reports deadlock situations. This new tool will aid architects in closing design problems and provide a competitive specification very early in the design cycle. A working algorithm for the tool's development is included in this report.

Item Built-in self test of RF subsystems (2008-12) Zhang, Chaoming, 1980-; Abraham, Jacob A.

With the rapid development of wireless and wireline communications, a variety of new standards and applications are emerging in the marketplace. In order to achieve higher levels of integration, RF circuits are frequently embedded into System on Chip (SoC) or System in Package (SiP) products. These developments, however, lead to new challenges in manufacturing test time and cost. Traditional RF test techniques require expensive high frequency test instruments and long test times, which makes test one of the bottlenecks to reducing IC costs. This research is in the area of built-in self test techniques for RF subsystems. In the test approach followed in this research, on-chip detectors are used to calculate circuit specifications, and data converters are used to collect the data for analysis by an on-chip processor. A novel on-chip amplitude detector has been designed and optimized for RF circuit specification test. By using on-chip detectors, both the system performance and the specifications of the individual components can be accurately measured. On-chip measurement results need to be collected by Analog to Digital Converters (ADCs). A novel time-domain, low-power ADC has been designed for this purpose. The ADC architecture is based on a linear voltage controlled delay line. Using this structure results in a linear transfer function for the input-dependent delay. The time delay difference is then compared to a reference to generate a digital code. Two prototype test chips were fabricated in commercial CMOS processes.
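The conversion principle of the delay-line ADC lends itself to a short behavioral sketch: the input voltage sets a delay through a linear voltage-controlled delay line, and that delay, measured against a reference, is quantized to a digital code. All constants below are illustrative, not taken from the fabricated chip:

```python
# Behavioral sketch of a time-domain delay-line ADC: input voltage ->
# linear delay, delay measured against a reference -> digital code.
# Gain, range, and resolution values are illustrative only.

FULL_SCALE_V = 1.2       # input range (V)
BITS = 6                 # output resolution
GAIN_PS_PER_V = 100.0    # VCDL gain: delay change per volt (ps/V)

def adc_code(vin):
    """Quantize vin to a BITS-wide code via its delay through the line."""
    vin = min(max(vin, 0.0), FULL_SCALE_V)       # clip to input range
    delay = GAIN_PS_PER_V * vin                  # linear transfer function
    full_scale_delay = GAIN_PS_PER_V * FULL_SCALE_V
    lsb = full_scale_delay / (2 ** BITS)         # one LSB, in ps of delay
    return min(int(delay / lsb), 2 ** BITS - 1)
```

With these numbers, `adc_code(0.0)` gives 0 and `adc_code(1.2)` gives the full-scale code 63; the linearity of the delay line is what makes the code a linear function of the input.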
One is for the RF transceiver front end with on-chip detectors; the other is for the test ADC. The 940MHz RF transceiver front end was implemented with on-chip detectors in a 0.18 [micrometer] CMOS technology. The chips were mounted onto RF Printed Circuit Boards (PCBs) with tunable power supply and biasing knobs. The detector was characterized with measurements, which show that it remains linear over a wide input amplitude range of 500mV. Preliminary simulation and measurements show accurate transceiver performance prediction under process variations. A 300MS/s 6-bit ADC was designed using the novel time-domain architecture in a 0.13 [micrometer] standard digital CMOS process. The simulation results show 36.6dB Signal to Noise Ratio (SNR), 34.1dB Signal to Noise and Distortion Ratio (SNDR) for a 99MHz input, Differential Non-Linearity (DNL) < 0.2 Least Significant Bit (LSB), and Integral Non-Linearity (INL) < 0.5 LSB. Overall chip power is 2.7mW with a 1.2V power supply. The built-in detector RF test was extended to a full transceiver RF front end test with a loop-back setup, so that measurements can be made to verify the benefits of the technique. The application of the approach to testing gain, linearity and noise figure was investigated. New detector types are also evaluated. In addition, the low-power delay-line based ADC was characterized and improved to facilitate gathering of data from the detector. Several improved ADC structures at the system level are also analyzed. The built-in detector based RF test technique enables cost-efficient test for SoCs.

Item Characterization of smartphone governor strategies and making of a workload aware governor (2018-05-04) Banerjee, Sarbartha; John, Lizy Kurian

This thesis presents the importance of workload characterization in governing the operational voltage and frequency of a smartphone processor, by running a series of workloads on an ARM v8 processor.
The idea of finishing a task as fast as possible to return to the idle state (race-to-idle) versus the idea of choosing the correct frequency for time deltas (pace-to-idle) is studied in detail. Android governors either statically use a single frequency for the entire active time or determine the voltage and frequency dynamically based on the load average on the processor. Similar load averaging strategies are used for other blocks in the SoC (System on Chip), such as the GPU or the media processor. However, the different blocks of an SoC draw power from the same current source. Owing to the lack of fine-grained workload characterization, power is redirected to less important units, yielding poor performance and energy efficiency. The behavior of different existing governors is explored by running a variety of workloads, and the optimal strategy for energy efficiency while satisfying an acceptable user performance is analyzed. Crucial traits of active user applications are inferred from the scheduler to fine-tune the optimal voltage and frequency across different blocks under a constrained power source, to build a system-wide governor.

Item Core-characteristic-aware off-chip memory management in a multicore system-on-chip (2012-12) Jeong, Min Kyu; Erez, Mattan; John, Lizy K.; Chiou, Derek; Lin, Calvin; Schulte, Michael J.

Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth that is shared between those many cores, however, scales more slowly than the transistor (and core) count does. As a result, in many future systems, off-chip bandwidth will become the bottleneck under heavy demand from multiple cores. Therefore, optimally managing the limited off-chip bandwidth is critical to achieving high performance and efficiency in future systems.
In this dissertation, I will develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip-multiprocessors. I focus on issues that arise from the sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirements, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can result in reduced spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. For three different workload groups that consist of benchmarks with high spatial locality, low spatial locality, and mixes of the two, the average system efficiency improves by 10%, 7%, and 9% for 2-rank systems, and 18%, 25%, and 20% for 1-rank systems, respectively, over the baseline shared-bank system. Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. I show that statically prioritizing the CPU can severely constrict GPU performance, and propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of the GPU workload.
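The progress-based priority idea can be illustrated with a small arbitration sketch: GPU requests outrank CPU requests only when the GPU's measured progress lags the pace it needs to meet its deadline. This is a toy model of the general approach, not the dissertation's actual scheduler; all names and thresholds are hypothetical:

```python
# Sketch of deadline-aware arbitration between CPU and GPU memory
# requests: favor the CPU by default, but boost GPU priority when the
# GPU falls behind the progress needed to meet its frame deadline.
# A toy model of the approach; numbers and names are hypothetical.

def gpu_priority(work_done, work_total, time_elapsed, deadline):
    """Return True if GPU requests should currently outrank CPU requests."""
    expected = work_total * (time_elapsed / deadline)  # on-pace target
    return work_done < expected                        # behind -> boost

def pick_next(cpu_queue, gpu_queue, work_done, work_total,
              time_elapsed, deadline):
    """Pop and return the next memory request to service."""
    if gpu_queue and (not cpu_queue or
                      gpu_priority(work_done, work_total,
                                   time_elapsed, deadline)):
        return gpu_queue.pop(0)
    if cpu_queue:
        return cpu_queue.pop(0)
    return None
```

For example, a GPU that has completed 40 of 100 work units at the halfway point of its deadline is behind pace, so its request is serviced first; at 60 of 100 it is ahead, and the CPU keeps priority.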
The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination with different complexities.

Item Correct low power design transformations for hardware systems (2013-08) Viswanath, Vinod; Abraham, Jacob A.

We present a generic proof methodology to automatically prove the correctness of design transformations introduced at the Register-Transfer Level (RTL) to achieve lower power dissipation in hardware systems. We also introduce a new algorithm to reduce switching activity power dissipation in microprocessors. We further apply our technique in the completely different domain of dynamic power management of Systems-on-Chip (SoCs). We demonstrate our methodology on real-life circuits. In this thesis, we address the dual problem of transforming hardware systems at higher levels of abstraction to achieve lower power dissipation, and of reliably verifying the correctness of the aforementioned transformations. The thesis is in three parts. The first part introduces Instruction-driven Slicing, a new algorithm to automatically introduce RTL/system-level annotations in microprocessors to achieve lower switching power dissipation. The second part introduces Dedicated Rewriting, a rewriting-based generic proof methodology to automatically prove the correctness of such high-level transformations for lowering power dissipation. The third part implements dedicated rewriting in the context of dynamically managing the power dissipation of mobile and hand-held devices. We first present instruction-driven slicing, a new technique for annotating microprocessor descriptions at the Register Transfer Level in order to achieve lower power dissipation. Our technique automatically annotates existing RTL code to optimize the circuit for lowering the power dissipated by switching activity. Our technique can be applied at the architectural level as well, achieving similar power gains.
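The intuition behind slicing-style switching-power reduction can be sketched abstractly: units of the pipeline that the current instruction does not exercise need not switch, so their clocks can be held. This is only a toy illustration of the general idea, with hypothetical unit and opcode names, not the thesis's actual annotation algorithm or the OR1200 RTL:

```python
# Toy sketch of the intuition behind instruction-driven switching-power
# reduction: hold (clock-gate) pipeline units that the current
# instruction does not exercise. Unit and opcode names are hypothetical.

UNITS_USED = {
    "add":  {"decode", "alu", "regfile"},
    "load": {"decode", "agu", "dcache", "regfile"},
    "nop":  {"decode"},
}
ALL_UNITS = {"decode", "alu", "agu", "dcache", "regfile"}

def active_units(opcode):
    """Units the instruction exercises; unknown opcodes gate nothing."""
    return UNITS_USED.get(opcode, ALL_UNITS)

def gated_units(opcode):
    """Units whose clocks can be held for this instruction."""
    return ALL_UNITS - active_units(opcode)
```

For a "nop", everything except decode can be gated; for an "add", the address generation unit and data cache sit idle. The correctness obligation that the thesis addresses is precisely that such gating never changes the architecturally visible result.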
We first demonstrate our technique on architectural and RTL models of a 32-bit OpenRISC pipelined processor (OR1200), showing power gains for the SPEC2000 benchmarks. These annotations achieve a reduction in power dissipation by changing the logic of the design. We further extend our technique to an out-of-order superscalar core and demonstrate power gains for the same SPEC2000 benchmarks on architectural and RTL models of PUMA, a fixed-point out-of-order PowerPC microprocessor. We next present dedicated rewriting, a novel technique to automatically prove the correctness of low power transformations in hardware systems described at the Register Transfer Level. We guarantee the correctness of any low power transformation by providing a functional equivalence proof of the hardware design before and after the transformation. Dedicated rewriting is a highly automated deductive verification technique specially honed for proving the correctness of low power transformations. We provide a notion of equivalence and establish the equivalence proof within our dedicated rewriting system. We demonstrate our technique on a non-trivial case study: we show equivalence of a Verilog RTL implementation of a Viterbi decoder, a component of the DRM System-on-Chip (SoC), before and after the application of multiple low power transformations. We next apply dedicated rewriting to the broader context of holistic power management of SoCs. This in turn creates a self-checking system that will automatically flag conflicting constraints or rules. Our system manages power constraint rules using dedicated rewriting specially honed for dynamic power management of SoC designs. Together, this provides a common platform and representation through which hardware and software constraints cooperate seamlessly to achieve maximum platform power optimization dynamically during execution.
We demonstrate our technique in multiple contexts on an SoC design of the state-of-the-art next generation Intel smartphone platform. Finally, we give a proof of instruction-driven slicing. We first prove that the annotations automatically introduced in the OR1200 processor preserve the original functionality of the machine, using the ACL2 theorem prover. Then we establish the same proof within our dedicated rewriting system, and discuss the merits of such a technique and framework. In the context of today's shrinking hardware and mobile internet devices, lowering power dissipation is a key problem. Verifying the correctness of transformations which achieve it is usually a time-consuming affair. Automatic and reliable methods of verification that are easy to use are extremely important. In this thesis we have presented one such transformation, and a generic framework to prove the correctness of that and similar transformations. Our methodology is constructed in a manner that easily and seamlessly fits into the design cycle of creating complicated hardware systems. Our technique is also general enough to be applied in the completely different context of dynamic power management of mobile and hand-held devices.

Item Integrated temperature sensors in deep sub-micron CMOS technologies (2014-05) Chowdhury, Golam Rasul; Hassibi, Arjang

Integrated temperature sensors play an important role in enhancing the performance of on-chip power and thermal management systems in today's highly-integrated system-on-chip (SoC) platforms, such as microprocessors. Accurate on-chip temperature measurement is essential to maximize the performance and reliability of these SoCs. However, due to non-uniform power consumption by different functional blocks, microprocessors have fairly large thermal gradients (and variation) across their chips. In the case of multi-core microprocessors, for example, there are task-specific thermal gradients across different cores on the same die.
As a result, multiple temperature sensors are needed to measure the temperature profile at all relevant coordinates of the chip. Subsequently, the results of the temperature measurements are used to take corrective measures to enhance performance, or to save the SoC from catastrophic overheating situations which can cause permanent damage. Furthermore, in a large multi-core microprocessor, it is also imperative to continuously monitor potential hot spots that are prone to thermal runaway. The locations of such hot spots depend on the operations and instructions the processor carries out at a given time. Due to practical limitations, it is overkill to place a large temperature sensor next to every possible hot spot. Thus, an ideal on-chip temperature sensor should have minimal area so that it can be placed non-invasively across the chip without drastically changing the chip floor plan. In addition, the power consumption of the sensors should be very low to reduce the power budget overhead of the thermal monitoring system, and to minimize measurement inaccuracies due to self-heating. The objective of this research is to design an ultra-small, ultra-low-power temperature sensor that can be placed in close proximity to all possible hot spots across the chip. The general idea is to use the leakage current of a reverse-biased p-n junction diode as an operand for temperature sensing.
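The sensing principle works because reverse-bias leakage depends very strongly on temperature. A rough first-order model is I(T) ≈ I0·exp(-Eg/(kT)), where Eg is the silicon bandgap and k is Boltzmann's constant; this ignores the weaker polynomial temperature dependence and assumes ideal calibration, so it is a sketch of the idea rather than the dissertation's actual sensor model:

```python
# Sketch: infer temperature from reverse-bias diode leakage via the
# exponential model I(T) ~ I0 * exp(-Eg / (k*T)). First-order only:
# ignores the polynomial T dependence; I0 would be calibrated per die.
import math

K_EV = 8.617e-5   # Boltzmann constant (eV/K)
EG_EV = 1.12      # silicon bandgap (eV), treated as constant here

def leakage(temp_k, i0=1.0):
    """Model leakage current at temperature temp_k (Kelvin)."""
    return i0 * math.exp(-EG_EV / (K_EV * temp_k))

def temp_from_leakage(i_meas, i0=1.0):
    """Invert the model analytically: T = -Eg / (k * ln(I/I0))."""
    return -EG_EV / (K_EV * math.log(i_meas / i0))
```

The round trip `temp_from_leakage(leakage(T))` recovers T, and the steepness of the exponential is what gives a leakage-based sensor its sensitivity near operating temperatures.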
The tasks within this project are to examine the theoretical aspects of such sensors in both Silicon-On-Insulator (SOI) and bulk Complementary Metal-Oxide Semiconductor (CMOS) technologies, implement them in deep sub-micron technologies, and ultimately evaluate their performance and compare it to existing solutions.

Item Power-aware processor system design (2020-05) Kalyanam, Vijay Kiran; Abraham, Jacob A.; Orshansky, Michael; Pan, David; Touba, Nur; Tupuri, Raghuram

With everyday advances in technology and low-cost economics, processor systems are moving towards split grid shared power delivery networks (PDNs) while providing increased functionality and higher performance capabilities, resulting in increased power consumption. Split grid refers to dividing up the power grid resources among various homogeneous and heterogeneous functional modules and processors. When the PDN is shared and common across multiple processors and function blocks, it is called a shared PDN. In order to keep power under control on a split-grid shared PDN, the processor system is required to operate while various hardware modules interact with each other and the supply voltage (V [subscript DD]) and clock frequency (F [subscript CLK]) are scaled. Software- or hardware-assisted power collapse and low-power retention modes can be automatically engaged in the processor system. The processor system should also operate at maximum performance under power constraints while consuming the full thermal design power (TDP). The processor system should neither violate board and card current limits nor the power management integrated circuit (PMIC) limits or its slew rate requirements for current draw on the shared PDN. It is expected to operate within thermal limits, below a set operating temperature. The processor system is also required to detect and mitigate current violations within microseconds and temperature violations within milliseconds.
The processor system is expected to be robust and should be able to tolerate voltage droops. This requirement becomes more important when the processor system is on a shared PDN. Because of the sharing of the PDN, the voltage droop mitigation scheme is expected to be quick and must suppress V [subscript DD] droop propagation at the source while introducing only negligible performance penalties during mitigation. Without a solution for V [subscript DD] droop in place, the entire V [subscript DD] of the shared PDN is forced to a higher voltage, increasing overall system power. This can potentially affect the days of use (DoU) of battery-operated systems, and the reliability and cooling of wired systems. A multi-threaded processor system is expected to monitor current, power and voltage violations and react quickly without affecting the performance of its hardware threads, while maintaining quality of service (QoS). Early high-level power estimates are a necessity to project how much power will be consumed by a future processor system. These power projections are used to plan for software use cases and to reassign power domains of processors and function blocks belonging to the shared PDN. Additionally, they help in re-designing boards and power cards, re-implementing the PDN, changing the PMIC, and planning for additional power, current, voltage and temperature violation mitigation schemes if the existing solutions are insufficient. The split grid shared PDN implemented in a system-on-chip (SoC) is driven by low-cost electronics and forces multiple voltage rails for better energy efficiency. To support this, voltage levels and power states need to be incorporated into the processor's behavioral register transfer level (RTL) model. Low power verification is a must in a split-grid PDN. To facilitate this, the RTL is annotated with voltage supplies and isolation circuits that engage and protect during power collapse scenarios across various voltage domains.
The power-aware RTL design is verified, and low power circuit and RTL bugs are identified and corrected prior to tape-out. The mandatory features to limit current, power, voltage and temperature in these high performance, power hungry processor systems introduce a need for high-level power projections for a processor system, accounting for the various split-grid PDN rails supplying V [subscript DD] to the processor, the interface bus, various function blocks, and co-processors. To solve this problem, a power prediction solution is provided that has an average-power prediction error of 8% and tracks instantaneous power with reasonable accuracy for unknown software application traces. The compute time to calculate power using the generated prediction model is 100,000X faster, and uses 100X less compute memory, compared to a commercial electronic design automation (EDA) RTL power tool. This solution is also applied to generate a digital power meter (DPM) in hardware for real-time power estimates while the processor is operational. These high-level power estimates project the potential peak currents in these processor systems. This resulted in a need for new tests to be created and validated on silicon in order to functionally stress the split-grid shared PDN under extreme voltage droop and sustained high current usage scenarios. For this reason, functional test sequences are created for high power and voltage stress testing of multi-threaded processors. The PDN is a complex system and needs different functional test sequences to generate the various kinds of high and low power instruction packets that can stress it. These voltage droop stress tests affect V [subscript MIN] margins in various voltage and frequency modes of operation in a commercial multi-threaded processor. These results underscore the need for voltage mitigation solutions.
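A power meter of the kind described is often approximated as a linear model over activity counters, with weights fit offline against a reference power tool. The sketch below illustrates that general modeling style only; the event names, weights, and static term are hypothetical, not the dissertation's fitted model:

```python
# Sketch of a counter-based power model in the style of a digital power
# meter (DPM): estimated power = static term + weighted sum of activity
# counters. Event names and weights are hypothetical, standing in for
# coefficients that would be fit against a reference RTL power tool.

WEIGHTS_MW = {            # mW per counted event (illustrative)
    "issued_ops": 0.020,
    "cache_accesses": 0.050,
    "bus_beats": 0.080,
}
STATIC_MW = 12.0          # leakage + always-on power (illustrative)

def estimate_power_mw(counters):
    """Estimate power for one sampling interval from activity counters."""
    dynamic = sum(WEIGHTS_MW[e] * counters.get(e, 0) for e in WEIGHTS_MW)
    return STATIC_MW + dynamic

sample = {"issued_ops": 1000, "cache_accesses": 200, "bus_beats": 50}
```

Evaluating `estimate_power_mw(sample)` gives 12.0 + 20.0 + 10.0 + 4.0 = 46.0 mW; because the model is a small dot product, it is cheap enough to evaluate every interval in hardware or alongside fast trace-driven simulation.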
The processor system operating on a split grid shared PDN can have its V [subscript MIN] increased due to voltage stress tests or a power-virus software application. The shared PDN imposes requirements to mitigate the voltage noise at the source and avoid any increase of the shared PDN V [subscript DD]. This necessitates a proactive system that can mitigate voltage droop before it occurs while lowering the processor's minimum voltage of operation (V [subscript MIN]) to help reduce system power. To mitigate voltage droops, a proactive clock gating system (PCGS) is implemented with a voltage clock gate (VCG) circuit that uses a digital power meter (DPM) and a model of the PDN to predict a voltage droop before its occurrence. Silicon results show that PCGS achieves 10% higher clock frequency (F [subscript CLK]) and 5% lower supply voltage (V [subscript DD]) in a 7nm processor. Questions arise about the effectiveness of PCGS over a reactive voltage droop mitigation scheme in the context of a shared PDN. This leads to an analysis of PCGS and its comparison against a reactive voltage droop mitigation scheme. This work shows the importance of voltage droop mitigation reaction time for a split grid shared PDN and highlights the benefits of PCGS in its ability to provide a better V [subscript MIN] for the entire split grid shared PDN. The silicon results from power-stress tests show the possibility of the high-power processor system exceeding board or power-supply card current capacity and thermal limits. This requires designing a limiting system that can adapt processor performance. This limiting system is expected to meet a stringent system latency of 1 µs for sustained peak-current violations and to react on the order of milliseconds for thermal mitigation. This system is also expected to maintain the desired Quality of Service (QoS) of the multi-threaded processor.
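The proactive idea of combining a power meter with a PDN model can be sketched numerically: estimate the current draw, push it through a simple supply-impedance model, and hold the clock when the predicted voltage falls below a guard band. The sketch below uses only a first-order R and L·dI/dt term with illustrative constants; the actual PCGS and its PDN model are more sophisticated:

```python
# Sketch of proactive droop mitigation: predict the supply voltage from
# a power-meter current estimate through a crude PDN model (resistive
# drop plus L*dI/dt), and gate the clock when the prediction crosses a
# guard band. All constants are illustrative, not from the 7nm design.

R_OHM = 0.01       # effective PDN resistance (ohms)
L_H = 1e-10        # effective PDN inductance (henries)
DT_S = 1e-9        # one clock period at 1 GHz (seconds)
V_NOM = 0.80       # nominal supply (V)
V_GUARD = 0.76     # droop guard band (V)

def predict_vdd(i_now, i_prev):
    """First-order supply prediction from estimated current draw."""
    di_dt = (i_now - i_prev) / DT_S
    return V_NOM - (R_OHM * i_now + L_H * di_dt)

def gate_clock(i_now, i_prev):
    """True -> hold the clock this cycle to head off a predicted droop."""
    return predict_vdd(i_now, i_prev) < V_GUARD
```

With these constants, a steady 1 A draw predicts a safe supply, while a sudden step from 1 A to 2 A predicts a droop below the guard band and triggers gating; acting on the prediction, rather than on a measured droop, is what makes the scheme proactive.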
This results in the implementation of a current and temperature limiting response circuit in a 7nm commercial processor. The randomized pulse modulation (RPM) circuit adapts processor performance and reduces current violations in the system within 1 µs, and maintains thread fairness with a 0.4% performance resolution across a wide range of operation from 100% to 0.4%. Hard requirements from SoC software and hardware constrain the processor systems sharing the split grid PDN to stay within the TDP and power budgets. The power consumed by previous-generation threads (processors) is now exceeded by that of new threads (processors) with added functionality, which can consume much higher power. The threads (processors) operate cohesively in a multi-threaded processor system, and though there is a large difference in the magnitude of power profiles across threads (processors), the overall performance of the multi-threaded processor is not expected to be compromised. This creates a need for a power limiting system that can specifically slow down the high-power threads (processors) to meet power budgets without affecting the performance of low-power threads. For this reason, a thread-specific multi-thread power limiting (MTPL) mechanism is designed that monitors the processor power consumption using a per-thread DPM (PTDPM). Implemented in 7nm for a commercial processor, silicon results demonstrate that the thread-specific MTPL does not affect the performance of low power threads during power limiting until the current (power) is limited to very low values. For high power threads and during higher current (power) limiting scenarios, the thread-specific MTPL shows performance similar to a conventional global limiting mechanism. Thus, the thread-specific MTPL enables the multi-threaded processor system to operate at a higher overall performance compared to a conventional global mechanism across most of the power budget range.
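The thread-specific limiting idea can be sketched as: measure each thread's power, and throttle only threads whose draw exceeds their share of the budget, leaving low-power threads untouched. This is a simplified illustration of the general per-thread approach; the budget split, duty-cycle mapping, and minimum resolution below are hypothetical, not the silicon MTPL policy:

```python
# Sketch of thread-specific power limiting: throttle only threads whose
# measured power exceeds their budget share, leaving low-power threads
# at full duty cycle. Budget split and resolution are illustrative.

def throttle_duty(thread_power_mw, total_budget_mw, min_duty=0.004):
    """Map per-thread measured power to a duty cycle in [min_duty, 1.0]."""
    per_thread_budget = total_budget_mw / len(thread_power_mw)
    duties = {}
    for tid, power in thread_power_mw.items():
        if power <= per_thread_budget:
            duties[tid] = 1.0                       # under budget: untouched
        else:
            duty = per_thread_budget / power        # scale down to budget
            duties[tid] = max(min_duty, duty)
    return duties

power = {"t0": 50.0, "t1": 400.0, "t2": 80.0, "t3": 70.0}
```

With a 400 mW total budget (100 mW per thread), only the 400 mW thread t1 is throttled, to a 0.25 duty cycle; a global scheme would instead slow all four threads, which is why the per-thread policy preserves more overall performance.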
For the same power budget, the processor performance can be up to 25% higher using the thread-specific MTPL compared to using a global power limiting scheme. In summary, in this dissertation, design-for-power concepts are presented for a processor system on a split-grid shared PDN through various solutions that address challenges in high-power processors and help alleviate potential problems. These solutions range from embedding power intent, to incorporating voltage droop prediction intelligence through power usage estimation, to maintaining quality of service within a stringent system latency, to slowing down specific high-power threads of a multi-threaded processor. All these methods can work cohesively to incorporate power-awareness into processor systems, making the processors energy efficient and enabling them to operate reliably within the TDP.