Browsing by Subject "Hardware accelerators"
Now showing 1 - 3 of 3
Item: Generating irregular data-stream accelerators: methodology and applications (2015-05)
Lavasani, Maysam; Chiou, Derek; Abraham, Jacob; Chung, Eric; Gerstlauer, Andreas; Pingali, Keshav

This thesis presents Gorilla++, a language and compiler for generating customized hardware accelerators that process input streams of data. Gorilla++ uses a hierarchical programming model in which sequential engines run in parallel and communicate through FIFO interfaces. The language also incorporates offload and lock constructs to support safe accesses to global resources. Besides conventional compiler optimizations for regular streaming, the programming model opens up new optimization opportunities, including (i) multi-threading to share computation resources among different execution contexts inside an engine, (ii) offload-sharing to share resources between different engines, and (iii) pipe-offloading to pipeline part of a computation that is not efficiently pipelinable as a whole. Because of the dynamic nature of Gorilla++'s target applications, closed-form formulations are not sufficient for exploring the accelerator design space. Instead, the design space is explored iteratively using a rule-based refinement process: in each iteration, rules capture inefficiencies in the design, either bottlenecks or under-utilized resources, and change the design to eliminate them. Gorilla++ is evaluated by generating a set of FPGA-based networking and big-data accelerators. The experimental results demonstrate (i) the expressiveness and generality of the Gorilla++ language, (ii) the effectiveness of the Gorilla++ compiler optimizations, and (iii) the improvement in design space exploration (DSE) from the rule-based refinement process.
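To make the rule-based refinement idea concrete, here is a minimal Python sketch of an iterative refine loop over an abstract design description. The rule names, thresholds, and data structures are illustrative assumptions, not the actual Gorilla++ compiler internals.

```python
# Hypothetical sketch of iterative, rule-based design-space refinement:
# each pass looks for an inefficiency (a bottleneck or an under-utilized
# resource) and applies a rule that removes it. Names and thresholds are
# illustrative, not Gorilla++'s real rule set.
from dataclasses import dataclass, field

@dataclass
class Engine:
    name: str
    threads: int = 1           # execution contexts sharing this engine
    utilization: float = 1.0   # fraction of cycles doing useful work
    is_bottleneck: bool = False

@dataclass
class Design:
    engines: list = field(default_factory=list)

def find_inefficiency(design):
    """Return (rule, engine) for the first inefficiency found, else None."""
    for e in design.engines:
        if e.is_bottleneck:
            return ("relieve_bottleneck", e)      # throughput-limiting engine
        if e.utilization < 0.5 and e.threads < 8:
            return ("add_thread", e)              # idle resource worth sharing
    return None

def apply_rule(rule, engine):
    if rule == "add_thread":                      # multi-thread the engine
        engine.threads += 1
        engine.utilization = min(1.0, engine.utilization * 2)
    elif rule == "relieve_bottleneck":            # stand-in for a structural change
        engine.is_bottleneck = False

def refine(design, max_iters=16):
    for _ in range(max_iters):
        hit = find_inefficiency(design)
        if hit is None:
            break                                 # no inefficiencies remain
        apply_rule(*hit)
    return design
```

Each iteration either relieves a bottleneck or raises the utilization of an idle engine, mirroring the two classes of rules the abstract describes.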
Item: Guardband management in heterogeneous architectures (2016-12)
Leng, Jingwen; Janapa Reddi, Vijay; John, Lizy Kurian; Erez, Mattan; Fussell, Donald S.; Bose, Pradip

Performance and power efficiency are two of the most critical aspects of computing systems. Moore's law (the doubling of transistors on a chip roughly every 18 months), coupled with Dennard scaling, enabled a synergy between device, circuit, microarchitecture, and architecture that drove improvements in both. With the recent end of Dennard scaling, on-chip transistor count continues to increase, but smaller transistors no longer deliver performance-per-power gains. The divergence between rising transistor density and stalling power-efficiency gains has shifted processor design from the single-core CPU to multicore and manycore CPUs, and eventually to heterogeneous architectures. Besides performance and power efficiency, reliability is another crucial computing requirement; yet regardless of how the architecture evolves, processors still trade off a significant portion of performance or power efficiency to ensure it. When running on silicon, processors experience continuously varying operating conditions, such as process, voltage, and temperature (PVT) variation, which can slow down circuits and cause timing errors. The traditional approach to ensuring reliable operation under possible worst-case conditions is to statically assign a large-enough voltage margin (or guardband). Such an approach wastes energy, because the worst-case condition rarely occurs and the processor could have operated at a lower voltage most of the time [36, 48, 77].

We need to actively manage the voltage guardband to fully unlock the efficiency potential of heterogeneous architectures. However, guardband management in heterogeneous architectures is a particularly challenging problem that has not been studied by prior work. On one hand, as transistors become smaller, the impact of PVT variation relative to the nominal voltage becomes more significant [60]. On the other hand, increasing core counts result in larger die area and higher peak power consumption, both of which complicate and enlarge the impact of PVT variation. To this end, this thesis studies cross-layer mechanisms, spanning circuit, (micro)architecture, and software runtime, for managing the guardband in heterogeneous architectures. Most prior work studied guardband management only at the circuit or (micro)architecture level. In comparison, my colleagues and I studied cross-layer mechanisms that require lower hardware design complexity and incur less implementation overhead because software takes a major role in guardband management. Moreover, cross-layer mechanisms alleviate the need for (micro)architecture-specific optimizations, which makes them scalable solutions in the current era of rapidly evolving heterogeneous architectures. The thesis performs this study on the manycore GPU architecture, a representative heterogeneous architecture that has been widely adopted in mainstream computing.

The first part of the thesis focuses on modeling and characterizing PVT variation in the GPU architecture. We first perform a thorough, measurement-based characterization of how the underlying PVT variation affects the voltage guardband. After identifying voltage variation (noise) as the most challenging and necessary factor for guardband management, we study how to accurately model voltage noise in a manycore architecture. The resulting insights into how the circuit, microarchitecture, and program interact to affect PVT variation lay the foundation for the cross-layer guardband management mechanisms studied in this thesis.

The second part studies two guardband-management techniques and demonstrates that they can significantly improve the GPU architecture's energy efficiency. We first study how to improve worst-case guardbanding by performing voltage smoothing, which effectively mitigates large voltage noise and achieves significant energy savings with a smaller guardband requirement. We then study how to adapt to each program's specific guardband requirement to fully unlock the current GPU's efficiency potential, proposing a mechanism called predictive guardbanding in which the program directly predicts its own voltage requirement. The proposed design leverages cross-layer optimization to minimize hardware complexity and overhead.

The last part of the thesis studies reliability optimization for the case in which predictive guardbanding mispredicts with an unexpected error margin. We advocate maintaining reliability at the system level and propose a design paradigm called asymmetric resilience, whose principle is to build the reliable heterogeneous CPU-GPU system around the CPU, relieving the GPU of reliability optimization. We present design principles and practices for heterogeneous systems that adopt this paradigm. Following the principles of asymmetric resilience, we demonstrate how to use the CPU to handle GPU execution errors, which lets the GPU focus on typical-case operation for better energy efficiency. We explore the design space and show that this approach can serve, with reasonable overhead, as the safety-net mechanism for predictive guardbanding.
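The predictive-guardbanding and asymmetric-resilience ideas above amount to a simple control loop: predict a voltage for the upcoming work, run at that voltage, and fall back to a CPU-side safety net if the prediction turns out to be too aggressive. The Python sketch below is purely illustrative; the function names, voltage bounds, and error-detection interface are assumptions, not the thesis's actual hardware/software design.

```python
# Illustrative control loop for predictive guardbanding with a CPU safety net.
# All names and numbers are hypothetical; a real system would rely on hardware
# error detection and platform-specific voltage control.

V_NOMINAL = 1.00   # fully guardbanded (worst-case) supply voltage, in volts
V_FLOOR = 0.80     # lowest voltage the predictor may request

def run_with_predictive_guardband(kernel, predict_voltage, run_on_gpu, rerun_on_cpu):
    """Run a GPU kernel at a predicted voltage; recover via the CPU on error."""
    v = max(V_FLOOR, min(V_NOMINAL, predict_voltage(kernel)))
    ok, result = run_on_gpu(kernel, voltage=v)    # hardware flags timing errors
    if ok:
        return result                             # typical case: guardband energy saved
    # Misprediction: the GPU result cannot be trusted. Per asymmetric
    # resilience, the CPU acts as the safety net and re-executes the work
    # at a reliable operating point.
    return rerun_on_cpu(kernel)
```

In the common case the kernel completes at the lower voltage and the margin energy is saved; only on a misprediction does the system pay for CPU re-execution.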
Item: Weightless neural networks for fast, low-energy inference (2024-08)
Susskind, Zachary; John, Lizy Kurian; Marculescu, Diana; O'Connor, Mike; Chiou, Derek; Erez, Mattan; França, Felipe Maia Galvão

Despite significant advancements in efficient machine learning, deploying models such as deep neural networks (DNNs) on resource-constrained edge devices remains a major challenge. Conventional approaches transform pre-trained models using methods such as pruning and quantization to make better use of limited memory and compute resources. However, these approaches are insufficient when scaling to ultra-low-power “extreme edge” devices, particularly when high throughput and low latency are also desired. This domain demands approaches to machine learning that are designed from first principles to be more efficient in hardware. While some leading approaches, such as binary neural networks (BNNs), are structurally similar to DNNs, others are much more divergent in form. Weightless neural networks (WNNs), a class of machine learning models that perform computation using lookup tables, are interesting candidates in this space due to their inherent nonlinearity, efficiency of operation, and simplicity of construction. In this dissertation, I explore the potential of WNNs to enable fast, efficient inference on the extreme edge. I first discuss BTHOWeN, which combines insights from recent WNN literature with additional algorithmic improvements to create a state-of-the-art weightless model with an accompanying FPGA-based accelerator architecture. I next propose ULEEN, which introduces strategies to further improve the accuracy of WNNs as well as their efficiency in hardware, including a novel multi-pass learning rule and a lookup table pruning strategy. Lastly, I introduce the DWN model architecture, which enables models composed of multiple layers of small, directly connected lookup tables to be trained using a gradient-based flow. In aggregate, these contributions yield WNNs that are fast, efficient, and readily implemented on low-end microcontrollers or as custom hardware accelerators, achieving, for instance, a >2000× reduction in energy-delay product versus fully connected BNNs on an FPGA. Overall, this work positions WNNs as a leading approach for tiny devices.
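To make the lookup-table idea concrete, here is a minimal, generic WiSARD-style sketch in Python: each class owns a bank of small lookup tables addressed by tuples of input bits, training marks the addressed entries, and inference counts how many tables respond. The class names, tuple size, and pseudo-random input mapping are illustrative assumptions; this is not a reproduction of BTHOWeN, ULEEN, or DWN.

```python
# Generic WiSARD-style weightless classifier: lookup tables instead of weights.
# Illustrative only; not the BTHOWeN, ULEEN, or DWN designs from the thesis.
import random

class WeightlessDiscriminator:
    """One class's bank of 1-bit lookup tables (RAMs)."""
    def __init__(self, n_inputs, tuple_size, seed=0):
        rng = random.Random(seed)
        bits = list(range(n_inputs))
        rng.shuffle(bits)                          # fixed pseudo-random input mapping
        self.tuples = [bits[i:i + tuple_size]      # each tuple addresses one RAM
                       for i in range(0, n_inputs, tuple_size)]
        self.rams = [set() for _ in self.tuples]   # a set of seen addresses models a RAM

    def _addresses(self, x):                       # x: sequence of 0/1 input bits
        for t, idxs in enumerate(self.tuples):
            yield t, tuple(x[i] for i in idxs)

    def train(self, x):
        for t, addr in self._addresses(x):
            self.rams[t].add(addr)                 # set the addressed entry to 1

    def score(self, x):
        return sum(addr in self.rams[t] for t, addr in self._addresses(x))

class WeightlessClassifier:
    def __init__(self, n_classes, n_inputs, tuple_size=4):
        self.discriminators = [WeightlessDiscriminator(n_inputs, tuple_size)
                               for _ in range(n_classes)]

    def train(self, x, label):
        self.discriminators[label].train(x)

    def predict(self, x):
        scores = [d.score(x) for d in self.discriminators]
        return max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    clf = WeightlessClassifier(n_classes=2, n_inputs=16)
    clf.train([1] * 8 + [0] * 8, label=0)
    clf.train([0] * 8 + [1] * 8, label=1)
    print(clf.predict([1] * 8 + [0] * 8))          # -> 0
```

Inference here is just a handful of table lookups and a comparison of response counts, which is what makes this style of model a natural fit for FPGA LUTs and small microcontrollers.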