Browsing by Subject "Compiler"
Now showing 1 - 2 of 2
Item: Accelerating virtualization of accelerators (2021-01-21)
Authors: Yu, Hangchen; Rossbach, Christopher J.; Witchel, Emmett; Pingali, Keshav; Erez, Mattan

The use of specialized accelerators is among the most promising paths to better energy efficiency for computationally heavy workloads. However, current software and system support for accelerators is limited, and no production-ready solutions yet exist for efficiently accessing or sharing accelerators in domains such as cloud infrastructure and kernel space. Complex hardware and proprietary software stacks inhibit efficient accelerator virtualization. We observe that practical virtualization has to choose between interposition at the topmost (user API) and bottom-most (hardware) interfaces; virtualization that interposes intermediate stack layers is impractical. Based on these observations, this thesis first presents AvA (Accelerated Virtualization of Accelerators), which exposes practical virtual accelerators in the cloud with strong virtualization properties such as isolation, compatibility, and consolidation. AvA is the first system to show general techniques for API remoting that retain both hypervisor interposition and close-to-native performance, and the first to automatically construct virtual accelerator stacks with hypervisor mediation for arbitrary accelerators. We used AvA to virtualize nine accelerators and eleven framework APIs with orders-of-magnitude lower programming effort than hand-built virtualization support requires; for seven of these accelerators, no virtualization support had previously been explored.

Building on AvA, this thesis presents Akatha (Accelerating Kernel Access to Hardware Acceleration), which uses automation to reduce the developer effort of building efficient access to accelerators for kernel-level work (e.g., filesystem encryption or packet processing). Akatha constructs API-remoting-based kernel accelerator stacks with code generation, leveraging kernel knowledge unavailable in user space to improve performance and resource management: it transparently modifies virtual memory mappings to avoid data transfer between kernel and user space, and it provides a framework and mechanisms to manage contention between user and kernel for accelerator devices. We evaluated Akatha with a range of workloads, showing promising opportunities for OS acceleration.
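As a concrete illustration of the API-remoting pattern that AvA and Akatha automate, the minimal sketch below forwards a guest-side API call through a channel to a host-side executor that runs the real library, with a marked point where a hypervisor could mediate. The in-process Channel, the vec_add API, and every name here are hypothetical stand-ins, not interfaces from either system.

    import queue
    import threading

    class Channel:
        """In-process stand-in for the guest<->host transport
        (in a real system: shared memory rings or hypercalls)."""
        def __init__(self):
            self.to_host = queue.Queue()
            self.to_guest = queue.Queue()

    # Host-side executor: the only place the real accelerator library runs.
    REAL_LIB = {"vec_add": lambda a, b: [x + y for x, y in zip(a, b)]}

    def serve(chan):
        while True:
            api, args = chan.to_host.get()          # unmarshal the forwarded call
            # A hypervisor can interpose here: scheduling, isolation, consolidation.
            chan.to_guest.put(REAL_LIB[api](*args))

    # Guest-side stub: interposes at the user-API boundary and forwards the call.
    def remote_call(chan, api, *args):
        chan.to_host.put((api, args))               # marshal and send
        return chan.to_guest.get()                  # block for the result

    if __name__ == "__main__":
        chan = Channel()
        threading.Thread(target=serve, args=(chan,), daemon=True).start()
        print(remote_call(chan, "vec_add", [1, 2], [3, 4]))  # prints [4, 6]

Per the abstract, AvA generates such stub/executor pairs automatically rather than requiring them to be written by hand, and the interposition point in the executor is what gives the hypervisor its mediation hook.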
Item: Compiler and runtime systems for homomorphic encryption and graph processing on distributed and heterogeneous architectures (2020-05)
Authors: Dathathri, Roshan; Pingali, Keshav; Musuvathi, Madanlal; Ramachandran, Vijaya; Rossbach, Christopher; Snir, Marc

Distributed and heterogeneous architectures are tedious to program because devices such as CPUs, GPUs, and FPGAs provide different programming abstractions and may have disjoint memories, even when they are in the same machine. In this thesis, I present compiler and runtime systems that make it easier to develop efficient programs for privacy-preserving computation and graph processing applications on such architectures.

Fully Homomorphic Encryption (FHE) refers to a set of encryption schemes that allow computations on encrypted data without requiring a secret key. Recent cryptographic advances have pushed FHE into the realm of practical applications, but programming these applications remains a huge challenge, as it requires cryptographic domain expertise to ensure correctness, security, and performance. This thesis introduces a domain-specific compiler for fully-homomorphic deep neural network (DNN) inference as well as a general-purpose language and compiler for fully-homomorphic computation:

1. I present CHET, a domain-specific optimizing compiler designed to make programming DNN inference applications with FHE easier. CHET automates many laborious and error-prone tasks, including selecting encryption parameters that guarantee the security and accuracy of the computation, determining efficient data layouts, and performing scheme-specific optimizations. Our evaluation of CHET on a collection of popular DNNs shows that CHET-generated programs outperform expert-tuned ones by an order of magnitude.

2. I present a new FHE language called Encrypted Vector Arithmetic (EVA), with an optimizing compiler that generates correct and secure FHE programs while hiding all the complexities of the target FHE scheme. Bolstered by this compiler, programmers can develop efficient general-purpose FHE applications directly in EVA. EVA is also designed to serve as an intermediate representation that higher-level domain-specific languages can be compiled to; to demonstrate this, we re-targeted CHET onto EVA. Due to EVA's novel optimizations, its programs are on average ~5.3x faster than those generated by the unmodified version of CHET.

Together, these languages and compilers enable a wider adoption of FHE.

Applications in several areas, such as machine learning, bioinformatics, and security, need to process and analyze very large graphs, and distributed clusters are essential for processing such graphs in reasonable time. I present a novel approach to building distributed graph analytics systems that exploits heterogeneity in processor types, partitioning policies, and programming models. The key to this approach is Gluon, a domain-specific communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface them with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters in the bulk-synchronous parallel (BSP) model and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. We also extend Gluon to support lock-free, non-blocking, bulk-asynchronous execution by introducing the bulk-asynchronous parallel (BASP) model.

Our experiments were run on CPU clusters with up to 256 multi-core, multi-socket hosts and on multi-GPU clusters with up to 64 GPUs. Gluon's communication optimizations improve end-to-end application execution time by ~2.6x on average, and its BASP-style execution is on average ~1.5x faster than its BSP-style execution for graph applications on real-world large-diameter graphs at scale. The D-Galois and D-IrGL systems built using Gluon scale well and are faster than Gemini, the state-of-the-art distributed CPU-only graph analytics system, by factors of ~3.9x and ~4.9x on average using distributed CPUs and distributed GPUs, respectively. The Gluon-based D-IrGL system for distributed GPUs is also on average ~12x faster than Lux, the only other distributed GPU-only graph analytics system. D-IrGL was one of the first distributed GPU graph analytics systems and is the only asynchronous one.
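As a minimal sketch of the BSP execution model that Gluon's communication layer serves, the toy program below runs an SSSP-style computation over a graph split across two "hosts", alternating local relaxation rounds with a synchronization phase that min-reduces boundary nodes replicated across partitions. The two-partition layout and the sync helper are illustrative assumptions made for this sketch; Gluon's actual API and its invariant-based communication pruning are far more involved.

    INF = float("inf")

    # Toy 4-node path graph 0 -> 1 -> 2 -> 3 split across two "hosts";
    # node 2 sits on the partition boundary and is replicated on both.
    labels0 = {0: 0, 1: INF, 2: INF}    # host 0 owns nodes 0-2
    labels1 = {2: INF, 3: INF}          # host 1 owns nodes 2-3
    edges0 = [(0, 1, 1), (1, 2, 1)]
    edges1 = [(2, 3, 1)]
    replicas = {2: [labels0, labels1]}  # boundary node -> hosts holding a copy

    def relax(edges, labels):
        """BSP computation phase: local edge relaxations on one host."""
        changed = False
        for u, v, w in edges:
            if labels[u] + w < labels[v]:
                labels[v] = labels[u] + w
                changed = True
        return changed

    def sync():
        """BSP communication phase: min-reduce each boundary node across its
        replicas, then broadcast the result back. This is the phase Gluon
        optimizes using invariants of the partitioning policy."""
        for node, hosts in replicas.items():
            best = min(h[node] for h in hosts)
            for h in hosts:
                h[node] = best

    # BSP rounds: compute locally, synchronize globally, repeat to a fixed point.
    while relax(edges0, labels0) | relax(edges1, labels1):
        sync()

    print(labels0, labels1)  # node 3 settles at distance 3

In BASP-style execution, as the abstract describes it, hosts avoid blocking at the synchronization barrier and instead keep computing with whatever boundary values have already arrived, which is where the ~1.5x speedup on large-diameter graphs comes from.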