Browsing by Subject "Runtime system"
Item: Compiler and runtime systems for homomorphic encryption and graph processing on distributed and heterogeneous architectures (2020-05)
Dathathri, Roshan; Pingali, Keshav; Musuvathi, Madanlal; Ramachandran, Vijaya; Rossbach, Christopher; Snir, Marc

Distributed and heterogeneous architectures are tedious to program because devices such as CPUs, GPUs, and FPGAs provide different programming abstractions and may have disjoint memories, even if they are on the same machine. In this thesis, I present compiler and runtime systems that make it easier to develop efficient programs for privacy-preserving computation and graph processing applications on such architectures.

Fully Homomorphic Encryption (FHE) refers to a set of encryption schemes that allow computation on encrypted data without access to the secret key. Recent cryptographic advances have pushed FHE into the realm of practical applications. However, programming these applications remains a huge challenge, because it requires cryptographic domain expertise to ensure correctness, security, and performance. This thesis introduces a domain-specific compiler for fully-homomorphic deep neural network (DNN) inference as well as a general-purpose language and compiler for fully-homomorphic computation:

1. I present CHET, a domain-specific optimizing compiler designed to make it easier to program DNN inference applications using FHE. CHET automates many laborious and error-prone programming tasks, including selecting encryption parameters that guarantee the security and accuracy of the computation, determining efficient data layouts, and performing scheme-specific optimizations (a toy sketch of the parameter-selection analysis follows this abstract). Our evaluation of CHET on a collection of popular DNNs shows that CHET-generated programs outperform expert-tuned ones by an order of magnitude.

2. I present a new FHE language called Encrypted Vector Arithmetic (EVA), which includes an optimizing compiler that generates correct and secure FHE programs while hiding all the complexities of the target FHE scheme. Bolstered by our optimizing compiler, programmers can develop efficient general-purpose FHE applications directly in EVA. EVA is also designed to work as an intermediate representation that higher-level domain-specific languages can target. To demonstrate this, we have re-targeted CHET onto EVA. Thanks to EVA's novel optimizations, its programs are on average ~5.3x faster than those generated by the unmodified version of CHET.

Together, these languages and compilers enable wider adoption of FHE.

Applications in several areas, such as machine learning, bioinformatics, and security, need to process and analyze very large graphs, and distributed clusters are essential for processing them in a reasonable time. I present a novel approach to building distributed graph analytics systems that exploits heterogeneity in processor types, partitioning policies, and programming models. The key to this approach is Gluon, a domain-specific communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface them with Gluon using a lightweight API; a schematic BSP round is sketched after this abstract. Gluon enables these programs to run on heterogeneous clusters in the bulk-synchronous parallel (BSP) model and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. We also extend Gluon to support lock-free, non-blocking, bulk-asynchronous execution by introducing the bulk-asynchronous parallel (BASP) model.

Our experiments were run on CPU clusters with up to 256 multi-core, multi-socket hosts and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ~2.6x on average. Gluon's BASP-style execution is on average ~1.5x faster than its BSP-style execution for graph applications on real-world large-diameter graphs at scale. The D-Galois and D-IrGL systems built using Gluon scale well and are faster than Gemini, the state-of-the-art distributed CPU-only graph analytics system, by factors of ~3.9x and ~4.9x on average using distributed CPUs and distributed GPUs, respectively. The Gluon-based D-IrGL system for distributed GPUs is also on average ~12x faster than Lux, the only other distributed GPU-only graph analytics system. D-IrGL was one of the first distributed GPU graph analytics systems and is the only asynchronous one.
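The parameter-selection task that CHET and EVA automate can be made concrete with a small example. The following is a minimal, self-contained C++ sketch, not CHET's or EVA's actual API: it walks a toy arithmetic circuit, computes its multiplicative depth, and picks a CKKS-style polynomial modulus degree from an illustrative table. A real compiler would additionally track the scale of each intermediate value and the target security level.

```cpp
// Toy model of an FHE compiler analysis (hypothetical; not CHET/EVA code):
// compute the multiplicative depth of an arithmetic circuit and choose
// encryption parameters large enough for that depth.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node {
    enum Kind { Input, Add, Mul } kind;
    std::vector<const Node*> operands;
};

// Each multiplication consumes one ciphertext "level"; additions are free.
// (A real compiler would memoize this walk over the DAG.)
int depth(const Node& n) {
    int d = 0;
    for (const Node* op : n.operands) d = std::max(d, depth(*op));
    return d + (n.kind == Node::Mul ? 1 : 0);
}

// Illustrative depth-to-parameter table; real values depend on the CKKS
// scales in the program and the desired security level.
int choose_poly_degree(int mul_depth) {
    if (mul_depth <= 2) return 8192;
    if (mul_depth <= 6) return 16384;
    return 32768;
}

int main() {
    Node x{Node::Input, {}}, w{Node::Input, {}};
    Node xw{Node::Mul, {&x, &w}};    // x * w      -> depth 1
    Node sq{Node::Mul, {&xw, &xw}};  // (x * w)^2  -> depth 2
    Node out{Node::Add, {&sq, &w}};  // add: no extra depth
    int d = depth(out);
    std::printf("multiplicative depth = %d -> poly modulus degree = %d\n",
                d, choose_poly_degree(d));
}
```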
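The Gluon workflow above follows a compute-then-communicate round structure. The sketch below simulates it in one process with two "hosts" running connected-components label propagation on a partitioned path graph; after each round, mirror-node updates are reduced onto their masters (with min) and the results broadcast back. The Host struct and the hard-coded sync loop are hypothetical stand-ins, not Gluon's real API.

```cpp
// Simulated BSP rounds in the style Gluon supports (hypothetical API):
// each "host" runs a shared-memory operator on its partition, then mirror
// nodes are synchronized with their masters by a min-reduce + broadcast.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Host {
    std::vector<std::pair<int, int>> edges;  // local edges (global node ids)
    std::vector<int> owned;                  // nodes this host masters
    std::vector<int> mirrors;                // local copies of remote nodes
};

int main() {
    const int N = 4;                         // path graph 0-1-2-3
    std::vector<int> label0(N), label1(N);   // per-host label arrays
    for (int i = 0; i < N; ++i) label0[i] = label1[i] = i;

    Host h0{{{0, 1}, {1, 2}}, {0, 1}, {2}};  // host 0 mirrors node 2
    Host h1{{{2, 3}}, {2, 3}, {}};           // host 1 masters nodes 2 and 3

    bool changed = true;
    int rounds = 0;
    while (changed) {
        changed = false;
        // Phase 1: local compute (label propagation on each partition).
        auto relax = [&](Host& h, std::vector<int>& lab) {
            for (auto [u, v] : h.edges) {
                int m = std::min(lab[u], lab[v]);
                if (lab[u] != m || lab[v] != m) {
                    lab[u] = lab[v] = m;
                    changed = true;
                }
            }
        };
        relax(h0, label0);
        relax(h1, label1);
        // Phase 2: communicate. Host 0's mirrors are mastered by host 1:
        // reduce mirror updates onto the master (min), then broadcast back.
        for (int g : h0.mirrors) {
            label1[g] = std::min(label1[g], label0[g]);
            label0[g] = label1[g];
        }
        ++rounds;
    }
    std::printf("quiesced after %d rounds; components:", rounds);
    for (int g : h0.owned) std::printf(" n%d->%d", g, label0[g]);
    for (int g : h1.owned) std::printf(" n%d->%d", g, label1[g]);
    std::printf("\n");
}
```

Under BASP, hosts would instead keep computing with whatever updates have already arrived rather than waiting at each round boundary.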
Item: Resilient heterogeneous systems with Containment Domains (2020-02-03)
Lee, Kyushick; Erez, Mattan; Touba, Nur A.; Tiwari, Mohit; Rossbach, Christopher J.; Sullivan, Michael B.

Resilience is a continuing concern for extreme-scale scientific applications. Tolerating ever-increasing hardware fault rates demands a scalable end-to-end resilience scheme. The fundamental issue with current system-wide techniques, such as checkpoint-restart, is their one-size-fits-all approach, which recovers from local failures globally. The challenges of supporting efficient resilience grow with scale and with the trend of adopting accelerators. Resilience tailored to an application offers a potential breakthrough, enabling efficient localized recovery, because an individual node maintains a low failure rate even at scale. I propose a framework realizing Containment Domains (CDs) that addresses the resilience challenges of future-scale heterogeneous systems. My dissertation consists of two parts: tackling the resilience problem on CPU-only systems with CDs, and extending CDs to systems with GPUs.

In the first part, I develop the CDs framework and adapt CDs-based resilience to real-world applications to verify its analytical model and show its feasibility. CDs elevate resilience to a first-class abstraction and exploit application properties to hierarchically decompose applications into local domains that contain errors. Confining errors to such logical domains within a program enables localized recovery, as sketched below. The CDs framework validates the analytical model of CDs, which matches the trend of the efficiency results measured by running CD-enabled applications with error injection. Based on the analytical model, I develop an automated workflow that tunes Containment Domains by leveraging differing failure likelihoods, the storage hierarchy, and application characteristics. The CD-based resilience estimated by the analytical model projects higher efficiency than the state of the art and promises scalability toward exascale computing.
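As a rough illustration of the idiom this first part describes, here is a toy, self-contained C++ sketch of hierarchical domains that preserve the state they may corrupt, detect an error, and re-execute locally from the preserved copy. The Domain type and its run method are invented for illustration and only echo the begin/preserve/complete vocabulary of the CD papers; they are not the CDs framework's real API.

```cpp
// Toy containment-domain runtime (invented for illustration; not the real
// CDs API): a domain preserves the state it may corrupt on entry, and a
// detected error triggers re-execution of that domain alone, from its own
// preserved copy, instead of a global rollback.
#include <cstdio>
#include <functional>
#include <vector>

struct Domain {
    std::vector<double> preserved;  // state checkpointed at domain begin

    // Run `body` on `state`; `body` returns false to signal a detected error.
    void run(std::vector<double>& state,
             const std::function<bool(std::vector<double>&)>& body) {
        preserved = state;                      // preserve-on-begin
        while (!body(state)) {                  // detect
            std::puts("  error contained: restoring and re-executing domain");
            state = preserved;                  // localized recovery
        }
    }                                           // domain complete
};

int main() {
    std::vector<double> data(8, 1.0);
    Domain root, child;
    int faults_left = 1;  // inject one transient fault inside the child

    root.run(data, [&](std::vector<double>& s) {
        // Child domain: scale the second half; its fault never escapes it.
        child.run(s, [&](std::vector<double>& t) {
            for (std::size_t i = 4; i < t.size(); ++i) t[i] *= 2.0;
            return faults_left-- <= 0;  // false once: simulated transient error
        });
        return true;  // parent body succeeds without re-execution
    });

    for (double v : data) std::printf("%.1f ", v);  // 1.0 x4 then 2.0 x4
    std::printf("\n");
}
```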
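The tuning workflow rests on an analytical model that trades preservation cost against expected re-execution. The dissertation's model is hierarchical and per-domain; as a one-level stand-in, the sketch below uses Young's classic checkpoint-interval approximation, tau ~= sqrt(2 * C * MTBF), with made-up costs, to show the kind of efficiency estimate such a model produces.

```cpp
// One-level stand-in for a resilience cost model (illustrative numbers):
// Young's approximation tau ~= sqrt(2 * C * MTBF) picks the preservation
// interval, and a first-order estimate gives the resulting efficiency.
// The dissertation's model is hierarchical and per-domain; this is not it.
#include <cmath>
#include <cstdio>

int main() {
    double mtbf = 24.0 * 3600.0;  // node mean time between failures (s)
    double c = 30.0;              // cost to preserve one domain's state (s)
    double tau = std::sqrt(2.0 * c * mtbf);  // Young's interval (s)
    // Overhead = preservation per interval + expected rework on a failure
    // (about half an interval, plus a restore assumed to cost ~c).
    double overhead = c / tau + (tau / 2.0 + c) / mtbf;
    std::printf("interval = %.0f s, estimated efficiency = %.3f\n",
                tau, 1.0 - overhead);
}
```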
In the second part of the dissertation, I extend CDs to CUDA applications on high-performance computing (HPC) systems with GPUs. GPUs offer higher computational power at lower energy and cost than homogeneous CPU-only nodes, and heterogeneous nodes in modern HPC systems tend toward a high GPU-to-CPU ratio. While an accelerator-rich machine reduces the total number of compute nodes required to achieve a performance target, a single node becomes vulnerable to accelerator failures as well as congested intra-node resources. Preserving a large amount of local state within accelerators for checkpointing incurs significant overheads, so node-level resilience faces a new challenge as the accelerator density of HPC systems grows. I apply CDs to isolate and recover from GPU failures in HPC CUDA applications (CD-CUDA). The extension of CDs to CUDA programs makes it possible to express logical domains at kernel boundaries. CD-CUDA improves system-level resilience efficiency compared to host-only CDs by containing GPU failures. Furthermore, I propose and evaluate a hardware component that resolves the bursty device-local preservation traffic within a node, a new challenge as GPU density grows in the system.
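A containment domain drawn at a kernel boundary, in the spirit of CD-CUDA, can be sketched as follows: preserve what the kernel may overwrite before launch, and on a detected device failure restore and re-launch locally, without rolling back any host-side domain. The "device" here is simulated in plain C++, and every name is hypothetical rather than CD-CUDA's actual interface.

```cpp
// Kernel-boundary containment domain in the spirit of CD-CUDA (the device
// is simulated in plain C++; all names are hypothetical). Buffers the kernel
// may overwrite are preserved before launch; a detected device failure
// restores them and re-launches, without rolling back host-side domains.
#include <cstdio>
#include <vector>

using Buffer = std::vector<float>;

// Stand-in for a CUDA kernel launch; returns false to model a detected
// device failure (e.g., an uncorrectable memory error).
bool launch_saxpy(Buffer& y, const Buffer& x, float a, bool inject_fault) {
    if (inject_fault) return false;
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
    return true;
}

int main() {
    Buffer x(4, 1.0f), y(4, 2.0f);
    bool inject = true;  // one simulated, transient GPU failure

    // --- containment domain drawn at the kernel boundary ---
    Buffer y_preserved = y;  // preserve only what the kernel writes
    while (!launch_saxpy(y, x, 3.0f, inject)) {
        std::puts("GPU failure contained: restoring buffer, re-launching");
        y = y_preserved;     // device-local recovery
        inject = false;      // transient fault does not recur
    }
    // --- domain complete: host state never rolled back ---

    for (float v : y) std::printf("%.1f ", v);  // expect 5.0 x4
    std::printf("\n");
}
```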