Resilient heterogeneous systems with Containment Domains
Abstract
Resilience is a continuing concern for extreme-scale scientific applications. Tolerating ever-increasing hardware fault rates demands a scalable end-to-end resilience scheme. The fundamental issue with current system-wide techniques, such as checkpoint-restart, is their one-size-fits-all approach, which recovers local failures globally. The challenge of supporting efficient resilience grows at scale with the trend toward accelerators. Exploiting resilience tailored to an application offers a potential breakthrough that enables efficient localized recovery, because an individual node maintains a low failure rate even at scale. I propose a framework realizing Containment Domains (CDs) that addresses the resilience challenges of future-scale heterogeneous systems. My dissertation consists of two parts: tackling the resilience problem for CPU-only systems with CDs, and extending CDs to systems with GPUs.

In the first part, I develop the CDs framework and adapt CD-based resilience to real-world applications to verify its analytical model and show its feasibility. CDs elevate resilience to a first-class abstraction and exploit application properties to hierarchically decompose applications into local domains that contain errors. Confining the range of errors within such logical domains in a program enables localized recovery. The CDs framework validates the analytical model of CDs: the model matches the trend of the efficiency results measured by running CD-enabled applications under error injection. Based on the analytical model, I develop an automated workflow that tunes Containment Domains by leveraging differing failure likelihoods, the storage hierarchy, and application characteristics. CD-based resilience, as estimated by the analytical model, projects higher efficiency than the state of the art and promises scalability toward exascale computing.

In the second part of the dissertation, I extend CDs to CUDA applications on high-performance computing (HPC) systems with GPUs. GPUs offer higher computational power at lower energy and cost than homogeneous CPU-only nodes, and heterogeneous nodes in modern HPC systems trend toward a high GPU-to-CPU ratio. While an accelerator-rich machine reduces the total number of compute nodes required to reach a performance target, a single node becomes vulnerable to accelerator failures as well as congested intra-node resources, and preserving the large amount of local state within accelerators for checkpointing incurs significant overhead. Node-level resilience thus faces a new challenge as the accelerator density of HPC systems grows. I apply CDs to isolate and recover GPU failures in HPC CUDA applications (CD-CUDA). The extension of CDs to CUDA programs allows logical domains to be expressed at the kernel boundary. CD-CUDA improves system-level resilience efficiency over host-only CDs by containing GPU failures. Furthermore, I propose and evaluate a hardware component that resolves the bursty device-local preservation traffic within a node, a new challenge that emerges as GPU density grows.
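To make the containment-domain idea concrete, the following minimal sketch (plain C++, using a hypothetical helper named run_in_domain for illustration rather than the framework's actual API) shows how an application can be decomposed into nested domains, where each domain preserves only the state it needs on entry and, on a detected error, restores that state and re-executes locally instead of forcing a global rollback.

#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical helper (illustration only, not the CDs API): run `body` inside
// a domain that preserves `state` on entry and retries locally on failure.
bool run_in_domain(const char* name, std::vector<double>& state,
                   const std::function<bool(std::vector<double>&)>& body,
                   int max_retries = 3) {
  std::vector<double> preserved = state;   // preservation: copy state to safe storage
  for (int attempt = 0; attempt < max_retries; ++attempt) {
    if (body(state)) return true;          // domain completed without error
    std::cerr << name << ": error detected, recovering locally\n";
    state = preserved;                     // restore preserved state, re-execute this domain only
  }
  return false;                            // unrecoverable here: escalate to the parent domain
}

int main() {
  std::vector<double> field(1 << 20, 1.0);

  // Parent domain: one time step of a solver.
  bool ok = run_in_domain("timestep", field, [](std::vector<double>& f) {
    // Child domain: one phase of the step; an error here is contained and
    // recovered without re-running the whole time step.
    return run_in_domain("stencil_phase", f, [](std::vector<double>& g) {
      for (std::size_t i = 1; i + 1 < g.size(); ++i)
        g[i] = 0.5 * (g[i - 1] + g[i + 1]);
      return true;  // a real error detector would return false on a fault
    });
  });

  std::cout << (ok ? "completed" : "unrecovered error") << "\n";
  return 0;
}

In CD-CUDA, the same pattern would map a domain onto a kernel launch: device buffers are preserved before the launch, and on a detected GPU fault the preserved state is restored and the kernel re-launched, containing the failure within the node.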