Browsing by Subject "Fault tolerance"
Now showing 1 - 10 of 10
Item Active replication vs. fusion as fault tolerance mechanisms (2016-05) Boyd, Jeremy J.; Garg, Vijay K. (Vijay Kumar), 1963-; Aziz, Adnan
This report compares two strategies for crash fault tolerance of nodes in distributed systems: active replication and fusion. To tolerate f crash faults, active replication maintains f backup servers for each primary. Fusion, by contrast, maintains a set of f backup servers that contain the replicated data for all primaries in coded form. If n primaries each contain m data to be backed up, then active replication requires O(nmf) space, while fusion requires only O(mf) space. These savings come at the cost of additional time during recovery, due to extra messages and computation. For this report, we implemented an application in which primary nodes maintain increasingly large data structures and periodically crash. Both active replication and fusion are evaluated as recovery mechanisms for the crashed nodes, across three metrics: backup size, time spent updating the backup, and recovery time. Our results validate theoretical expectations put forward in the literature: fusion achieves significant space savings at the cost of high recovery time. In the most extreme measured case, fusion costs approximately 83% of the space that replication does, while recovery is three orders of magnitude more expensive in fusion (3.4 s) than in replication (0.0037 s). However, we also find that the gap between fusion and replication grows as nodes are added to the system.
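The coded-backup idea behind fusion can be sketched with a toy example, assuming the simplest case of f = 1 and an XOR code (hypothetical function names; not the report's implementation): active replication would store a full copy per primary, while fusion stores one coded backup shared by all primaries.

```python
# Toy sketch of fusion with f = 1: a single XOR-coded backup covers all
# n primaries (O(m) space), versus one full copy per primary for active
# replication (O(n * m) space).

def xor_fuse(blocks):
    """XOR equal-length byte strings together into one fused backup."""
    fused = bytearray(len(blocks[0]))
    for data in blocks:
        for i, b in enumerate(data):
            fused[i] ^= b
    return fused

def recover(fused, surviving):
    """Rebuild the single crashed primary's data from the fused backup."""
    return xor_fuse([fused] + surviving)

primaries = [b"aaaa", b"bbbb", b"cccc"]   # n = 3 primaries, m = 4 bytes each
backup = xor_fuse(primaries)              # fusion: one m-byte backup
lost = recover(backup, [b"aaaa", b"cccc"])  # primary 2 crashed
assert bytes(lost) == b"bbbb"
```

Recovery here touches every surviving primary, which hints at why fusion trades recovery time for space.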
We further find that fusion performs more consistently in update time, owing to the high variability of multicast within active replication systems.
Item Bit error tolerance of Deep Neural Network accelerators (2020-05-07) Radhakrishnan, Vignesh; Touba, Nur A.
The resurgence of machine learning in various applications, and its inherently compute-intensive nature, requires hardware accelerators in edge devices. The underlying process technology is prone to faults, so these hardware accelerators must be made dependable. Deep Convolutional Neural Networks perform well for machine learning applications such as image classification. This report presents the impact of bit errors on DNN performance. Most accelerators are designed with a single data type intended to fit all workloads; here, the sensitivity of DNNs using the single-precision floating-point format is studied. Because deep layers are highly sensitive to critical bit errors and performance degrades rapidly with increasing BER, several potential techniques to actively improve fault tolerance are discussed.
Item Exploring scaling limits and computational paradigms for next generation embedded systems (2009-12) Zykov, Andrey V.; De Veciana, Gustavo
It is widely recognized that device and interconnect fabrics at the nanoscale will be characterized by a higher density of permanent defects and increased susceptibility to transient faults. This appears to be intrinsic to nanoscale regimes and fundamentally limits the eventual benefits of increased device density: the overheads associated with achieving fault tolerance may counter the benefits of increased density, a density-reliability tradeoff. At the same time, as devices scale down, one can expect a higher proportion of area to be associated with interconnection, i.e., area is wire dominated. In this work we theoretically explore density-reliability tradeoffs in wire-dominated integrated systems.
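The flavor of a density-reliability tradeoff can be illustrated with the classical triple modular redundancy (TMR) formula, which is an assumption here rather than this work's actual area-scaling model: triplicating a module of reliability R and voting gives R_tmr = 3R^2 - 2R^3, so the roughly 3x area cost only improves reliability when R > 0.5.

```python
# Illustrative only: classical TMR reliability (perfect voter assumed).
# The ~3x device cost helps only when the base module is already more
# reliable than a coin flip, i.e., R > 0.5.

def tmr_reliability(r):
    """Probability that at least 2 of 3 modules of reliability r agree."""
    return 3 * r**2 - 2 * r**3

assert tmr_reliability(0.9) > 0.9   # reliable devices: redundancy pays off
assert tmr_reliability(0.4) < 0.4   # unreliable devices: redundancy hurts
```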
We derive an area-scaling model based on simple assumptions capturing the salient features of hierarchical design for high-performance systems, along with first-order assumptions on reliability, wire area, and wire length across hierarchical levels. We then evaluate the overheads of applying basic fault-tolerance techniques at different levels of the design hierarchy. This admittedly simplified model allows us to tackle several interesting theoretical questions: (1) When does it make sense to use smaller, less reliable devices? (2) At what scale of the design hierarchy should fault tolerance be applied in high-performance integrated systems? In the second part of this thesis we explore perturbation-based computational models as a promising choice for implementing next-generation ubiquitous information technology on unreliable nanotechnologies. We show the inherent robustness of such computational models to high defect densities and performance uncertainty, which, combined with low manufacturing-precision requirements, makes them particularly suitable for emerging nanoelectronics. We propose a hybrid eNano-CMOS perturbation-based computing platform relying on a new style of configurability that exploits the computational model's unique form of unstructured redundancy. We consider the practicality and scalability of perturbation-based computational models by developing and assessing initial foundations for engineering such systems. Specifically, new design and decomposition principles exploiting task-specific contextual and temporal scales are proposed and shown to substantially reduce complexity for several benchmark tasks. Our results provide strong evidence for the relevance and potential of this class of computational models when targeted at emerging unreliable nanoelectronics.
Item Fault tolerance in distributed systems : a coding-theoretic approach (2012-08) Balasubramanian, Bharath; Garg, Vijay K.
(Vijay Kumar), 1963-; Chase, Craig M.; Julien, Christine L.; Plaxton, Greg; Touba, Nur A.; Vishwanath, Sriram
Distributed systems are rapidly increasing in importance due to the need for scalable computation on huge volumes of data. This fact is reflected in many real-world distributed applications, such as Amazon's EC2 cloud computing service, Facebook's Cassandra key-value store, and Apache's Hadoop MapReduce framework. Multi-core architectures developed by companies such as Intel and AMD have further brought this to prominence, since workloads can now be distributed across many individual cores. The nodes in such systems are often built from commodity hardware and are prone to physical failures and security vulnerabilities. Achieving fault tolerance in such systems is challenging, since it is not easy to observe and control these distributed entities. Replication is the standard approach to fault tolerance in distributed systems. Its main advantage is that the backups incur very little overhead in normal operation or recovery time. However, replication is grossly wasteful in the number of backups required, which has two major implications. First, the total space or memory required for fault tolerance is considerably high. Second, there is a significant resource cost, such as the power required to run the backup processes. Given the large number of distributed servers employed in real-world applications, it is hard to provide fault tolerance while achieving both space and operational efficiency. In the world of data fault tolerance and communication, coding theory is the space-efficient alternative to replication. A direct application of coding theory to distributed servers, treating the servers as blocks of data, is very inefficient in terms of updates to the backups.
This is primarily because each update to a server affects many blocks in memory, all of which have to be re-encoded at the backups. This leads us to the following thesis statement: Can we design a mechanism for fault tolerance in distributed systems that combines the space efficiency of coding theory with the low operational overhead of replication? We present a new paradigm to solve this problem, broadly referred to as fusion. We provide fusion-based solutions for two models of computation that are representative of a large class of applications: (i) systems modeled as deterministic finite state machines and (ii) systems modeled as programs containing data structures. For finite state machines, we use the notion of Hamming distances to present a polynomial-time algorithm that generates efficient backup state machines. For programs hosting data structures, we use a combination of erasure codes and selective replication to generate efficient backups for the most commonly used data structures, such as queues, array lists, linked lists, vectors, and maps. We present theoretical and experimental results that demonstrate the efficiency of our schemes over replication. Finally, we use our schemes to design an efficient fault-tolerance solution for two real-world applications: Amazon's Dynamo key-value store and Google's MapReduce framework.
Item Fusion-based Hadoop MapReduce job for fault tolerance in distributed systems (2013-05) Ho, Iat-Kei; Garg, Vijay K. (Vijay Kumar), 1963-
The standard recovery solution for a failed task in Hadoop systems is to execute the task again; after retrying a configured number of times, the task is marked as failed. With significant amounts of data and complicated Map and Reduce functions, recovering corrupted or unfinished data from a failed job can be more efficient than re-executing the same job. This paper extends [1] by applying fusion-based techniques [7][8] to Hadoop MapReduce task execution to enhance its fault tolerance.
Multiple data sets are executed through Hadoop MapReduce, with and without fusion, in various pre-defined failure scenarios for comparison. As the complexity of the Map and Reduce functions grows relative to the Recover function, fusion becomes more efficient, allowing users to tolerate faults at a cost of less than ten percent extra execution time.
Item Nearly free resilient memory architectures that balance resilience, performance, and cost (2017-08-29) Kim, Dong Wan; Erez, Mattan; Touba, Nur A.; Fussell, Donald S.; Reddi, Vijay Janapa; Tsai, Timothy K.
Memory reliability has been a major design constraint for mission-critical and large-scale systems for many years. Continued innovation is still necessary because the rate of faults, and of the errors they lead to, grows with system size, and because some faults become more likely as fabrication technology advances. Furthermore, recent field studies have shown that severe permanent/intermittent and multi-bit faults are roughly as frequent as single-bit and transient ones. Therefore, strong error checking and correcting (ECC) schemes that can correct multi-bit errors have been developed and are in use. However, using ECC to correct the numerous recurring errors from permanent faults forces a tradeoff among cost, performance, and reliability. First, a permanent fault is likely to result in numerous erroneous accesses, each requiring possibly high correction overhead. Second, once redundancy is spent on correction, further errors may go uncorrected, leading to data loss, which is called a detected uncorrectable error (DUE), or worse, go undetected and result in silent data corruption (SDC). Stronger ECC can tolerate more errors, but at higher overhead.
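The strength/overhead tradeoff of ECC can be made concrete with a toy Hamming(7,4) single-error corrector. This is illustrative only, not the dissertation's scheme; production SECDED codes extend Hamming with an overall parity bit so that double-bit errors are also detected.

```python
# Toy Hamming(7,4): 3 parity bits protect 4 data bits and allow any single
# flipped bit to be located and corrected via the syndrome.

def encode(d):                      # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def correct(c):                     # c: 7-bit codeword with <= 1 bit flipped
    s = 0
    for i, bit in enumerate(c, start=1):
        if bit:
            s ^= i                  # syndrome = XOR of set bit positions
    if s:                           # nonzero syndrome names the error position
        c[s - 1] ^= 1
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = [1, 0, 1, 1]
cw = encode(word)
cw[4] ^= 1                          # inject a single bit error
assert correct(cw) == word
```

Note the overhead: 3 check bits per 4 data bits, which is exactly the kind of cost that stronger multi-bit codes multiply.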
The straightforward solution to repeated costly corrections and reduced coverage is to replace faulty memory devices; however, doing so is expensive and requires either increased system downtime or increased storage and bandwidth overheads. An economical alternative is to retire, and possibly remap, just the faulty memory regions. Existing retirement techniques, however, either require sophisticated software support; impact capacity, reliability, and/or performance; or introduce additional storage and hardware structures. Implementing strong ECC such as Single Device Data Correction (SDDC, or chipkill-level) ECC is typically expensive in storage and complexity, and it is especially challenging in emerging high-bandwidth memories such as HBM2. This is because in HBM2, for instance, a single ECC codeword is transferred from one memory device, so simply adding a redundant device incurs high overhead in storage, energy, and bandwidth. Such wide-data-width memories are nevertheless widely used in graphics processing units (GPUs) and Intel's Xeon Phi processors (e.g., Knights Landing) to exploit high memory bandwidth. As GPUs are popular building blocks for large-scale high-performance systems, improving the resilience of GPU memory systems has become an important issue, but current GPU memory ECC is limited to single-bit error correction, double-bit error detection (SECDED). To improve GPU memory resilience further, multiple techniques must therefore be coordinated. One interesting addition is a software-driven memory repair technique that retires affected pages with virtual memory support, reducing the risk of uncorrected or even undetected memory errors in the future. Because the underlying ECC code is weak, however, a DUE cannot always be safely avoided after the fact; the system therefore needs to proactively retire susceptible memory blocks to head off potential system failures.
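As a rough illustration of the retirement idea (hypothetical names and threshold, not the dissertation's mechanism), a minimal error-count-driven page retirer tracks corrected errors per page and remaps pages whose recurring errors suggest a permanent fault:

```python
# Hypothetical sketch of ECC-error-driven page retirement: a page whose
# corrected-error count crosses a threshold is presumed permanently faulty
# and is remapped to a spare page before it can cause a DUE.
from collections import defaultdict

RETIRE_THRESHOLD = 3                 # hypothetical policy knob

class PageRetirer:
    def __init__(self, spare_pages):
        self.errors = defaultdict(int)
        self.remap = {}              # faulty page -> spare page
        self.spares = list(spare_pages)

    def on_corrected_error(self, page):
        self.errors[page] += 1
        if self.errors[page] >= RETIRE_THRESHOLD and page not in self.remap:
            self.remap[page] = self.spares.pop()   # retire and remap

    def translate(self, page):
        return self.remap.get(page, page)

r = PageRetirer(spare_pages=[0x7000])
for _ in range(3):
    r.on_corrected_error(0x1000)     # recurring errors on the same page
assert r.translate(0x1000) == 0x7000 # retired page now served by the spare
assert r.translate(0x2000) == 0x2000 # healthy pages are untouched
```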
In this dissertation, I develop strong, low-cost memory fault tolerance mechanisms that improve both system resilience and availability without wasting resources, increasing memory access complexity, or compromising the fault tolerance of existing resilience schemes. I first identify two interesting characteristics of DRAM failures in the field: permanent faults are as frequent as transient faults, and most faults affect small memory regions. Based on this analysis of DRAM failure patterns, I propose and evaluate two novel hardware-only memory repair mechanisms that significantly improve memory system reliability without compromising performance or increasing overhead. I also develop a strong, low-cost GPU memory fault tolerance mechanism based on three insights. First, ECC for GPUs should not interfere with the high-bandwidth DRAM system, which may use just one DRAM device for each data access. Second, the way GPUs are used as accelerators offers unique opportunities to tolerate even severe memory errors without relying on SDDC ECC. Third, nearly all permanent memory faults can be repaired with low-overhead techniques. Based on these observations, I propose and evaluate a multi-tier, strong GPU memory fault tolerance mechanism in which the techniques at each level work closely together to significantly improve accelerator memory system resilience at low overhead.
Item A new approach to detecting failures in distributed systems (2015-08) Leners, Joshua Blaise; Alvisi, Lorenzo; Aguilera, Marcos K.; Shmatikov, Vitaly; Walfish, Michael; Witchel, Emmett
Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but timeouts bring problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that the process has failed.
Second, choosing a timeout is hard: short timeouts exacerbate the problem of inaccuracy, and long timeouts make the system wait unnecessarily. In fact, a good timeout value, one that balances accuracy against speed, may not even exist, owing to the variance in a system's end-to-end delays. This dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS's process table. We call such information inside information, and we use it as the basis for the design and implementation of three failure reporting services for data center applications: Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed, by sometimes terminating unresponsive components. This choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures, because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether and reports inside information to applications directly, and in more detail, to help applications make better recovery decisions. By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude compared to the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures.
Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.
Item Replicating multithreaded services (2014-12) Kapritsos, Emmanouil; Alvisi, Lorenzo
For the last 40 years, the systems community has invested a great deal of effort in designing techniques for building fault-tolerant distributed systems and services. This effort has produced a massive list of results: the literature describes how to design replication protocols that tolerate a wide range of failures (from simple crashes to malicious "Byzantine" failures) in a wide range of settings (e.g., synchronous or asynchronous communication, with or without stable storage), optimizing various metrics (e.g., number of messages, latency, throughput). These techniques have their roots in ideas, such as the abstraction of State Machine Replication and the Paxos protocol, that were conceived when computing was very different from what it is today: computers had a single core; all processing was done with a single thread of control, handling requests sequentially; and a collection of 20 nodes was considered a large distributed system. In the last decade, however, computing has gone through major paradigm shifts with the advent of multicore architectures and large cloud infrastructures. This dissertation explains how these profound changes affect the practical usefulness of traditional fault-tolerance techniques and proposes new ways to architect these solutions to fit the new paradigms.
Item Safe machine learning accelerator and interconnect design (2019-08) Xu, Zheng (Ph. D. in electrical and computer engineering); Abraham, Jacob A.; Touba, Nur A.; Orshansky, Michael E.; Gerstlauer, Andreas M.; Kanawati, Ghani
Machine Learning (ML) accelerators have recently grown into prominence, offering significant power-performance efficiency improvements over CPUs and GPUs.
Network-on-chip (NoC) interconnects were proposed to deliver fast, reliable, and scalable communication between on-chip IPs, including CPUs, GPUs, and hardware accelerators. Both the ML accelerator and the NoC are critical components of heterogeneous computing system design. At the same time, aggressive technology scaling poses new problems of permanent and transient faults. Functional safety is the top priority for automotive and other mission-critical system designs: safety designs demand high error-detection coverage, with a safe state reached within the Fault Tolerant Time Interval (FTTI). As more functionality and performance features are packed into a safety chip design, die size grows and power consumption becomes a limiting factor in meeting system thermal requirements. Various Concurrent Error Detection (CED) techniques have been proposed to detect permanent and transient errors in hardware designs, with differing strengths and capabilities that can be characterized by a set of defined metrics, including coverage, latency, localization, and Power, Performance, and Area (PPA) efficiency. However, few of them are proven or applicable for safety designs with stringent error-detection coverage requirements. Therefore, Dual Modular Redundancy (DMR) is still commonly used in industry: it provides robust error detection, but with a large area and power penalty. In this dissertation, we study area- and power-efficient safety designs for robust CED and error recovery in ML accelerators and NoC interconnects. In Chapters 2 and 3, we propose algorithm-based error checking and correction techniques for a Convolutional Neural Network (CNN) accelerator and demonstrate that we can achieve a high Diagnostic Coverage (DC) safety goal with minimal area, power, and run-time error recovery overhead, and without performance degradation.
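Algorithm-based error checking of the kind mentioned for the CNN accelerator is commonly illustrated (though this is not necessarily the dissertation's exact scheme) with checksum-augmented matrix multiplication: for a convolution lowered to C = A · B, the column sums of C must equal (column sums of A) · B, so one extra vector-matrix product checks the much larger computation.

```python
# Illustrative algorithm-based fault tolerance (ABFT) check for C = A @ B:
# verifying that colsum(C) == colsum(A) @ B detects computational errors
# in C at the cost of a single extra vector-matrix product.

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def abft_check(A, B, C):
    col_sum_A = [sum(col) for col in zip(*A)]          # 1 x k checksum row
    expected = [sum(col_sum_A[t] * B[t][j] for t in range(len(B)))
                for j in range(len(B[0]))]
    actual = [sum(col) for col in zip(*C)]             # column sums of C
    return expected == actual

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
assert abft_check(A, B, C)
C[0][0] += 1                  # inject a computational error
assert not abft_check(A, B, C)
```

The check costs O(k·m) extra work against the O(n·k·m) multiplication, which is why such schemes can reach high coverage with minimal overhead.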
In Chapter 4, we develop a new partition-optimized, packet-level run-time error detection technique for the NoC network, and in Chapter 5, we apply a combination of partial duplication, Error Detection Codes (EDC), and invariance-checking techniques to the Network Interface (NI) to meet a high DC safety requirement with minimal power and area overhead.
Item UpRight fault tolerance (2010-12) Clement, Allen Grogan; Alvisi, Lorenzo; Dahlin, Mike; Druschel, Peter; Walfish, Michael; Witchel, Emmett
Experiences with computer systems indicate an inconvenient truth: computers fail, and they fail in interesting ways. Although using redundancy to protect against fail-stop failures is common practice, non-fail-stop computer and network failures occur for a variety of reasons, including power outages, disk or memory corruption, NIC malfunction, user error, operating system and application bugs, misconfiguration, and many others. The impact of these failures can be dramatic, ranging from service unavailability to stranding airplane passengers on the runway to companies closing. While high-stakes embedded systems have embraced Byzantine fault tolerant techniques, general-purpose computing continues to rely on techniques that are fundamentally crash tolerant. In a general-purpose environment, the current best-practice response to non-fail-stop failures can charitably be described as pragmatic: identify a root cause and add checksums to prevent that error from happening again. Pragmatic responses have proven effective for patching holes and protecting against faults once they have occurred; unfortunately, the initial damage has already been done, and it is difficult to say whether the patches made to address previous faults will protect against future failures. We posit that an end-to-end solution based on Byzantine fault tolerant (BFT) state machine replication is an efficient and deployable alternative to the ad hoc approaches currently favored in general-purpose computing.
The replicated state machine approach ensures that multiple copies of the same deterministic application execute requests in the same order, providing end-to-end assurance that independent transient failures will not lead to unavailability or incorrect responses. An efficient and effective end-to-end solution covers faults that have already been observed as well as failures that have not yet occurred, and it provides structural confidence that developers won't have to track down yet another failure caused by some unpredicted memory, disk, or network behavior. While the promise of end-to-end failure protection is intriguing, significant technical and practical challenges currently prevent adoption in general-purpose computing environments. On the technical side, it is important that end-to-end solutions maintain the performance characteristics of deployed systems: if they dramatically increase computing requirements, dramatically reduce throughput, or dramatically increase latency during normal operation, then end-to-end techniques are a non-starter. On the practical side, it is important that end-to-end approaches be both comprehensible and easy to incorporate: if the cost of adoption is rewriting an application or trusting intricate and arcane protocols, then end-to-end solutions will not be adopted. In this thesis we show that BFT state machine replication can be used in deployed systems. Reaching this goal requires us to address both the technical and practical challenges mentioned above. We revisit disparate research results from the last decade and tweak, refine, and revise the core ideas so that they fit together into a coherent whole. Addressing the practical concerns requires us to simplify the process of incorporating BFT techniques into legacy applications.
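The replicated state machine approach described above can be sketched as a toy model (hypothetical names, a trivial request language, and simple majority voting; this is not UpRight's actual protocol): each replica applies the same deterministic requests in the same order, and the client accepts the reply vouched for by a majority, which masks a faulty replica's answer.

```python
# Toy state machine replication: deterministic replicas apply the same
# ordered requests; a majority vote over the replies masks a faulty reply.
from collections import Counter

class Replica:
    def __init__(self):
        self.state = 0

    def apply(self, request):            # deterministic state transition
        op, arg = request
        if op == "add":
            self.state += arg
        return self.state

def client_call(replicas, request, faulty=()):
    replies = []
    for i, rep in enumerate(replicas):
        r = rep.apply(request)
        replies.append(r + 99 if i in faulty else r)   # fault injection
    value, votes = Counter(replies).most_common(1)[0]
    assert votes > len(replicas) // 2, "no majority"
    return value

replicas = [Replica() for _ in range(3)]
assert client_call(replicas, ("add", 5)) == 5
assert client_call(replicas, ("add", 2), faulty={1}) == 7   # bad reply masked
```

Determinism is the load-bearing assumption here, which is exactly why multithreaded execution (as the Kapritsos dissertation above notes) complicates the classical recipe.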