Techniques for analyzing distributed computations
Access full-text files
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Inherent non-determinism in distributed programs and presence of multiple threads of control makes it difficult to write correct distributed software. Not surprisingly, distributed systems are particularly vulnerable to software faults. To build a distributed system capable of tolerating software faults, two important problems need to be addressed: fault detection and fault recovery. The fault detection problem requires finding a (consistent) global state of the computation that satisfies certain predicate (e.g., violation of mutual exclusion). To prevent a fault from causing any serious damage such as corrupting stable storage, it is essential that it be detected in a timely manner. However, we prove that detecting a predicate in 2-CNF, even when no two clauses contain variables from the same process, is an NP-complete problem. We develop a technique, based on computation slicing, to reduce the size of the computation and thus the number of global states to be examined for detecting a predicate. Slicing can be used to throw away the extraneous global states of the computation in an efficient manner, and focus on only those that are currently relevant for our purpose. To detect a fault, therefore, rather than searching the state-space of the computation, it is much more efficient to search the state-space of the slice. We identify several useful classes of predicates for which the slice can be computed efficiently. Our experimental results indicate that slicing can lead to an exponential reduction over existing techniques both in terms of time as well as space for fault detection. To recover from faults, we consider rollback recovery approach, which involves restoring the system to a previous state and then re-executing. We focus on rollback recovery using controlled re-execution, which is useful and effective for tolerating synchronization faults. Unlike other approaches which depend on chance and do not ensure that the re-execution is fault-free, the controlled re-execution method avoids synchronization faults during re-execution in a deterministic fashion. Specifically, it selectively adds synchronization dependencies during re-execution to ensure that the previously detected synchronization faults do not occur again. We provide efficient algorithms to solve the problem for two important classes of synchronization faults.