A new approach to detecting failures in distributed systems

Leners, Joshua Blaise

A new approach to detecting failures in distributed systems

Access full-text files

LENERS-DISSERTATION-2015.pdf (690.09 KB)

Date

2015-08

Authors

Leners, Joshua Blaise

Abstract

Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that process has failed. Second, choosing a timeout is hard: short timeouts can exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value—one that balances the choice between accuracy and speed—may not even exist, owing to the variance in a system’s end-to-end delays. ƃis dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS’s process table. We call such information inside information, and use it as the basis in the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed by sometimes terminating unresponsive components. ƃis choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, and reports inside information to applications directly and with more detail to help applications make better recovery decisions. By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude over the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.