A new approach to detecting failures in distributed systems

Leners, Joshua Blaise

dc.contributor.advisor	Alvisi, Lorenzo	en
dc.contributor.committeeMember	Aguilera, Marcos K	en
dc.contributor.committeeMember	Shmatikov, Vitaly	en
dc.contributor.committeeMember	Walfish, Michael	en
dc.contributor.committeeMember	Witchel, Emmett	en
dc.creator	Leners, Joshua Blaise	en
dc.creator.orcid	0000-0002-5937-3237	en
dc.date.accessioned	2015-09-18T16:43:28Z	en
dc.date.issued	2015-08	en
dc.date.submitted	August 2015	en
dc.date.updated	2015-09-18T16:43:29Z	en
dc.description	text	en
dc.description.abstract	Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that process has failed. Second, choosing a timeout is hard: short timeouts can exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value, one that balances the choice between accuracy and speed, may not even exist, owing to the variance in a system’s end-to-end delays. This dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS’s process table. We call such information inside information, and use it as the basis in the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed by sometimes terminating unresponsive components. This choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, and reports inside information to applications directly and with more detail to help applications make better recovery decisions. 
By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude over the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.	en
dc.description.department	Computer Sciences	en
dc.format.mimetype	application/pdf	en
dc.identifier.uri	http://hdl.handle.net/2152/31376	en
dc.language.iso	en	en
dc.subject	Computer science	en
dc.subject	Fault tolerance	en
dc.subject	Distributed systems	en
dc.subject	Failure detection	en
dc.title	A new approach to detecting failures in distributed systems	en
dc.type	Thesis	en
thesis.degree.department	Computer Sciences	en
thesis.degree.discipline	Computer science	en
thesis.degree.grantor	The University of Texas at Austin	en
thesis.degree.level	Doctoral	en
thesis.degree.name	Doctor of Philosophy	en

Access full-text files

Original bundle

Name: LENERS-DISSERTATION-2015.pdf
Size: 690.09 KB
Format: Adobe Portable Document Format

License bundle

Name: LICENSE.txt
Size: 1.84 KB
Format: Plain Text

Collections

UT Electronic Theses and Dissertations