A new approach to detecting failures in distributed systems

dc.contributor.advisor: Alvisi, Lorenzo (en)
dc.contributor.committeeMember: Aguilera, Marcos K (en)
dc.contributor.committeeMember: Shmatikov, Vitaly (en)
dc.contributor.committeeMember: Walfish, Michael (en)
dc.contributor.committeeMember: Witchel, Emmett (en)
dc.creator: Leners, Joshua Blaise (en)
dc.creator.orcid: 0000-0002-5937-3237 (en)
dc.date.accessioned: 2015-09-18T16:43:28Z (en)
dc.date.issued: 2015-08 (en)
dc.date.submitted: August 2015 (en)
dc.date.updated: 2015-09-18T16:43:29Z (en)
dc.description: text (en)
dc.description.abstract: Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that process has failed. Second, choosing a timeout is hard: short timeouts can exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value, one that balances the choice between accuracy and speed, may not even exist, owing to the variance in a system’s end-to-end delays. This dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS’s process table. We call such information inside information, and use it as the basis in the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed by sometimes terminating unresponsive components. This choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, and reports inside information to applications directly and with more detail to help applications make better recovery decisions.
By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude over the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary. (en)
dc.description.department: Computer Sciences (en)
dc.format.mimetype: application/pdf (en)
dc.identifier.uri: http://hdl.handle.net/2152/31376 (en)
dc.language.iso: en (en)
dc.subject: Computer science (en)
dc.subject: Fault tolerance (en)
dc.subject: Distributed systems (en)
dc.subject: Failure detection (en)
dc.title: A new approach to detecting failures in distributed systems (en)
dc.type: Thesis (en)
thesis.degree.department: Computer Sciences (en)
thesis.degree.discipline: Computer science (en)
thesis.degree.grantor: The University of Texas at Austin (en)
thesis.degree.level: Doctoral (en)
thesis.degree.name: Doctor of Philosophy (en)

Access full-text files

Original bundle

Name: LENERS-DISSERTATION-2015.pdf
Size: 690.09 KB
Format: Adobe Portable Document Format
