Compiler-assisted staggered checkpointing

dc.contributor.advisor: Lin, Yun Calvin [en]
dc.contributor.committeeMember: Choi, Sung-Eun [en]
dc.contributor.committeeMember: Alvisi, Lorenzo [en]
dc.contributor.committeeMember: McKinley, Kathryn S. [en]
dc.contributor.committeeMember: Pingali, Keshav [en]
dc.creator: Norman, Alison Nicholas [en]
dc.date.accessioned: 2010-11-23T22:28:19Z [en]
dc.date.available: 2010-11-23T22:28:19Z [en]
dc.date.available: 2010-11-23T22:28:25Z [en]
dc.date.issued: 2010-08 [en]
dc.date.submitted: August 2010 [en]
dc.date.updated: 2010-11-23T22:28:25Z [en]
dc.description: text [en]
dc.description.abstract: To make progress in the face of failures, long-running parallel applications need to save their state in what is known as a checkpoint. Unfortunately, current checkpointing techniques are becoming untenable on large-scale supercomputers. Many applications checkpoint all processes simultaneously, a technique that is easy to implement but often saturates the network and file system and significantly increases checkpoint overhead. This thesis introduces compiler-assisted staggered checkpointing, where processes checkpoint at different places in the application text, thereby reducing contention for the network and file system. This checkpointing technique is algorithmically challenging because the number of possible solutions is enormous and the number of desirable solutions is small, but we have developed a compiler algorithm that both places staggered checkpoints in an application and ensures that the solution is desirable. This algorithm successfully places staggered checkpoints in parallel applications configured to use tens of thousands of processes. For our benchmarks, across all configurations, the algorithm finds and places useful recovery lines that are up to 37% faster than recovery lines in which all processes write their data at approximately the same time. We also analyze the success of staggered checkpointing by investigating the sets of application and system characteristics for which it reduces network and file system contention. We find that for many configurations, staggered checkpointing reduces both checkpointing time and overall execution time. To perform these analyses, we develop an event-driven simulator for large-scale systems that estimates the behavior of the network, global file system, and local hardware using predictive models. Our simulator allows us to accurately study applications that have thousands of processes; on average, it predicts execution times that are 83% of their measured values. [en]
dc.description.department: Computer Sciences [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: http://hdl.handle.net/2152/ETD-UT-2010-08-1746 [en]
dc.language.iso: eng [en]
dc.subject: Supercomputing [en]
dc.subject: Checkpointing [en]
dc.subject: Simulator [en]
dc.subject: Large-scale parallel applications [en]
dc.title: Compiler-assisted staggered checkpointing [en]
dc.type.genre: thesis [en]
thesis.degree.department: Computer Sciences [en]
thesis.degree.discipline: Computer Sciences [en]
thesis.degree.grantor: University of Texas at Austin [en]
thesis.degree.level: Doctoral [en]
thesis.degree.name: Doctor of Philosophy [en]

Access full-text files

Original bundle

Name: NORMAN-DISSERTATION.pdf
Size: 1.02 MB
Format: Adobe Portable Document Format
