Diagnosing and tolerating bugs in deployed systems

Bond, Michael David

Diagnosing and tolerating bugs in deployed systems

Access full-text files

bondm80767.pdf (2.26 MB)

Date

2008-12

Authors

Bond, Michael David

Abstract

Deployed software is never free of bugs. These bugs cause software to fail, wasting billions of dollars and sometimes causing injury or death. Bugs are pervasive in modern software, which is increasingly complex due to demand for features, extensibility, and integration of components. Complete validation and exhaustive testing are infeasible for substantial software systems, and therefore deployed software exhibits untested and unanalyzed behaviors. Software behaves differently after deployment due to different environments and inputs, so developers cannot find and fix all bugs before deploying software, and they cannot easily reproduce post-deployment bugs outside of the deployed setting. This dissertation argues that post-deployment is a compelling environment for diagnosing and tolerating bugs, and it introduces a general approach called post-deployment debugging. Techniques in this class are efficient enough to go unnoticed by users and accurate enough to find and report the sources of errors to developers. We demonstrate that they help developers find and fix bugs and help users get more functionality out of failing software. To diagnose post-deployment failures, programmers need to understand the program operations--control and data flow--responsible for failures. Prior approaches for widespread tracking of control and data flow often slow programs by two times or more and increase memory usage significantly, making them impractical for online use. We present novel techniques for representing control and data flow that add modest overhead while still providing diagnostic information directly useful for fixing bugs. The first technique, probabilistic calling context (PCC), provides low-overhead context sensitivity to dynamic analyses that detect new or anomalous deployed behavior. Second, Bell statistically correlates control flow with data, and it reconstructs program locations associated with data. We apply Bell to leak detection, where it tracks and reports program locations responsible for real memory leaks. The third technique, origin tracking, tracks the originating program locations of unusable values such as null references, by storing origins in place of unusable values. These origins are cheap to track and are directly useful for diagnosing real-world null pointer exceptions. Post-deployment diagnosis helps developers find and fix bugs, but in the meantime, users need help with failing software. We present techniques that tolerate memory leaks, which are particularly difficult to diagnose since they have no immediate symptoms and may take days or longer to materialize. Our techniques effectively narrow the gap between reachability and liveness by providing the illusion that dead but reachable objects do not consume resources. The techniques identify stale objects not used in a while and remove them from the application and garbage collector’s working set. The first technique, Melt, relocates stale memory to disk, so it can restore objects if the program uses them later. Growing leaks exhaust the disk eventually, and some embedded systems have no disk. Our second technique, leak pruning, addresses these limitations by automatically reclaiming likely leaked memory. It preserves semantics by waiting until heap exhaustion to reclaim memory--then intercepting program attempts to access reclaimed memory. We demonstrate the utility and efficiency of post-deployment debugging on large, real-world programs--where they pinpoint bug causes and improve software availability. Post-deployment debugging efficiently exposes and exploits programming language semantics and opens up a promising direction for improving software robustness.