Techniques to improve the hard and soft error reliability of distributed architectures

dc.contributor.advisor: Keckler, Stephen W.
dc.creator: Shivakumar, Premkishore
dc.description.abstract: Aggressive technology scaling, rising on-chip integration, and the continued increase in microprocessor power and thermal density threaten both the hard and soft error reliability of future microprocessor designs. Designing low-overhead mechanisms for improving reliability will therefore be a critical requirement at future technologies. Technology constraints of wire delay and power consumption, together with limits on deep pipelining, have impelled a shift to distributed architectures that rely on modularity in design and on on-chip interconnection networks for communication, and that place a greater burden on software to exploit concurrency from the application to achieve high performance on the distributed substrate [1]. The focus of this dissertation is on architectural techniques for improving the hard and soft error reliability of future technology-scalable distributed architectures. We make the key observation that these underlying principles of distributed architectures have important synergies that can be exploited to improve the hard and soft error reliability of microprocessors at low overhead. Using a detailed end-to-end model for chip yield, we demonstrate that with only redundant rows and columns in memory arrays and caches, the yield of chip multiprocessors drops substantially, from 85% at 250nm to 60% at 50nm. We exploit three principles of modern and future distributed architectures: the abundant microarchitectural redundancy provided by modular design, the natural redundancy in communication paths provided by multi-hop, routed, on-chip networks, and the availability of greater software assistance. These allow the redundancy to be managed efficiently to improve yield at low performance overhead. Using modular redundancy alone, at the intra- and inter-processor granularity, we improve the yield of chip multiprocessors to 99.6% at 50nm, with a maximum performance reduction in any chip of less than 20%.
Further, we extend this technique to take advantage of the block-atomic, static-placement, dynamic-issue execution model of the TRIPS architecture to efficiently manage the redundancy provided by modular design and on-chip networks. Our evaluation of this compiler-assisted yield enhancement technique in the TRIPS architecture shows significant yield improvement with less than a 4% impact on performance. This dissertation also demonstrates quantitatively, through detailed modeling, that the raw soft error rate, especially that of combinational logic, will increase substantially at future technologies. This finding emphasizes the need for innovative solutions that extend soft error protection to latches and combinational logic while appropriately balancing power consumption, area, and complexity overheads. We propose a new class of better-than-worst-case soft error reliability techniques, called AVF throttling, that trade concurrency for a reduction in the amount of processor state vulnerable to soft errors. Because future architectures must increasingly rely on exploiting concurrency for high performance, they aggressively bring future program state into the processor and mine it for available parallelism, thus increasing the amount of vulnerable state. AVF throttling is based on the key observation that while exploiting concurrency on the critical path can significantly improve performance, the majority of the program has abundant slack and can be deferred, substantially reducing the amount of vulnerable state with negligible effect on execution time. Our evaluation in the TRIPS architecture shows that around 90% of the vulnerable state is due to slack. We design a hybrid AVF throttling technique that uses the compiler to estimate slack and the hardware to exploit it dynamically. Using the compiler for static slack estimation considerably reduces the complexity of the technique.
Further, the technique takes advantage of the TRIPS execution model and on-chip networks to exploit slack more efficiently, significantly improving reliability, by 25-42%, for a set of SPEC and EEMBC benchmarks. We also present a detailed comparison of AVF throttling with prior approaches, including redundant execution and selective redundant execution. Based on this comparison, we argue that while AVF throttling may provide a smaller absolute reliability improvement, it significantly reduces the power consumption and complexity overheads, making the three techniques appropriate for systems with different reliability requirements. Overall, this dissertation establishes that distributed architectures provide a good foundation for building a reliable system from unreliable components, and our results set a good starting point for further innovative research in this area.
dc.description.department: Computer Sciences
dc.rights: Copyright is held by the author. Presentation of this material on the Libraries' web site by University Libraries, The University of Texas at Austin was made possible under a limited license grant from the author, who has retained all copyrights in the works.
dc.subject.lcsh: Electronic digital computers--Reliability
dc.subject.lcsh: Computer architecture
dc.title: Techniques to improve the hard and soft error reliability of distributed architectures
dc.type.genre: Thesis
thesis.degree.department: Computer Sciences
thesis.degree.discipline: Computer Sciences
thesis.degree.grantor: University of Texas at Austin
thesis.degree.name: Doctor of Philosophy