Low-cost assertion-based fault tolerance in hardware and software
MetadataShow full item record
In the recent past, there has been an increasing demand for low-cost safety critical applications. Custom-off-the-shelf (COTS) processors are preferred for usage in these applications due to their low cost. The reliability provided by these processors, however, is not sufficient to meet the safety requirements of these applications. Furthermore, due to the trends followed by the processor industry to enhance the performance of processors, the reliability provided by these processors is projected to decrease in the future. Traditional techniques for enhancing the reliability of computer systems are not viable for these applications due to the high overheads (and hence cost) incurred by them. This thesis describes fault tolerance techniques tailored for these applications, adhering to the tight overhead constraints in the area, memory, and performance dimensions. Techniques at both the hardware level (to be used by the processor manufacturers) and the software level (to be used by the application developers) are presented. At the hardware level, this thesis presents a technique for detecting faults in the processor control logic, for which techniques proposed previously incur very high overheads. Rather than detect all modeled faults, the technique protects against a subset of faults such that the best possible overall protection is achieved while incurring only permissible overheads. This subset of faults is selected depending on the probability of each individual fault causing damage to the architectural state of the processor and the overhead incurred in protecting against the fault. The technique is validated on control logic modules of an industrial processor. At the software level, this thesis concentrates on a category of errors called control flow errors. We describe an error detection technique which incurs lower overheads than any of the previously proposed techniques while at the same time detecting more errors than all of them. Even these low overheads may be too restrictive for some applications. For such applications, we present a technique for providing the best error detection capability possible at the overheads allowed. Once an error is detected, error recovery actions need to be performed. In this thesis, we present an error correction technique which automatically performs error recovery with a very low latency. The technique reuses the information available from the error detection technique to perform the recovery and hence, does not incur any additional performance penalty. All the techniques proposed at the software level have been integrated with GCC, a commonly used software compiler. This permits the fault tolerance to be incorporated into the application automatically as part of the compilation process itself. Evaluations are performed on SPEC and MiBench benchmark programs using an in-house software error injection framework.