Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines
Silicon reliability has reemerged as a very important problem in digital system design. As voltage and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single event upsets, transistor and interconnect aging and circuit variability. In particular, computational functional units are very challenging to protect because current redundant execution techniques have a high power and area overhead, cannot guarantee detection of some errors and cause a substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem. To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal redundancy technique suitable for pipelined computational functional units. SITR is able to detect most temporary errors, is area and energy-efficient and can be easily incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to design very aggressive bypass networks capable of achieving high instruction throughput, by tolerating timing violations. To address the performance degradation caused by redundant execution, this dissertation proposes using a tiled-data ow model of computation because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled-data flow architecture, we enhance the reliability of its datapath, including computational logic, interconnection network and storage structures. Computations are performed speculatively using SITR while traditional information redundancy techniques are used to protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions. Upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled-data flow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted for data flow architectures that enables speculation only when it is likely to benefit performance.