Reducing critical path execution time by breaking critical loops

Access full-text files




Brown, Mary Douglass

Journal Title

Journal ISSN

Volume Title



Increasing bandwidth and decreasing latency are two orthogonal techniques for improving program performance. However, many studies have shown that microarchitectural designs that improve one of these may have a negative effect on the other. For example, bypass latency, or the time it takes to forward a result from one functional unit to another, may increase as the number of functional units increases because of longer wires and multiplexor delays. As another example, techniques used to exploit ILP such as out-of-order execution with large instruction windows will increase dynamic instruction scheduling latency. While exploiting ILP allows the number of instructions processed per cycle (IPC) to be increased, the increased scheduling latency may lower clock frequency, increase power consumption, or even negatively impact IPC if additional cycles are needed to schedule instructions for execution. This dissertation addresses microarchitectural techniques that allow ILP to be increased but have traditionally had a negative impact on the latency of the execution core. Critical loops are pieces of logic that must be evaluated serially by dependent instructions. A program’s critical path is the set of instructions that determine the execution time of a program, and this set of instructions is determined in part by critical loop latency. The length of a program’s critical path can be shortened by breaking critical data dependence loops, thereby improving performance. This dissertation introduces ways to break three critical data dependence loops: the execute-bypass loop, the steering loop, and the scheduling loop. The execute-bypass loop is reduced by means of clustering and steering. The steering loop is broken by introducing scalable steering algorithms that perform as well as the best previously published algorithms that did not scale with issue width. The scheduling loop is broken into multiple smaller loops, thus reducing the critical path through the scheduling logic.