Design and evaluation of a technology-scalable architecture for instruction-level parallelism
Future performance improvements must come from the exploitation of concurrency at all levels. Recent approaches that focus on thread-level and data-level concurrency are a natural fit for certain application domains, but it is unclear whether they can be adapted efficiently to eliminate serial bottlenecks. Conventional superscalar hardware that instead focuses on instruction-level parallelism (ILP) is limited by power inefficiency, on-chip wire latency, and design complexity. Ultimately, poor single-thread performance and Amdahl’s law will inhibit overall performance growth even on parallel workloads. To address this problem, we undertook the challenge of designing a scalable, wide-issue, large-window processor that mitigates complexity, reduces power overheads, and exploits ILP to improve single-thread performance in future wire-delay-dominated technologies. This dissertation describes the design and evaluation of the TRIPS architecture for exploiting ILP. The TRIPS architecture belongs to a new class of instruction set architectures called Explicit Data Graph Execution (EDGE) architectures, which use large dataflow graphs of computation and explicit producer-consumer communication to express concurrency to the hardware. We describe how these architectures match the characteristics of future sub-45 nm CMOS technologies to mitigate complexity and improve concurrency at reduced overheads. We describe the architectural and microarchitectural principles of the TRIPS architecture, which exploits ILP by issuing instructions widely, in dynamic dataflow fashion, from a large distributed window of instructions. We then describe our specific contributions to the development of the TRIPS prototype chip, which was implemented in a 130 nm ASIC technology and consists of more than 170 million transistors. In particular, we describe the implementation of the distributed control protocols that offer various services for executing a single program in the hardware.
Finally, we describe a detailed evaluation of the TRIPS architecture and identify the key determinants of its performance. In particular, we describe the development of the infrastructure required for a detailed analysis, including a validated performance model, a highly optimized suite of benchmarks, and critical path models that identify various architectural and microarchitectural bottlenecks at a fine level of granularity. On a set of highly optimized benchmark kernels, the manufactured TRIPS parts outperform a conventional superscalar processor by a factor of 3 on average. We find that the automatically compiled versions of the same kernels have yet to reap the benefits of the high-ILP TRIPS core, but exceed the performance of the superscalar processor in many cases. Our results indicate that the overhead of the various control protocols that manage the overall execution in the processor has only a modest effect on performance. However, operand communication between various components in the distributed microarchitecture accounts for nearly a third of the execution cycles. Fanout instructions, which are necessitated by the limited, fixed-width encoding in the dataflow instruction set, also add non-trivial performance overhead. Our results point to an exciting line of future research to overcome these limitations and achieve low-overhead distributed dataflow execution.
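The fanout overhead mentioned above can be illustrated with a rough counting sketch. It assumes that every instruction, including a fanout instruction, can name at most `max_targets` consumers. The parameter `max_targets`, its default of 2, and the function name `fanout_cost` are hypothetical, since the abstract does not state the encoding limit. Under that assumption, a value with many consumers has to be forwarded through a tree of fanout instructions, which adds both instructions and communication hops.

```python
def fanout_cost(num_consumers, max_targets=2):
    """Estimate the fanout overhead for one produced value.

    Assumption (not from the source): each instruction, including a
    fanout instruction, can name at most `max_targets` consumers.

    Returns (fanout_instructions, extra_levels):
      fanout_instructions: minimum number of fanout instructions needed
      extra_levels: extra hops on the longest path of a balanced tree
    """
    if max_targets < 2:
        raise ValueError("max_targets must be at least 2 to build a tree")
    if num_consumers <= max_targets:
        return 0, 0  # the producer can name every consumer directly

    # Each fanout instruction occupies one target slot and supplies
    # max_targets new slots, a net gain of (max_targets - 1) slots.
    extra_slots_needed = num_consumers - max_targets
    gain = max_targets - 1
    fanout_instructions = -(-extra_slots_needed // gain)  # ceiling division

    # A balanced tree with d fanout levels reaches max_targets ** (d + 1)
    # consumers, so count levels until all consumers are covered.
    extra_levels = 0
    reach = max_targets
    while reach < num_consumers:
        reach *= max_targets
        extra_levels += 1
    return fanout_instructions, extra_levels


if __name__ == "__main__":
    for consumers in (2, 5, 8):
        print(consumers, fanout_cost(consumers))
```

With two targets per instruction, this model needs `n - 2` fanout instructions for a value with `n > 2` consumers, so the overhead grows linearly in instruction count and logarithmically in added hops. It counts instructions only and does not model the operand network latency or instruction placement.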