Polymorphous architectures: a unified approach for extracting concurrency of different granularities
Abstract
Processor architects today are faced by two daunting challenges: emerging
applications with heterogeneous
computation needs and technology limitations
of power, wire delay, and process variation. Designing multiple application-specific processors or specialized architectures introduces design complexity, creates software programmability problems, and reduces economies of scale. There is a pressing need for design methodologies that
can provide support for heterogeneous
applications, combat processor
complexity, and achieve economies
of scale. In this dissertation, we introduce the notion of architectural polymorphism
to build such scalable processors that provide support for heterogeneous
computation by supporting different granularities of parallelism. Polymorphism configures coarse-grained microarchitecture blocks to provide an adaptive and flexible processor substrate. Technology scalability is achieved by building the architecture from scalable and modular microarchitecture blocks.
We use the dataflow graph as the unifying abstraction layer across three granularities of parallelism: instruction-level, thread-level, and data-level. To first order, this granularity of parallelism is the main difference between different classes of applications. All programs are expressed in terms of dataflow graphs and mapped directly to the hardware, partitioned as required by the granularity of parallelism. We introduce Explicit Data Graph Execution (EDGE) ISAs, a class of ISAs that serve as an architectural solution for efficiently expressing parallelism and building technology-scalable architectures.
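To make the dataflow-graph execution model concrete, the following is a minimal toy sketch, not the actual TRIPS ISA or encoding: each instruction names its consumer instructions directly, and an instruction fires once all of its operands have arrived, rather than reading and writing a shared register file. All names and structures here are illustrative assumptions.

```python
# Toy dataflow-graph execution in the spirit of an EDGE ISA.
# Each instruction encodes its consumers (direct targets) and fires
# when all operands have arrived (dataflow firing rule).
# Illustrative sketch only; not the TRIPS encoding.

class Instr:
    def __init__(self, op, targets, n_operands):
        self.op = op                  # function computing the result
        self.targets = targets        # indices of consumer instructions
        self.n_operands = n_operands  # operand count needed to fire
        self.operands = []

def execute_block(instrs, inputs):
    """Run one block; 'inputs' maps instruction index -> initial operands."""
    results = {}
    ready = []
    for idx, vals in inputs.items():
        instrs[idx].operands.extend(vals)
        if len(instrs[idx].operands) == instrs[idx].n_operands:
            ready.append(idx)
    while ready:
        idx = ready.pop()
        instr = instrs[idx]
        value = instr.op(*instr.operands)
        results[idx] = value
        for t in instr.targets:  # forward result point-to-point to consumers
            instrs[t].operands.append(value)
            if len(instrs[t].operands) == instrs[t].n_operands:
                ready.append(t)
    return results

# Dataflow graph for (a + b) * (a - b):
block = [
    Instr(lambda x, y: x + y, targets=[2], n_operands=2),  # I0: add
    Instr(lambda x, y: x - y, targets=[2], n_operands=2),  # I1: sub
    Instr(lambda x, y: x * y, targets=[],  n_operands=2),  # I2: mul
]
out = execute_block(block, {0: [5, 3], 1: [5, 3]})
print(out[2])  # (5 + 3) * (5 - 3) = 16
```

Note how the add and subtract are independent and may fire in either order, which is exactly the instruction-level parallelism a distributed substrate can exploit.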
We developed the TRIPS architecture implementing an EDGE ISA
using a heavily partitioned and distributed microarchitecture to achieve technology
scalability. The two most significant features of the TRIPS microarchitecture are its heavily partitioned and modular design, and its use of microarchitecture networks for communication across modules. We have also built a prototype TRIPS chip in 130nm ASIC technology
composed of two processor
cores and a distributed 1MB Non-Uniform Cache Access Architecture (NUCA)
on-chip memory system.
Our performance results show that the TRIPS microarchitecture, which provides a 16-issue machine with a 1024-entry instruction window, can sustain good instruction-level parallelism. On a set of hand-optimized kernels, we see IPCs in the range of 4 to 6, and on a set of benchmarks with ample data-level parallelism (DLP), compiler-generated code produces IPCs in the range of 1 to 4. On the EEMBC and SPEC CPU2000 benchmarks, we see IPCs in the range of 0.5 to 2.3. Compared to the Alpha 21264, a high-performance architecture tuned for ILP, TRIPS performs up to 3.4 times better on the hand-optimized kernels. However, compiler-generated binaries for the DLP, EEMBC, and SPEC CPU2000 benchmarks perform worse on TRIPS than on the Alpha 21264. With more aggressive compiler optimization we expect the performance of compiler-produced binaries to improve.
The polymorphous mechanisms proposed in this dissertation are effective at exploiting thread-level parallelism and data-level parallelism. When executing four threads on a single processor, we see high levels of processor utilization, with IPCs in the range of 0.7 to 3.9 for an application mix of EEMBC and SPEC CPU2000 workloads. When executing programs with DLP, the polymorphous mechanisms we propose provide a harmonic mean speedup of 2.1X across a set of DLP workloads, compared to an execution model of extracting only ILP. Compared to specialized architectures, these mechanisms provide competitive performance using a single execution substrate.