Designing on-chip memory systems for throughput architectures
Driven by the high arithmetic intensity and embarrassingly parallel nature of real-time computer graphics, GPUs became the first widespread throughput architecture. With the end of Dennard scaling and the plateau of single-thread performance, nearly all computer chips at all scales have now become explicitly parallel, containing a hierarchy of cores and threads. Initially, these individual cores were imagined to be no different from traditional uniprocessors, and parallel programs no different from traditional parallel programs. Like GPUs, these modern chips share finite on-chip resources between threads. This results in novel performance and optimization issues at any granularity of parallelism, from cell phones to GPUs. Unfortunately, the performance characteristics of these systems tend to be non-linear and counter-intuitive. The programmer's software stack has been slow in adapting to this paradigm shift. Compilers still focus primarily on optimizing single-thread performance at the expense of throughput. Existing parallel applications are not a perfect match for modern multicore, multithreaded processors. And existing methodologies for performance analysis and simulation are not aligned with multicore issues. This dissertation begins with a mathematical analysis of throughput performance in the presence of shared on-chip resources. When cache hit rates begin to fall, there is a steep drop-off in throughput performance. An optimistic view of this regime is that even small improvements to cache efficiency offer significant benefits. This motivates the exploration of general throughput optimizations in both hardware and software that apply to both coarse-grained and fine-grained parallel architectures, requiring no programmer intervention or tuning. This dissertation provides two such solutions. The first solution is a compiler optimization called "loop microfission" that can boost throughput performance by up to 50%.
In the context of the intrachip scalability of supercomputing applications, we demonstrate the failings of conventional performance tuning software and compiler algorithms in the presence of shared resources. We introduce a new approach to throughput optimization, including a memory-friendly performance analysis tool, and show that techniques for throughput optimization are similar to traditional optimizations, but require new priorities. The second solution is a hardware optimization called Arbitrary Modulus Indexing (AMI), a technique that generalizes efficient implementations of the DIV/MOD operation from Mersenne primes to all integers. We show that the primary performance bottlenecks in modern GPUs for regular, memory-intensive applications are bank and set conflicts in the shared on-chip memory system. AMI completely eliminates conflicts in all facets of the memory system at negligible hardware cost, and has even broader potential for optimizations throughout computer architecture.
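To make the Mersenne starting point concrete, the following is a minimal Python sketch of the classical shift-and-add reduction modulo 2^k - 1, together with a toy bank-conflict count. It illustrates only the well-known special case that the abstract says AMI generalizes; it is not the AMI technique itself. The function names, the 32-thread access group, and the bank counts of 32 and 31 are assumptions chosen for illustration.

```python
from collections import Counter


def mod_mersenne(x: int, k: int) -> int:
    """Return x mod (2**k - 1) for x >= 0 using only shifts, masks, and adds."""
    m = (1 << k) - 1
    # Because 2**k is congruent to 1 modulo m, the k-bit digits of x
    # can simply be summed without changing the residue.
    while x > m:
        x = (x & m) + (x >> k)
    # The folded value can land exactly on m, which represents 0.
    return 0 if x == m else x


def worst_bank_load(bank_of, stride: int, num_threads: int = 32) -> int:
    """Largest number of simultaneous accesses that map to a single bank.

    bank_of is a hypothetical address-to-bank mapping; thread t is assumed
    to access word address t * stride.
    """
    counts = Counter(bank_of(t * stride) for t in range(num_threads))
    return max(counts.values())


# 32 banks indexed by the low address bits: a stride of 32 words sends
# every thread to bank 0.
pow2_load = worst_bank_load(lambda addr: addr & 31, stride=32)

# 31 banks (a Mersenne prime) indexed with the shift-and-add reduction:
# the same stride spreads across the banks.
mersenne_load = worst_bank_load(lambda addr: mod_mersenne(addr, 5), stride=32)

print(pow2_load, mersenne_load)  # prints "32 2"
```

In this toy model the power-of-two mapping serializes all 32 accesses, while the modulus-31 mapping leaves at most two accesses per bank, the minimum possible when 32 threads share 31 banks.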