Architectural techniques to accelerate multimedia applications on general-purpose processors
MetadataShow full item record
General-purpose processors (GPPs) have been augmented with multimedia extensions to improve performance on multimedia-rich workloads. These extensions operate in a single instruction multiple data (SIMD) fashion to extract data level parallelism in multimedia and digital signal processing (DSP) applications. This dissertation consists of a comprehensive evaluation of the execution characteristics of multimedia applications on SIMD enhanced GPPs, detection of bottlenecks in the execution of multimedia applications on SIMD enhanced GPPs, and the design and implementation of architectural techniques to eliminate and alleviate the impact of the various bottlenecks to accelerate multimedia applications. This dissertation identifies several bottlenecks in the processing of SIMD enhanced multimedia and DSP applications on GPPs. It is found that approximately 75-85% of instructions in the dynamic instruction stream of media workloads are not performing useful computations but merely supporting the useful computations by performing address generation, address transformation/data reorganization, loads/stores, and loop branches. This leads to an underutilization of the SIMD computation units with only 1-12% of the peak SIMD throughput being achieved. This dissertation proposes the use of hardware support to efficiently execute the overhead/supporting instructions by overlapping them with the useful computation instructions. A 2-way GPP with SIMD extensions augmented with the proposed MediaBreeze hardware significantly outperforms a 16-way SIMD GPP without MediaBreeze hardware on multimedia kernels. On multimedia applications, a 2-/4-way SIMD GPP augmented with MediaBreeze hardware is superior to a 4-/8-way SIMD GPP without MediaBreeze hardware. The performance improvements are achieved at an area cost that is less than 0.3% of current GPPs and power consumption that is less than 1% of the total processor power without elongating the critical path of the processor.