Lightweight offload engines for worklist management and worklist-directed prefetching
The importance of irregular applications such as graph analytics is rapidly growing with the rise of Big Data. However, parallel graph workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due to poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic task generation. At high thread counts, execution time is dominated by worklist synchronization overhead and cache misses. Researchers have proposed hardware worklist accelerators to address scheduling costs, but these proposals often harden a specific scheduling policy and do not address high cache miss rates. This thesis presents Minnow, a technique that addresses both bottlenecks by augmenting each core in a CMP with a lightweight, memory-throughput-optimized engine connected through an accelerator interface. These engines offload worklist operations from worker threads, reducing synchronization costs and improving scalability. The engines also perform worklist-directed prefetching, a software prefetching technique that exploits knowledge of upcoming tasks to issue near-perfectly accurate and timely prefetches. In this thesis, we first characterize several graph applications within a popular graph analytics framework to determine their performance and bottlenecks. Next, Minnow and worklist-directed prefetching are discussed in detail, including the Minnow accelerator interface, microarchitecture, and prefetch flow control mechanism. Finally, the benefits of Minnow and worklist-directed prefetching are evaluated in a cycle-accurate microarchitectural simulator.