Hardware-assisted data movement optimizations for heterogeneous system architectures
Heterogeneous systems have emerged as state-of-the-art computing solutions. Such systems couple host cores with acceleration units that provide massive compute capability within limited power budgets. Compute- and/or memory-intensive regions of applications are often offloaded to these accelerators, which therefore naturally exchange data with the host core. This complex scenario poses a key challenge: how do we optimize data movement between the host core and accelerators from a holistic, system-level perspective?
My research focuses on addressing the above question. Data movement optimizations come in two flavors: (1) maximizing locality by keeping data close to its compute, and (2) moving the computation itself close to the data. Such optimizations fundamentally depend on applications as well as their interaction with the underlying architecture. Exploring the associated tradeoffs first and foremost requires an accurate modeling infrastructure.
To that end, I first propose a systematic simulator calibration methodology that provides a faithful baseline for accurately modeling targeted system architectures. Results show that an unrepresentative baseline can lead to misleading conclusions in heterogeneous architecture studies. Using this calibrated simulator, I then study accelerator integration and the associated data movement tradeoffs under different on- and off-chip coupling scenarios. From this study, I observe that applications can benefit from integrating accelerators closer to the host core on chip, achieving up to 20% better performance with 17% less total energy consumption than an off-chip integration. However, significant software modifications are required to fully unlock these benefits, and traditional software overheads for accelerator invocation and offload can further limit acceleration gains.

To address these challenges, I propose three hardware-assisted approaches that transparently optimize data movement in heterogeneous architectures with little to no software or programmer overhead. To perform automatic, software-transparent, fine-grain data staging and synchronization between on-chip integrated components, I first introduce CASPHAr (Cache-Managed, Fine-Grain Accelerator Staging and Pipelining in On-Chip Heterogeneous Architectures). CASPHAr tracks and synchronizes producer and consumer accesses at cache-line granularity. As soon as a fraction of the shared data is produced and becomes ready in the LLC, it is forwarded to the waiting consumer for processing, reducing data spills caused by unnecessarily long lifetimes of shared data in the cache. Results show that CASPHAr improves performance by up to 23% and saves up to 22% energy over baseline accelerated execution.
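The cache-line-granular producer/consumer staging idea can be illustrated with a minimal software sketch. This is not the actual hardware design; the tracker class, its names, and the 64-byte line size are assumptions chosen for illustration only.

```python
# Illustrative sketch of cache-line-granular producer/consumer staging,
# loosely modeled on the fine-grain staging idea described above.
# All names and parameters are hypothetical, not the hardware design.

LINE_SIZE = 64  # bytes per cache line (assumed)

class StagingTracker:
    """Tracks which cache lines of a shared buffer the producer has
    written, so a waiting consumer can start processing as soon as
    partial data is ready instead of waiting for the whole buffer."""

    def __init__(self, buffer_bytes):
        self.num_lines = (buffer_bytes + LINE_SIZE - 1) // LINE_SIZE
        self.ready = [False] * self.num_lines
        self.next_to_consume = 0

    def produce(self, byte_offset, length):
        """Producer marks the lines covering [offset, offset+length) ready."""
        first = byte_offset // LINE_SIZE
        last = (byte_offset + length - 1) // LINE_SIZE
        for line in range(first, last + 1):
            self.ready[line] = True

    def consumable_lines(self):
        """Return (and advance past) the contiguous run of ready lines
        the consumer may process now, preserving in-order consumption."""
        start = self.next_to_consume
        while (self.next_to_consume < self.num_lines
               and self.ready[self.next_to_consume]):
            self.next_to_consume += 1
        return list(range(start, self.next_to_consume))
```

For example, after a producer writes the first 128 bytes of a 256-byte buffer, the consumer can already be handed lines 0 and 1 while lines 2 and 3 are still being produced.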
I further introduce Flock, a shared cache management scheme that improves system-wide performance across heterogeneous components running independent kernels or applications. Different from prior work, Flock employs holistic performance proxies that capture cache misses and hits at each level of the memory hierarchy when selecting a cache partitioning. This enables Flock to increase system throughput by taking a global view of each core's performance instead of focusing on LLC misses and hits in isolation. Moreover, Flock applies a new shared cache replacement scheme that adapts to the varying access rates of individual cores, preventing a core with high access intensity from dominating the cache. Results show that Flock improves performance by up to 12.5% over state-of-the-art solutions.

Finally, I present the Non-Uniform Compute Device (NUCD) system architecture for low-latency, generic accelerator offload that moves computation closer to the data. Different from conventional offload mechanisms that rely primarily on device drivers and software queues, NUCD extends the host core micro-architecture to enable low-latency, out-of-order task offload to heterogeneous devices. Results demonstrate that the NUCD system architecture achieves average performance improvements of 21%-128% over a conventional driver-based offload mechanism, in turn enabling entirely new forms of fine-grain task offloading that would otherwise see no performance benefit.
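The holistic-proxy partitioning idea behind Flock can be sketched in a few lines: rank candidate LLC way allocations by an estimate of whole-core memory stall cycles derived from hits and misses at every level, rather than by LLC hits alone. The latencies, the miss-curve model, and all names below are assumptions for illustration, not Flock's actual mechanism.

```python
# Illustrative sketch of holistic (multi-level) partition selection:
# choose the LLC way split that minimizes estimated total memory stall
# cycles across cores, not the split that maximizes LLC hits alone.
# Latencies and the miss-curve model are assumed for illustration.

L2_HIT, LLC_HIT, DRAM = 12, 40, 200  # access latencies in cycles (assumed)

def core_stall_cycles(accesses, l2_hit_rate, llc_miss_curve, ways):
    """Estimated memory stall cycles for one core given `ways` LLC ways.
    `llc_miss_curve[w]` is the miss rate of L2-miss traffic with w ways."""
    l2_misses = accesses * (1 - l2_hit_rate)
    llc_misses = l2_misses * llc_miss_curve[ways]
    return (accesses * l2_hit_rate * L2_HIT
            + (l2_misses - llc_misses) * LLC_HIT
            + llc_misses * DRAM)

def best_partition(cores, total_ways):
    """Exhaustively pick the two-core way split minimizing summed stall
    cycles -- a global view of each core's performance, unlike an
    LLC-hit-only objective. Each core is (accesses, l2_hit_rate, curve)."""
    assert len(cores) == 2
    best = None
    for w0 in range(1, total_ways):
        w1 = total_ways - w0
        cost = (core_stall_cycles(*cores[0], w0)
                + core_stall_cycles(*cores[1], w1))
        if best is None or cost < best[0]:
            best = (cost, (w0, w1))
    return best[1]
```

In this toy model, a core whose miss curve is insensitive to capacity (e.g. one that mostly hits in L2) naturally cedes LLC ways to a capacity-sensitive neighbor, which an LLC-only objective could miss.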