The Parla runtime
Abstract
Writing high-performance programs for heterogeneous compute nodes poses many challenges in managing data-level and task-level parallelism across different processing units. Parla is a heterogeneous task-based programming framework that simplifies writing portable multi-device code: programmers express task-level parallelism with simple decorator annotations while still making full use of Python’s rich scientific programming stack.
Parla's underlying runtime system must efficiently execute a variety of task graphs on complex heterogeneous nodes. The runtime is divided into three phases: mapper, scheduler, and launcher. I present the design of each phase and discuss the motivation behind the design decisions, with particular attention to the efficient handling of GPU tasks. I show that the current runtime’s heuristic-based mapping policies perform comparably to optimal user-specified mappings on a variety of workloads. Lastly, I detail many areas of future work that could further improve runtime performance.