TACCSTER 2022 Proceedings
Permanent URI for this collection: https://hdl.handle.net/2152/115318
Browsing TACCSTER 2022 Proceedings by Author "Dawson, Clint"
TACC Job Manager - A Lightweight HPC Application and Job Management Library for TACC Systems (2022-09-29)
del-Castillo-Negrete, Carlos; Pachev, Benjamin; Dawson, Clint
A major challenge in many HPC workflows is the amount of manual work required to interact with HPC systems. Actions such as setting up data and code for an HPC job, executing and monitoring HPC jobs, and downloading or uploading data to HPC systems can often be the most time-consuming parts of conducting computational science research. These complications make it painfully difficult to reproduce results, share code, or simply run slight variations on the same workflow. For example, in hurricane storm-surge modeling, ensemble simulations are common in many applications, from parameter estimation to flood risk assessment, yet they are very hard to configure and reproduce, especially when dealing with a large, complex domain, as is the case in many real-world modeling scenarios. We introduce TACC Job Manager - a lightweight job management library for HPC applications that solves the issues of reproducibility and code sharing. It allows for job setup, submission, and monitoring - all via an open-source Python library. This allows workflows to be scripted and easily shared. In addition, the library facilitates modular workflow development, with a clear distinction between applications and jobs. Although it is initially targeted at TACC systems, it can be easily extended to work on other supercomputing resources. We demonstrate the use of TACC Job Manager with several real workflows. In each case, the library significantly sped up the development cycle - from initial benchmarking to the final simulation.