Understanding the Software Needs of High Performance Computer Users with XALT

Date

2015

Authors

McLay, Robert
Fahey, Mark

Publisher

Texas Advanced Computing Center

Abstract

The dataset is produced by XALT, software that tracks executables and libraries installed on the High Performance Computing (HPC) resource Stampede (https://www.tacc.utexas.edu/stampede/) at the Texas Advanced Computing Center (TACC) (https://www.tacc.utexas.edu/). XALT collects information about community codes and libraries used in MPI-based jobs on open-science HPC systems, also known as supercomputers. To conduct large-scale data analysis and simulations, researchers submit jobs to supercomputers. These resources are maintained at HPC centers such as TACC, where XALT data is used to determine which software libraries researchers use most often, to debug software libraries, to measure job performance, and to conduct cost analysis based on gathered metrics such as start time and job duration (see the article describing how XALT data helps make supercomputers more efficient: http://dx.doi.org/10.1109/HUST.2014.6). Sociologists and scientific software producers have also identified possible reuses for this data, such as inferring collaborations between different domain sciences based on usage of the same community libraries. Illustrations of proof-of-concept research on these themes using XALT data can be seen in the file proof_of_concept_images.pdf, available for download. The public XALT dataset, in JSON format, contains information on the number of nodes, libraries, and executables used by each user running a given computational job on Stampede, a supercomputer deployed and maintained at TACC. As part of the curation process, personally identifying information is anonymized by assigning each user a unique user ID. Personal codes, which may include the names of users, are also anonymized with a hash. Additional documentation is available to help users understand the dataset and enhance its reuse.
The documents include the data dictionary describing each data element recorded in the dataset per job, a copy of the CC-BY license for the dataset, and a listing of the most common community codes identified from the data. To make the data useful to a broader audience, we asked users for feedback on how they would like the data presented in terms of size, format, content, and availability modes. To understand their needs, we used the questionnaire in interview_protocol.pdf, which can be downloaded from this repository. Because we want to continue receiving feedback from users, we posted a survey (https://utexas.qualtrics.com/SE/?SID=SV_cOB4pHrOiDHZoLX) that can be completed in less than 3 minutes. Information gathered from this survey will help us improve the dataset.
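As an illustration of the kind of reuse described above (finding the most often used libraries), the following is a minimal Python sketch, not part of the published dataset or its documentation. The field names `user_id`, `num_nodes`, `executable`, and `libraries`, and the sample records themselves, are assumptions made up for this example; the actual element names are defined in the data dictionary and should be substituted.

```python
import json
from collections import Counter


def count_library_usage(jobs):
    """Return a Counter mapping each library name to the number of jobs using it."""
    counts = Counter()
    for job in jobs:
        # Use a set so a library listed twice in one job is counted once.
        # "libraries" is a hypothetical field name; check the data dictionary.
        counts.update(set(job.get("libraries", [])))
    return counts


# Made-up records standing in for one monthly JSON file (hypothetical schema).
sample_json = """
[
  {"user_id": "u1", "num_nodes": 4,  "executable": "a.out", "libraries": ["libfftw3", "libhdf5"]},
  {"user_id": "u2", "num_nodes": 16, "executable": "sim",   "libraries": ["libhdf5", "libhdf5"]},
  {"user_id": "u1", "num_nodes": 1,  "executable": "b.out", "libraries": []}
]
"""

jobs = json.loads(sample_json)
usage = count_library_usage(jobs)

# Print libraries from most to least used.
for name, n_jobs in usage.most_common():
    print(name, n_jobs)
```

For a real monthly file, the inline sample would be replaced by `json.load` on the opened file, assuming the file holds a single JSON array of job records; if the files use a different layout, the loading step would need to change accordingly.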

The first XALT data generated on Stampede was issued in September of 2015. The dataset will continue to be generated and published until Stampede is decommissioned. The data may be downloaded as a quarterly zipped package containing three data files (one per month). Users can also download the data dictionary, the community codes dictionary, and a metadata file for the data from http://web.corral.tacc.utexas.edu/XALT/. The process by which we curated the data is described in: Maria Esteva, Sandra Sweat, Robert McLay, Weijia Xu, Sivamar Kulaskeran (2016). Data Curation with a Focus on Reuse. Proceedings of the Joint Conference on Digital Libraries, June 19 – 23, Newark, New Jersey.

Description

To access and download XALT data and metadata, copy this link (http://web.corral.tacc.utexas.edu/XALT/) into your browser's address bar. To accommodate the anticipated growth of this dataset over time, the data is not hosted at this location. The description of the data elements, a copy of the CC-BY license, the catalogue metadata file, and a listing of the software libraries at the time of initial publication are available for download.
http://dx.doi.org/10.15781/T2PP4P

Citation

McLay, Robert; Fahey, Mark R. (2015): Understanding the Software Needs of High End Computer Users with XALT; Texas Advanced Computing Center. http://dx.doi.org/10.15781/T2PP4P