Browsing by Subject "data curation"
Now showing 1 - 3 of 3
Item: Modeling Natural Hazards Engineering Data to Cyberinfrastructure (SciDataCon2016, 2016-09-13). Esteva, Maria; Brandenberg, Scott J.; Eslami, Mohammad M.; Adair, Ashley; Kulasekaran, Sivakumar A.

DesignSafe-CI is an end-to-end data lifecycle management, analysis, and publication cloud platform for natural hazards engineering. To facilitate ongoing data curation and sharing in a cloud environment that is intuitive to end users, developers and curators teamed with experts in the different hazards to design data models and vocabularies that map their research workflows and domain terminology. The six experimental data models emphasize provenance through relationships between research processes, data, and their documentation, and highlight commonalities between experiment types. They mediate between the user interface and the repository layers of the cyberinfrastructure to automate tasks such as organizing data and facilitating its description. Using data from triaxial experiments, we conducted a user evaluation of the geotechnical data model, both for its fitness to real data and for purposes of data understandability during reuse. The results of the evaluation guided testing and selection of the Fedora 4 repository backend to enhance data discovery and reuse.

Item: A Service to Manage Data Identity Over Time (Open Repositories 2018, http://www.or2018.net/, 2018-06-07). Esteva, Maria; Walls, Ramona; Magill, Andrew B.

Much of the burden of sustaining an open data environment is borne by researchers, who must curate and publish the datasets they create. This is especially taxing for research that spans many years and team members, and has data distributed across different locations and publication stages. Funded by NSF, Identifier Services (IDS) is a prototype to explore the technical feasibility of, and community response to, services that help manage the identity of large, ramified genomics datasets stored across distributed resources.
IDS uses open cyberinfrastructure resources and tools to track and verify the location, integrity, content changes, and metadata of datasets that have both active and published components. It can be used as an independent service or integrated with open repositories to track the evolution of published and active datasets over time. IDS provides landing pages where datasets with global identifiers are represented as graphs reflecting the files' provenance, to facilitate reuse. The results point to a not-so-distant future in which open repositories and private storage systems are interconnected through services that let users maintain a complete map of their datasets, independent of location and publication status.

Item: Understanding the Software Needs of High Performance Computer Users with XALT (Texas Advanced Computing Center, 2015). McLay, Robert; Fahey, Mark

The dataset is produced by the software XALT, which tracks executables and libraries installed on the High Performance Computing (HPC) resource Stampede (https://www.tacc.utexas.edu/stampede/) at the Texas Advanced Computing Center (TACC) (https://www.tacc.utexas.edu/). XALT tracks and collects information about the community codes and libraries used by MPI-based jobs on open-science HPC systems, also known as supercomputers. To conduct large-scale data analyses and simulations, researchers submit jobs to supercomputers. These resources are maintained in HPC centers such as TACC, where XALT data is used to determine which software libraries researchers use most often, to debug software libraries, to measure job performance, and to conduct cost analyses based on gathered metrics such as start time and job duration (see the article describing how XALT data helps make supercomputers more efficient: http://dx.doi.org/10.1109/HUST.2014.6).
Sociologists and scientific software producers have also identified possible reuses for these data, such as inferring collaborations between different domain sciences based on usage of the same community libraries. Illustrations of proof-of-concept research done around these themes with XALT data can be seen in the file proof_of_concept_images.pdf, available for download. The public XALT dataset, in JSON format, contains information on the number of nodes, libraries, and executables used by each user running a given computational job on Stampede, a supercomputer deployed and maintained at TACC. As part of the curation process, personally identifying information is anonymized by assigning a unique user id. Personal codes, which may include the names of users, are anonymized with a hash. Additional documentation is available to help users understand the dataset and enhance its reuse. The documents include: a data dictionary describing each data element recorded in the dataset per job, a copy of the CC-BY license for the dataset, and a listing of the most common community codes identified from the data. To make the data useful to a broader audience, we asked users for feedback about how they would like the data to be presented to them in terms of size, format, content, and availability modes. To understand their needs, we used the questionnaire in interview_protocol.pdf, which can be downloaded from this repository. Because we want to continue receiving feedback from users, we posted a survey that can be completed in less than 3 minutes at the following link: https://utexas.qualtrics.com/SE/?SID=SV_cOB4pHrOiDHZoLX. Information gathered from this survey will help us improve the dataset. The first XALT data generated on Stampede was published in September 2015. The dataset will continue to be generated and published until Stampede is decommissioned.
The data may be downloaded as a quarterly zipped package containing three data files (one per month). Users can also download the data dictionary, the community codes dictionary, and a metadata file for the data from: http://web.corral.tacc.utexas.edu/XALT/. A paper describing the process by which we curated the data is: Maria Esteva, Sandra Sweat, Robert McLay, Weijia Xu, Sivakumar Kulasekaran (2016) Data Curation with a Focus on Reuse. Proceedings of the Joint Conference on Digital Libraries, June 19-23, Newark, New Jersey.
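The per-job JSON records described above lend themselves to simple reuse analyses, such as identifying the most-used community libraries. Below is a minimal, hypothetical Python sketch that tallies library usage across job records and mimics the hash-based anonymization applied during curation; the field names (`user`, `num_nodes`, `libraries`) and the salt are illustrative assumptions only, since the actual schema is defined in the dataset's data dictionary.

```python
import hashlib
from collections import Counter


def anonymize_user(username: str) -> str:
    """Replace a username with a stable anonymous id.

    The curation process assigns a unique anonymized user id; this
    salted SHA-256 hash is an illustration, not the actual scheme.
    """
    return hashlib.sha256(("xalt-salt:" + username).encode()).hexdigest()[:12]


def top_libraries(job_records, n=5):
    """Count how often each shared library appears across job records."""
    counts = Counter()
    for job in job_records:
        for lib in job.get("libraries", []):
            counts[lib] += 1
    return counts.most_common(n)


# Toy records standing in for one month of the JSON dataset
# (in practice these would be loaded from the downloaded files).
jobs = [
    {"user": anonymize_user("alice"), "num_nodes": 16,
     "libraries": ["libmpi.so", "libfftw3.so"]},
    {"user": anonymize_user("bob"), "num_nodes": 4,
     "libraries": ["libmpi.so", "libhdf5.so"]},
]

print(top_libraries(jobs))  # libmpi.so appears in both jobs
```

The same tally, run over the full dataset, is the kind of analysis used to produce the community codes listing distributed with the data.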