- Title
Analyzing and predicting job failures from HPC system log.
- Authors
Park, Ju-Won; Huang, Xin; Lee, Chul-Ho
- Abstract
In this paper, we analyze the scheduler log of a production supercomputer that contains complete job information, in contrast to many existing (publicly available) HPC logs that contain only limited job information. We not only provide an in-depth statistical analysis of failed jobs from the scheduler log, but also demonstrate how the scheduler log, which is available in a detailed form, can be leveraged to predict job failures. For the latter, we first conduct a feature analysis based on the framework of 'weight of evidence' and 'information value' to uncover the impact of each workload attribute (feature) on the failure or success of a job, thereby enabling us to identify key features. We then conduct a comparative performance study of six data-driven machine learning models for predicting job failures in an HPC system based on the scheduler log. Our experimental results show that tree-based models exhibit superior performance in terms of both prediction accuracy and computational cost. We also demonstrate that our feature analysis improves the computational efficiency of each machine learning model without degrading its prediction performance.
- Subjects
MACHINE learning; SUPERCOMPUTERS; HIGH performance computing; SYSTEM failures; JOB analysis; OCCUPATIONAL achievement
- Publication
Journal of Supercomputing, 2024, Vol 80, Issue 1, p435
- ISSN
0920-8542
- Publication type
Article
- DOI
10.1007/s11227-023-05482-y
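
The 'weight of evidence' (WoE) and 'information value' (IV) framework named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's binning, sign convention, or zero-count handling, so everything below is an assumption: the function name `woe_iv`, the convention WoE = ln(share of successful jobs / share of failed jobs) per category, the additive smoothing constant `eps`, and the toy `queue` data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def woe_iv(values, failed, eps=0.5):
    """Return ({category: WoE}, IV) for one categorical job attribute.

    values: one category label per job (e.g. a queue name).
    failed: one bool per job, True if the job failed.
    eps:    additive smoothing count (an assumption) so that a category
            with zero failures or zero successes does not yield log(0).
    """
    fail_counts = Counter(v for v, f in zip(values, failed) if f)
    ok_counts = Counter(v for v, f in zip(values, failed) if not f)
    categories = sorted(set(values))

    # Smoothed totals, so the per-category shares still sum to 1.
    total_fail = sum(fail_counts.values()) + eps * len(categories)
    total_ok = sum(ok_counts.values()) + eps * len(categories)

    woe = {}
    iv = 0.0
    for c in categories:
        p_ok = (ok_counts[c] + eps) / total_ok        # share of successful jobs in c
        p_fail = (fail_counts[c] + eps) / total_fail  # share of failed jobs in c
        woe[c] = math.log(p_ok / p_fail)
        iv += (p_ok - p_fail) * woe[c]                # each term is non-negative
    return woe, iv


# Hypothetical toy data: jobs in the "large" queue fail more often.
queue = ["small"] * 4 + ["large"] * 4
job_failed = [False, False, False, True, True, True, True, False]

woe, iv = woe_iv(queue, job_failed)
for category, w in woe.items():
    print(f"{category}: WoE = {w:+.3f}")
print(f"IV = {iv:.3f}")
```

Under this convention, a positive WoE marks a category in which successful jobs are over-represented and a negative WoE one in which failed jobs are. An IV of zero means the attribute does not separate failed from successful jobs at all, and a larger IV suggests a more informative feature, which is how such a score could be used to rank workload attributes before model training.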