- Title
Efficient detection of silent data corruption in HPC applications with synchronization-free message verification.
- Authors
Zhang, Guozhen; Liu, Yi; Yang, Hailong; Qian, Depei
- Abstract
Nowadays, high-performance computing (HPC) is moving toward the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can have disastrous consequences for scientific computation and jeopardizes the reliability of HPC at large scale. The most commonly used methods for addressing SDC are based on modular redundancy, which usually requires keeping execution progress consistent between replicas through synchronization, along with additional message transmission and comparison during program execution. Although such methods can detect SDC with high recall, they can introduce significant performance overhead and even stall execution progress at large scale. To our knowledge, this paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an innovative asynchronous message comparison mechanism, which uses specialized service routines (Data-Analytic-Service, DAS) to perform progress comparison without interfering with the execution of the target program. In addition, our solution adopts a distributed parallel architecture to run DAS and uses an innovative reference mechanism based on a single non-deterministic event to guarantee consistent execution across replicas. We implemented a user-level prototype, termed synchronization-free SDC detection (SFSD). Experimental results on the Tianhe-2 supercomputer show that SFSD is effective in detecting SDC, with low performance overhead (within 10%) and an acceptable recall rate. Moreover, SFSD exhibits good scalability when applied to large-scale program executions.
- Subjects
DATA corruption; PARALLEL processing
- Publication
Journal of Supercomputing, 2022, Vol 78, Issue 1, p1381
- ISSN
0920-8542
- Publication type
Article
- DOI
10.1007/s11227-021-03892-4