- Title
McrEngine: A scalable checkpointing system using data-aware aggregation and compression.
- Authors
Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; Moody, Adam; de Supinski, Bronis R.; Eigenmann, Rudolf
- Abstract
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
- Subjects
HIGH performance computing research; PARALLEL file systems (Computer science); DATA compression; STATISTICAL matching; ELECTRONIC file management
- Publication
Scientific Programming, 2013, Vol 21, Issue 4, p149
- ISSN
1058-9244
- Publication type
Article
- DOI
10.1155/2013/341672
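- Illustrative sketch
The abstract contrasts data-aware aggregation with simple concatenation followed by compression. The Python sketch below is one rough way to picture that contrast and is not mcrEngine's implementation. Every name in it (`merge_by_process`, `merge_by_variable`, `compressed_size`, and the sample variables) is hypothetical. Plain dictionaries stand in for the variable metadata that a library such as HDF5 or netCDF would supply, and `zlib` stands in for whatever compressor the real system uses.

```python
import struct
import zlib


def merge_by_process(checkpoints):
    """Baseline layout: append each process's whole checkpoint in turn."""
    return b"".join(buf for ckpt in checkpoints for buf in ckpt.values())


def merge_by_variable(checkpoints):
    """Data-aware layout: put same-named variables from all processes
    next to each other, so similar data sits together before compression."""
    names = sorted({name for ckpt in checkpoints for name in ckpt})
    return b"".join(
        ckpt[name] for name in names for ckpt in checkpoints if name in ckpt
    )


def compressed_size(blob, level=6):
    """Size in bytes of the zlib-compressed blob (zlib is an assumption)."""
    return len(zlib.compress(blob, level))


if __name__ == "__main__":
    # Two hypothetical processes, each holding two named variables.
    checkpoints = [
        {
            "temperature": struct.pack("8d", *[300.0 + rank + i for i in range(8)]),
            "mesh_ids": struct.pack("8i", *range(rank * 8, rank * 8 + 8)),
        }
        for rank in range(2)
    ]
    print("per-process layout:", compressed_size(merge_by_process(checkpoints)))
    print("per-variable layout:", compressed_size(merge_by_variable(checkpoints)))
```

Both layouts hold exactly the same bytes and differ only in order. Whether the per-variable order compresses better depends on the data and the compressor, so the printed sizes are something to measure rather than assume.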