%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
%B Concurrency and Computation: Practice and Experience
%8 2013-11
%G eng
%R 10.1002/cpe.3173

%0 Generic
%D 2012
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%B University of Tennessee Computer Science Technical Report (also LAWN 269)
%8 2012-06
%G eng