%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Checkpointing Strategies for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X Input/output (I/O) from various sources often contends for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
%B International Journal of Networking and Computing
%V 9
%P 28–52
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/195

%0 Journal Article
%J Parallel Computing
%D 2019
%T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms
%A Valentin Le Fèvre
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Atsushi Hori
%A George Bosilca
%A Jack Dongarra
%B Parallel Computing
%V 85
%P 1–12
%8 2019-07
%G eng
%R https://doi.org/10.1016/j.parco.2019.02.002

%0 Journal Article
%J International Journal of Networking and Computing
%D 2015
%T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K model
%K performance evaluation
%K resilience
%X Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart.
In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
%B International Journal of Networking and Computing
%V 5
%P 2-15
%8 2015-01
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Correlated Set Coordination in Fault Tolerant Message Logging Protocols
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X With our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
%B Concurrency and Computation: Practice and Experience
%V 25
%P 572-585
%8 2013-03
%G eng
%N 4
%R 10.1002/cpe.2859

%0 Conference Proceedings
%B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award)
%D 2012
%T A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%E Christos Kaklamanis
%E Theodore Papatheodorou
%E Paul Spirakis
%B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award)
%I Springer-Verlag
%C Rhodes, Greece
%8 2012-08
%G eng

%0 Conference Proceedings
%B Proceedings of 17th International Conference, Euro-Par 2011, Part II
%D 2011
%T Correlated Set Coordination in Fault Tolerant Message Logging Protocols
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%E Emmanuel Jeannot
%E Raymond Namyst
%E Jean Roman
%K ftmpi
%B Proceedings of 17th International Conference, Euro-Par 2011, Part II
%I Springer
%C Bordeaux, France
%V 6853
%P 51-64
%8 2011-08
%G eng