%0 Journal Article
%J Computing
%D 2013
%T An evaluation of User-Level Failure Mitigation support in MPI
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A Joshua Hursey
%A George Bosilca
%A Jack Dongarra
%K Fault tolerance
%K MPI
%K User-level fault mitigation
%X As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.
%B Computing
%V 95
%P 1171-1184
%8 2013-12
%G eng
%N 12
%R 10.1007/s00607-013-0331-3

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity.
Two major problems hinder ubiquitous adoption of fault tolerance techniques: (i) traditional checkpoint-based approaches incur a steep overhead on failure free operations and (ii) the dominant programming paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support of software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI Standard, to enable advanced forward recovery techniques, without incurring the overhead of customary periodic checkpointing. With our approach, when failure strikes, applications regain control to make a checkpoint before quitting execution. This checkpoint is in reaction to the failure occurrence rather than periodic. This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example. Published 2013. This article is a US Government work and is in the public domain in the USA.
%B Concurrency and Computation: Practice and Experience
%8 2013-07
%G eng
%U http://doi.wiley.com/10.1002/cpe.3100
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3100

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2013
%T Post-failure recovery of MPI communication capability: Design and rationale
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications.
Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.
%B International Journal of High Performance Computing Applications
%V 27
%P 244-254
%8 2013-01
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238
%N 3
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342013488238

%0 Conference Proceedings
%B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award)
%D 2012
%T A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%E Christos Kaklamanis
%E Theodore Papatheodorou
%E Paul Spirakis
%B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award)
%I Springer-Verlag
%C Rhodes, Greece
%8 2012-08
%G eng

%0 Conference Proceedings
%B 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
%D 2012
%T Enabling Application Resilience With and Without the MPI Standard
%A Wesley Bland
%B 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
%C Ottawa, Canada
%8 2012-05
%G eng

%0 Conference Proceedings
%B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012
%D 2012
%T An Evaluation of User-Level Failure Mitigation Support in MPI
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A Joshua Hursey
%A George Bosilca
%A Jack Dongarra
%B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012
%I Springer
%C Vienna, Austria
%8 2012-09
%G eng

%0 Generic
%D 2012
%T Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B University of Tennessee Computer Science Technical Report
%8 2012-00
%G eng

%0 Generic
%D 2012
%T A Proposal for User-Level Failure Mitigation in the MPI-3 Standard
%A Wesley Bland
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Jack Dongarra
%K ftmpi
%B University of Tennessee Electrical Engineering and Computer Science Technical Report
%I University of Tennessee
%8 2012-02
%G eng

%0 Conference Proceedings
%B Euro-Par 2012: Parallel Processing Workshops
%D 2012
%T User Level Failure Mitigation in MPI
%A Wesley Bland
%E Ioannis Caragiannis
%E Michael Alexander
%E Rosa M. Badia
%E Mario Cannataro
%E Alexandru Costan
%E Marco Danelutto
%E Frederic Desprez
%E Bettina Krammer
%E J. Sahuquillo
%E Stephen L. Scott
%E J. Weidendorfer
%K ftmpi
%B Euro-Par 2012: Parallel Processing Workshops
%I Springer Berlin Heidelberg
%C Rhodes Island, Greece
%V 7640
%P 499-504
%8 2012-08
%G eng