%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2013
%T Post-failure recovery of MPI communication capability: Design and rationale
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.
%B International Journal of High Performance Computing Applications
%V 27
%P 244 - 254
%8 2013-01
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238
%N 3
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342013488238