User Level Failure Mitigation (ULFM) is a set of new interfaces for MPI that enables message-passing programs to restore MPI functionality affected by process failures. The MPI implementation is spared the expense of internally taking protective and corrective actions against failures. Instead, it reports the operations whose completion was rendered impossible by failures.
Using the constructs defined by ULFM, applications and libraries drive the recovery of the MPI state. Consistency issues resulting from failures are addressed according to the application’s needs, and recovery actions are limited to the MPI communication objects that actually require them. The recovery scheme is therefore more efficient than a generic, automatic recovery technique, and it achieves both goals: enabling applications to resume communication after a failure, and maintaining high communication performance outside of recovery periods.
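The division of labor described above (the library only reports operations that cannot complete, and the application decides how to recover) can be illustrated with a small, MPI-free toy model in C. This is a sketch under stated assumptions, not ULFM code: every name in it (toy_comm, toy_allreduce_count, toy_shrink, resilient_count, TOY_ERR_PROC_FAILED) is hypothetical and invented for illustration. In ULFM itself, the roughly analogous pieces are the MPIX_ERR_PROC_FAILED error class and the MPIX_Comm_shrink call.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a communicator. NOT an MPI type; hypothetical. */
#define TOY_MAX_RANKS 8

typedef struct {
    int  size;                   /* number of ranks in this communicator   */
    bool alive[TOY_MAX_RANKS];   /* alive[i] is false once rank i has died */
} toy_comm;

enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1 };

/* Create a communicator of n live ranks (n clamped to [0, TOY_MAX_RANKS]). */
toy_comm toy_comm_create(int n) {
    toy_comm c;
    if (n < 0) n = 0;
    if (n > TOY_MAX_RANKS) n = TOY_MAX_RANKS;
    c.size = n;
    for (int i = 0; i < TOY_MAX_RANKS; i++) c.alive[i] = (i < n);
    return c;
}

/* Simulate the failure of one rank. */
void toy_kill(toy_comm *c, int rank) {
    assert(rank >= 0 && rank < c->size);
    c->alive[rank] = false;
}

/* A "collective" in which every rank contributes 1.
 * The toy library takes no corrective action on its own: if any
 * participant has failed, it only reports that the operation cannot
 * complete and leaves *out untouched. */
int toy_allreduce_count(const toy_comm *c, int *out) {
    for (int i = 0; i < c->size; i++) {
        if (!c->alive[i]) return TOY_ERR_PROC_FAILED;
    }
    *out = c->size;
    return TOY_SUCCESS;
}

/* Application-driven recovery: build a new communicator that contains
 * only the surviving ranks. */
toy_comm toy_shrink(const toy_comm *c) {
    int survivors = 0;
    for (int i = 0; i < c->size; i++) {
        if (c->alive[i]) survivors++;
    }
    return toy_comm_create(survivors);
}

/* Application-level policy: attempt the operation; when a failure is
 * reported, repair only this communicator and retry. */
int resilient_count(toy_comm *c) {
    int result = 0;
    while (toy_allreduce_count(c, &result) == TOY_ERR_PROC_FAILED) {
        *c = toy_shrink(c);
    }
    return result;
}
```

In this model, a communicator of four ranks yields a count of 4. After one rank is killed, the plain collective reports TOY_ERR_PROC_FAILED, while resilient_count shrinks the communicator and returns 3. The recovery policy lives entirely in resilient_count, so a different application could choose another strategy without changing the toy library.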
Find out more at http://fault-tolerance.org/