User Level Failure Mitigation (ULFM) is a set of new interfaces for the Message Passing Interface (MPI) that enables message-passing applications to restore MPI functionality affected by process failures. The MPI implementation is spared the expense of internally taking protective and corrective automatic actions against failures. Instead, it prevents fault-related deadlocks by reporting an error for any operation whose completion a failure has made impossible.
Using the constructs defined by ULFM, applications and libraries drive the recovery of the MPI state. Consistency issues resulting from failures are addressed according to the application’s needs, and recovery actions are limited to the MPI communication objects that require them. The recovery scheme is therefore more efficient than a generic, automatic recovery technique, and it can achieve both goals: enabling applications to resume communication after a failure, and maintaining high communication performance outside of recovery periods. A wide range of applications and middleware already build on ULFM to deliver scalable and user-friendly fault tolerance.
Find out more at http://fault-tolerance.org/