Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery

Title: Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Bouteiller, A., G. Bosilca, and J. Dongarra
Conference Name: 22nd European MPI Users' Group Meeting
Date Published: 09-2015
Publisher: ACM
Conference Location: Bordeaux, France
Abstract: Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification of the Revoke MPI operation, along with an effective implementation. The purpose of the Revoke operation is the propagation of failure knowledge and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency and does not introduce system noise outside of failure recovery periods.
DOI: 10.1145/2802658.2802668