Title: Correlated Set Coordination in Fault Tolerant Message Logging Protocols
Publication Type: Journal Article
Year of Publication: 2013
Authors: Bouteiller, A., T. Herault, G. Bosilca, and J. Dongarra
Journal: Concurrency and Computation: Practice and Experience
Date Published: 03-2013

Given our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
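
The payload-logging decision described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that a "correlated set" is simply the set of ranks sharing a node, and the names `MessageLog`, `node_of`, and `record_send` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Toy sender-side log (hypothetical; not the paper's implementation)."""
    rank: int
    node_of: dict                      # assumed map: rank -> node identifier
    events: list = field(default_factory=list)
    payloads: dict = field(default_factory=dict)
    seq: int = 0

    def correlated_with(self, other: int) -> bool:
        # Assumption: two ranks are correlated when they share a node.
        return self.node_of[self.rank] == self.node_of[other]

    def record_send(self, dest: int, payload: bytes) -> int:
        self.seq += 1
        # The event is recorded for every message, so the protocol
        # remains an event logging protocol.
        self.events.append((self.seq, dest))
        # The payload is kept only when the receiver is outside the
        # sender's correlated set; correlated processes are assumed to
        # recover together from a coordinated checkpoint instead.
        if not self.correlated_with(dest):
            self.payloads[(self.seq, dest)] = payload
        return self.seq


# Usage: ranks 0 and 1 share node "A"; rank 2 runs on node "B".
log = MessageLog(rank=0, node_of={0: "A", 1: "A", 2: "B"})
log.record_send(1, b"intra-node message")
log.record_send(2, b"inter-node message")
```

In this sketch both sends produce an event entry, but only the message to rank 2 has its payload retained, which mirrors the saving the abstract attributes to coordinating correlated processes.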

