Assessing the Impact of ABFT and Checkpoint Composite Strategies

Title: Assessing the Impact of ABFT and Checkpoint Composite Strategies
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Bosilca, G., A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra
Conference Name: 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
Date Published: 05-2014
Publisher: IEEE
Conference Location: Phoenix, AZ
Keywords: ABFT, checkpoint, fault-tolerance, High-performance computing, resilience
Abstract

Algorithm-specific fault-tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With advances in the theoretical and practical understanding of the algorithmic traits that enable such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been shown to possess these properties. These algorithms provide a temporal section of the execution during which the data is protected by its own intrinsic properties and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart to protect each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing the means to reduce the frequency of checkpoints while simultaneously decreasing the volume of data that must be checkpointed.
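
The intuition behind the composite approach can be illustrated with a rough back-of-the-envelope sketch. This is not the model or the simulator from the paper. It is a toy estimate that assumes Young's first-order approximation for periodic checkpointing (downtime and recovery costs ignored) and a simple time-weighted average for the composite case. The function names, the parameters `alpha`, `abft_overhead`, and `data_fraction`, and the sample values are all hypothetical.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint period."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoint_waste(checkpoint_cost, mtbf):
    """First-order fraction of time lost when checkpointing at Young's period.

    Counts the time spent writing checkpoints plus the expected rework
    after a failure. Downtime and recovery costs are ignored.
    """
    period = young_period(checkpoint_cost, mtbf)
    return checkpoint_cost / period + period / (2.0 * mtbf)


def composite_waste(alpha, abft_overhead, checkpoint_cost, mtbf, data_fraction):
    """Toy estimate of the waste for a composite strategy (an assumption).

    alpha          fraction of time spent in ABFT-protected library calls
    abft_overhead  fraction of time lost to ABFT inside those calls
    data_fraction  share of the data still checkpointed outside them
    """
    general = checkpoint_waste(checkpoint_cost * data_fraction, mtbf)
    return alpha * abft_overhead + (1.0 - alpha) * general


if __name__ == "__main__":
    # Hypothetical values: 60 s checkpoint, 2 h platform MTBF.
    pure = checkpoint_waste(checkpoint_cost=60.0, mtbf=7200.0)
    mixed = composite_waste(alpha=0.8, abft_overhead=0.03,
                            checkpoint_cost=60.0, mtbf=7200.0,
                            data_fraction=0.5)
    print(f"checkpoint/restart only: {pure:.3f}")
    print(f"composite (toy model):   {mixed:.3f}")
```

Under these assumed values the composite estimate comes out lower than the checkpoint-only estimate, because checkpoints are taken only outside the ABFT-protected phase and each one covers less data. The actual gains reported in the paper depend on its own model and simulation parameters.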