%0 Journal Article
%J Parallel Computing
%D 2018
%T Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors
%A Henri Casanova
%A Julien Herrmann
%A Yves Robert
%K Expected makespan
%K fault-tolerance
%K scheduling
%K Scientific workflows
%K silent errors
%K Task graphs
%X Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many domains, including popular scientific workflows. DAG scheduling has thus received an enormous amount of attention. Many of the popular DAG scheduling heuristics make scheduling deci- sions based on path lengths. At large scale compute platforms are subject to various types of failures with non-negligible probabilities of occurrence. Failures that have recently re- ceived increased attention are “silent errors,” which cause data corruption. Tolerating silent errors is done by checking the validity of computed results and possibly re-executing a task from scratch. The execution time of a task then becomes a random variable, and so do path lengths in a DAG. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally dif- ficult problem. Consequently, designing effective scheduling heuristics in this context is challenging. In this work, we propose an algorithm that computes a first order approxi- mation of the expected makespan of a DAG when tasks are subject to silent errors. We find that our proposed approximation outperforms previously proposed approaches both in terms of approximation error and of speed.
%B Parallel Computing
%V 75
%P 41–60
%8 07-2018
%G eng
%R https://doi.org/10.1016/j.parco.2018.03.004
%0 Conference Proceedings
%B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
%D 2016
%T Failure Detection and Propagation in HPC Systems
%A George Bosilca
%A Aurelien Bouteiller
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Pierre Sens
%A Jack Dongarra
%K failure detection
%K fault-tolerance
%K MPI
%B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
%I IEEE Press
%C Salt Lake City, Utah
%P 27:1-27:11
%8 11-2016
%@ 978-1-4673-8815-3
%G eng
%U http://dl.acm.org/citation.cfm?id=3014904.3014941
%0 Journal Article
%J International Journal of Networking and Computing
%D 2016
%T Scheduling Computational Workflows on Failure-prone Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Henri Casanova
%A Yves Robert
%K checkpointing
%K fault-tolerance
%K reliability
%K scheduling
%K workflow
%X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomialtime optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
%B International Journal of Networking and Computing
%V 6
%P 2-26
%G eng
%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2015
%T Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Peng Du
%A Jack Dongarra
%E Phillip B. Gibbons
%K ABFT
%K algorithms
%K fault-tolerance
%K High Performance Computing
%K linear algebra
%X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorizations algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We will present a generic solution for protecting the right factor, where the updates are applied, of all above mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features high degree of checkpointing parallelism and cooperatively utilizes the checksum storage leftover from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution is applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely to the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with- and without-errors. Applicability to tolerate multiple failures and accuracy after multiple recovery is also considered.
%B ACM Transactions on Parallel Computing
%V 1
%P 10:1-10:28
%8 01-2015
%G eng
%N 2
%R 10.1145/2686892
%0 Journal Article
%J International Journal of Networking and Computing
%D 2015
%T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K model
%K performance evaluation
%K resilience
%X Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol, that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT- aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic program- ming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
%B International Journal of Networking and Computing
%V 5
%P 2-15
%8 01-2015
%G eng
%0 Conference Paper
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
%D 2014
%T Assessing the Impact of ABFT and Checkpoint Composite Strategies
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K resilience
%X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by it’s own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol, that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFTunaware sections. We highlight this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data needed to be checkpointed.
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Generic
%D 2013
%T Assessing the impact of ABFT and Checkpoint composite strategies
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K resilience
%X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by it’s own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol, that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data needed to be checkpointed.
%B University of Tennessee Computer Science Technical Report
%G eng
%0 Generic
%D 2013
%T Revisiting the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%K checkpoint algorithm
%K communication overlap
%K fault-tolerance
%K performance model
%K resilience
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach base upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé, with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model that they have proposed to assess the impact of the overhead associated to non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B University of Tennessee Computer Science Technical Report (LAWN 274)
%8 01-2013
%G eng