%0 Conference Proceedings %B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing %D 2022 %T Checkpointing à la Young/Daly: An Overview %A Anne Benoit %A Yishu Du %A Thomas Herault %A Loris Marchal %A Guillaume Pallez %A Lucas Perotin %A Yves Robert %A Hongyang Sun %A Frederic Vivien %X The Young/Daly formula provides an approximation of the optimal checkpoint period for a parallel application executing on a supercomputing platform. The Young/Daly formula was originally designed for preemptible tightly-coupled applications. We provide some background and survey various application scenarios to assess the usefulness and limitations of the formula. %B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing %I ACM Press %C Noida, India %P 701-710 %8 2022-08 %@ 9781450396752 %G eng %U https://dl.acm.org/doi/fullHtml/10.1145/3549206.3549328 %R 10.1145/3549206.3549328 %0 Journal Article %J Int. J. of Networking and Computing %D 2021 %T Resilient scheduling heuristics for rigid parallel jobs %A Anne Benoit %A Valentin Le Fèvre %A Padma Raghavan %A Yves Robert %A Hongyang Sun %B Int. J.
of Networking and Computing %V 11 %P 2-26 %G eng %0 Conference Paper %B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020) %D 2020 %T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs %A Anne Benoit %A Valentin Le Fèvre %A Padma Raghavan %A Yves Robert %A Hongyang Sun %B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020) %I IEEE Computer Society Press %C New Orleans, LA %8 2020-05 %G eng %0 Conference Paper %B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020) %D 2020 %T Reservation and Checkpointing Strategies for Stochastic Jobs %A Ana Gainaru %A Brice Goglin %A Valentin Honoré %A Guillaume Pallez %A Padma Raghavan %A Yves Robert %A Hongyang Sun %B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020) %I IEEE Computer Society Press %C New Orleans, LA %8 2020-05 %G eng %0 Conference Paper %B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019) %D 2019 %T Reservation Strategies for Stochastic Jobs %A Guillaume Aupy %A Ana Gainaru %A Valentin Honoré %A Padma Raghavan %A Yves Robert %A Hongyang Sun %B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019) %I IEEE Computer Society Press %C Rio de Janeiro, Brazil %8 2019-05 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2018 %T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing %A Anne Benoit %A Aurelien Cavelan %A Franck Cappello %A Padma Raghavan %A Yves Robert %A Hongyang Sun %K checkpointing %K fail-stop errors %K Fault tolerance %K High-performance computing %K Replication %K silent errors %X This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms.
Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detection mechanism. To detect silent errors, many application-specific techniques are available, either based on algorithms (e.g., ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but enables to tolerate more errors and to even correct some errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes with two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as well as when fail-stop errors have struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that nicely corroborates the analytical model. 
%B Journal of Parallel and Distributed Computing %V 122 %P 209–225 %8 2018-12 %G eng %R 10.1016/j.jpdc.2018.08.002 %0 Journal Article %J Journal of Computational Science %D 2018 %T Multi-Level Checkpointing and Silent Error Detection for Linear Workflows %A Anne Benoit %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %X We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm. %B Journal of Computational Science %V 28 %P 398–415 %8 2018-09 %G eng %0 Conference Paper %B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale %D 2017 %T Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale %A Anne Benoit %A Franck Cappello %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %X This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors.
Replication is combined with checkpointing and comes with two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborates the analytical model. 
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale %I ACM %C Washington, DC %8 2017-06 %G eng %R 10.1145/3086157.3086162 %0 Journal Article %J IEEE Transactions on Computers %D 2017 %T Towards Optimal Multi-Level Checkpointing %A Anne Benoit %A Aurelien Cavelan %A Valentin Le Fèvre %A Yves Robert %A Hongyang Sun %K checkpointing %K Dynamic programming %K Error analysis %K Heuristic algorithms %K Optimized production technology %K protocols %K Shape %B IEEE Transactions on Computers %V 66 %P 1212–1226 %8 2017-07 %G eng %N 7 %R 10.1109/TC.2016.2643660 %0 Journal Article %J ACM Transactions on Parallel Computing %D 2016 %T Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors %A Anne Benoit %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %K checkpoint %K fail-stop error %K failure %K HPC %K resilience %K silent data corruption %K silent error %K verification %X In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. 
Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). We determine in this scenario the optimal checkpointing and verification locations, as well as the optimal speed pairs. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds. %B ACM Transactions on Parallel Computing %8 2016-08 %G eng %R 10.1145/2897189 %0 Conference Paper %B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2016 %T Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors %A Anne Benoit %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %K fail-stop errors %K multilevel checkpoint %K optimal pattern %K resilience %K silent errors %K verification %X This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. 
Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads. %B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Chicago, IL %8 2016-05 %G eng %R 10.1109/IPDPS.2016.39