%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2016
%T Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K checkpoint
%K fail-stop error
%K failure
%K HPC
%K resilience
%K silent data corruption
%K silent error
%K verification
%X In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). We determine in this scenario the optimal checkpointing and verification locations, as well as the optimal speed pairs. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
%B ACM Transactions on Parallel Computing
%8 2016-08
%G eng
%R 10.1145/2897189
%0 Journal Article
%J International Journal on High Performance Computing Applications
%D 2015
%T Efficient Checkpoint/Verification Patterns
%A Anne Benoit
%A Saurabh K. Raina
%A Yves Robert
%K checkpointing
%K Fault tolerance
%K High Performance Computing
%K silent data corruption
%K silent error
%K verification
%X Errors have become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large in front of the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always perform a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B International Journal on High Performance Computing Applications
%8 2015-07
%G eng
%R 10.1177/1094342015594531