%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Franck Cappello
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K fail-stop errors
%K Fault tolerance
%K High-performance computing
%K Replication
%K silent errors
%X This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms. Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detection mechanism. To detect silent errors, many application-specific techniques are available, either based on algorithms (e.g., ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but enables to tolerate more errors and to even correct some errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes with two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as well as when fail-stop errors have struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that nicely corroborates the analytical model.
%B Journal of Parallel and Distributed Computing
%V 122
%P 209–225
%8 12-2018
%G eng
%R https://doi.org/10.1016/j.jpdc.2018.08.002
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2018
%T Co-Scheduling Amdhal Applications on Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Sicheng Dai
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K cache partitioning
%K co-scheduling
%K complexity results
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are as follows: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B International Journal of High Performance Computing Applications
%V 32
%P 123–138
%8 01-2018
%G eng
%N 1
%R https://doi.org/10.1177/1094342017710806
%0 Conference Paper
%B Cluster 2018
%D 2018
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%B Cluster 2018
%I IEEE Computer Society Press
%C Belfast, UK
%8 09-2018
%G eng
%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Multi-Level Checkpointing and Silent Error Detection for Linear Workflows
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%X We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.
%B Journal of Computational Science
%V 28
%P 398–415
%8 09-2018
%G eng
%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures
%A Anne Benoit
%A Swann Perarnau
%A Loïc Pottier
%A Yves Robert
%X This work presents a realistic performance model to execute scientific workflows on high-bandwidth memory architectures such as the Intel Knights Landing. We provide a detailed analysis of the execution time on such platforms, taking into account transfers from both fast and slow memory and their overlap with computations. We discuss several scheduling and mapping strategies: not only tasks must be assigned to computing resource, but also one has to decide which fraction of input and output data will reside in fast memory, and which will have to stay in slow memory. Extensive simulations allow us to assess the impact of the mapping strategies on performance. We also conduct actual experiments for a simple 1D Gauss-Seidel kernel, which assess the accuracy of the model and further demonstrate the importance of a tuned memory management. Altogether, our model and results lay the foundations for further studies and experiments on dual-memory systems.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 08-2018
%G eng
%0 Conference Paper
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%D 2017
%T Co-Scheduling Algorithms for Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K Computational modeling
%K Degradation
%K Interference
%K Mathematical model
%K Program processors
%K Supercomputers
%K Throughput
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%I IEEE Computer Society Press
%C Orlando, FL
%8 05-2017
%G eng
%R 10.1109/IPDPSW.2017.60
%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale
%A Anne Benoit
%A Franck Cappello
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%X This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes with two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborates the analytical model.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I ACM
%C Washington, DC
%8 06-2017
%G eng
%R 10.1145/3086157.3086162
%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Optimal Checkpointing Period with replicated execution on heterogeneous platforms
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%X In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size~W for a periodic checkpointing strategy where both platforms concurrently try and execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: in average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I IEEE Computer Society Press
%C Washington, DC
%8 06-2017
%G eng
%R 10.1145/3086157.3086165
%0 Journal Article
%J International Journal of High Performance Computing Applications (IJHPCA)
%D 2017
%T Resilient Co-Scheduling of Malleable Applications
%A Anne Benoit
%A Loïc Pottier
%A Yves Robert
%K co-scheduling
%K complexity results
%K heuristics
%K Redistribution
%K resilience
%K simulations
%X Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications and significantly degrade performance. In this article, we aim at minimizing the expected completion time of a set of co-scheduled applications. We propose to redistribute the resources assigned to each application upon the occurrence of failures, and upon the completion of some applications, in order to achieve this goal. First, we introduce a formal model and establish complexity results. The problem is NP-complete for malleable applications, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.
%B International Journal of High Performance Computing Applications (IJHPCA)
%8 05-2017
%G eng
%R 10.1177/1094342017704979
%0 Journal Article
%J IEEE Transactions on Computers
%D 2017
%T Towards Optimal Multi-Level Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K Dynamic programming
%K Error analysis
%K Heuristic algorithms
%K Optimized production technology
%K protocols
%K Shape
%B IEEE Transactions on Computers
%V 66
%P 1212–1226
%8 07-2017
%G eng
%N 7
%R 10.1109/TC.2016.2643660
%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2016
%T Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K checkpoint
%K fail-stop error
%K failure
%K HPC
%K resilience
%K silent data corruption
%K silent error
%K verification
%X In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). We determine in this scenario the optimal checkpointing and verification locations, as well as the optimal speed pairs. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
%B ACM Transactions on Parallel Computing
%8 08-2016
%G eng
%R 10.1145/2897189
%0 Conference Paper
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2016
%T Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K fail-stop errors
%K multilevel checkpoint
%K optimal pattern
%K resilience
%K silent errors
%K verification
%X This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 05-2016
%G eng
%R 10.1109/IPDPS.2016.39
%0 Journal Article
%J International Journal of Networking and Computing
%D 2016
%T Scheduling Computational Workflows on Failure-prone Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Henri Casanova
%A Yves Robert
%K checkpointing
%K fault-tolerance
%K reliability
%K scheduling
%K workflow
%X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomialtime optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
%B International Journal of Networking and Computing
%V 6
%P 2-26
%G eng
%0 Journal Article
%J International Journal on High Performance Computing Applications
%D 2015
%T Efficient Checkpoint/Verification Patterns
%A Anne Benoit
%A Saurabh K. Raina
%A Yves Robert
%K checkpointing
%K Fault tolerance
%K High Performance Computing
%K silent data corruption
%K silent error
%K verification
%X Errors have become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large in front of the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always perform a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B International Journal on High Performance Computing Applications
%8 07-2015
%G eng
%R 10.1177/1094342015594531
%0 Generic
%D 2014
%T Efficient checkpoint/verification patterns for silent error detection
%A Anne Benoit
%A Yves Robert
%A Saurabh K. Raina
%X Resilience has become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their particularities is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large in front of the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always perform a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 05-2014
%G eng
%9 LAWN 287
%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 06-2013
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf
%0 Generic
%D 2013
%T Optimal Checkpointing Period: Time vs. Energy
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 281)
%I University of Tennessee
%8 10-2013
%G eng