%0 Journal Article
%J Int. J. of Networking and Computing
%D 2021
%T Resilient scheduling heuristics for rigid parallel jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B Int. J. of Networking and Computing
%V 11
%P 2-26
%G eng

%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Florina M. Ciorba
%A Valentin Le Fèvre
%A Yves Robert
%K checkpoint
%K fail-stop error; silent error
%K HPC
%K linear workflow
%K Replication
%X Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution.
We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques, lead to improved performance.
%B International Journal of Networking and Computing
%V 9
%P 2-27
%8 2019
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/194

%0 Journal Article
%J Parallel Computing
%D 2019
%T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms
%A Valentin Le Fèvre
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Atsushi Hori
%A George Bosilca
%A Jack Dongarra
%B Parallel Computing
%V 85
%P 1–12
%8 2019-07
%G eng
%R https://doi.org/10.1016/j.parco.2019.02.002

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%B International Journal of High Performance Computing Applications
%V 33
%P 1255-1274
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019866891

%0 Conference Paper
%B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19)
%D 2019
%T Replication is More Efficient Than You Think
%A Anne Benoit
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19)
%I ACM Press
%C Denver, CO
%8 2019-11
%G eng

%0 Generic
%D 2018
%T Distributed Termination Detection for HPC Task-Based Environments
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%A Jack Dongarra
%X This paper revisits distributed termination detection algorithms in the context of high-performance
computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng

%0 Conference Paper
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%D 2018
%T Do moldable applications perform better on failure-prone HPC platforms?
%A Valentin Le Fèvre
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Atsushi Hori
%A Yves Robert
%A Jack Dongarra
%X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels).
For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage.
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%S LNCS
%I Springer Verlag
%C Turin, Italy
%8 2018-08
%G eng

%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%X This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CkptAll and CkptNone, for a wide variety of workflows.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng

%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Optimal Checkpointing Period with replicated execution on heterogeneous platforms
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%X In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try to execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I IEEE Computer Society Press
%C Washington, DC
%8 2017-06
%G eng
%R 10.1145/3086157.3086165

%0 Journal Article
%J IEEE Transactions on Computers
%D 2017
%T Towards Optimal Multi-Level Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K Dynamic programming
%K Error analysis
%K Heuristic algorithms
%K Optimized production technology
%K protocols
%K Shape
%B IEEE Transactions on Computers
%V 66
%P 1212–1226
%8 2017-07
%G eng
%N 7
%R 10.1109/TC.2016.2643660