%0 Conference Paper
%B 49th International Conference on Parallel Processing (ICPP 2020)
%D 2020
%T Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms
%A Li Han
%A Yiqin Gao
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 49th International Conference on Parallel Processing (ICPP 2020)
%I ACM Press
%C Edmonton, AB, Canada
%G eng

%0 Conference Paper
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%D 2020
%T Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints
%A Li Han
%A Louis-Claude Canon
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%I IEEE Press
%C York, UK
%8 2020-02
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%B International Journal of High Performance Computing Applications
%V 33
%P 1255-1274
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019866891

%0 Journal Article
%J IEEE Transactions on Computers
%D 2018
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
%B IEEE Transactions on Computers
%V 67
%P 1105–1120
%8 2018-08
%G eng
%U http://ieeexplore.ieee.org/document/8279499/
%N 8

%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%X This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target failstop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, contrarily to fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is an overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrarily to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gain over both CkptAll and CkptNone, for a wide variety of workflows.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng

%0 Conference Paper
%B IEEE Cluster
%D 2017
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class  of graphs, Minimal Series-Parallel Graphs (M-SPGS). It turns out that many real-world workflow applications are naturally structured as M-SPGS. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter.
%B IEEE Cluster
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng