%0 Journal Article %J IEEE Transactions on Computers %D 2018 %T Checkpointing Workflows for Fail-Stop Errors %A Li Han %A Louis-Claude Canon %A Henri Casanova %A Yves Robert %A Frederic Vivien %K checkpoint %K fail-stop error %K resilience %K workflow %X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures. %B IEEE Transactions on Computers %V 67 %P 1105–1120 %8 2018-08 %G eng %U http://ieeexplore.ieee.org/document/8279499/ %N 8 %0 Journal Article %J Parallel Computing %D 2018 %T Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors %A Henri Casanova %A Julien Herrmann %A Yves Robert %K Expected makespan %K fault-tolerance %K scheduling %K Scientific workflows %K silent errors %K Task graphs %X Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many domains, including popular scientific workflows. DAG scheduling has thus received an enormous amount of attention. Many of the popular DAG scheduling heuristics make scheduling deci- sions based on path lengths. At large scale compute platforms are subject to various types of failures with non-negligible probabilities of occurrence. Failures that have recently re- ceived increased attention are “silent errors,” which cause data corruption. Tolerating silent errors is done by checking the validity of computed results and possibly re-executing a task from scratch. The execution time of a task then becomes a random variable, and so do path lengths in a DAG. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally dif- ficult problem. Consequently, designing effective scheduling heuristics in this context is challenging. In this work, we propose an algorithm that computes a first order approxi- mation of the expected makespan of a DAG when tasks are subject to silent errors. We find that our proposed approximation outperforms previously proposed approaches both in terms of approximation error and of speed. %B Parallel Computing %V 75 %P 41–60 %8 2018-07 %G eng %R https://doi.org/10.1016/j.parco.2018.03.004 %0 Conference Paper %B IEEE Cluster %D 2017 %T Checkpointing Workflows for Fail-Stop Errors %A Li Han %A Louis-Claude Canon %A Henri Casanova %A Yves Robert %A Frederic Vivien %X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS). It turns out that many real-world workflow applications are naturally structured as M-SPGS. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter. %B IEEE Cluster %I IEEE %C Honolulu, Hawaii %8 2017-09 %G eng %0 Journal Article %J International Journal of Networking and Computing %D 2016 %T Scheduling Computational Workflows on Failure-prone Platforms %A Guillaume Aupy %A Anne Benoit %A Henri Casanova %A Yves Robert %K checkpointing %K fault-tolerance %K reliability %K scheduling %K workflow %X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomialtime optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations. %B International Journal of Networking and Computing %V 6 %P 2-26 %G eng %0 Journal Article %J International Journal of Parallel Programming %D 2005 %T New Grid Scheduling and Rescheduling Methods in the GrADS Project %A Francine Berman %A Henri Casanova %A Andrew Chien %A Keith Cooper %A Holly Dail %A Anshuman Dasgupta %A Wei Deng %A Jack Dongarra %A Lennart Johnsson %A Ken Kennedy %A Charles Koelbel %A Bo Liu %A Xu Liu %A Anirban Mandal %A Gabriel Marin %A Mark Mazina %A John Mellor-Crummey %A Celso Mendes %A A. Olugbile %A Jignesh M. Patel %A Dan Reed %A Zhiao Shi %A Otto Sievert %A H. Xia %A Asim YarKhan %K grads %B International Journal of Parallel Programming %I Springer %V 33 %P 209-229 %8 2005-06 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2004 %T The Virtual Instrument: Support for Grid-enabled Scientific Simulations %A Henri Casanova %A Thomas Bartol %A Francine Berman %A Adam Birnbaum %A Jack Dongarra %A Mark Ellisman %A Marcio Faerman %A Erhan Gockay %A Michelle Miller %A Graziano Obertelli %A Stuart Pomerantz %A Terry Sejnowski %A Joel Stiles %A Rich Wolski %B International Journal of High Performance Computing Applications %V 18 %P 3-17 %8 2004-01 %G eng %0 Conference Proceedings %B Proceedings of the IPDPS 2003, NGS Workshop %D 2003 %T Optimizing Performance and Reliability in Distributed Computing Systems Through Wide Spectrum Storage %A James Plank %A Micah Beck %A Jack Dongarra %A Rich Wolski %A Henri Casanova %B Proceedings of the IPDPS 2003, NGS Workshop %C Nice, France %P 209 %8 2003-01 %G eng %0 Journal Article %J Resource Management in the Grid %D 2003 %T Scheduling in the Grid Application Development Software Project %A Holly Dail %A Otto Sievert %A Francine Berman %A Henri Casanova %A Asim YarKhan %A Sathish Vadhiyar %A Jack Dongarra %A Chuang Liu %A Lingyun Yang %A Dave Angulo %A Ian Foster %K grads %B Resource Management in the Grid %I Kluwer Publishers %8 2003-03 %G eng %0 Journal Article %J International Journal of Supercomputer Applications and High-Performance Computing %D 2002 %T Adaptive Scheduling for Task Farming with Grid Middleware %A Henri Casanova %A Myung Ho Kim %A James Plank %A Jack Dongarra %B International Journal of Supercomputer Applications and High-Performance Computing %V 13 %P 231-240 %8 2002-10 %G eng %0 Generic %D 2002 %T GridRPC: A Remote Procedure Call API for Grid Computing %A Keith Seymour %A Hidemoto Nakada %A Satoshi Matsuoka %A Jack Dongarra %A Craig Lee %A Henri Casanova %B ICL Technical Report %8 2002-11 %G eng %0 Journal Article %J Concurrency: Practice and Experience %D 2002 %T Innovations of the NetSolve Grid Computing System %A Dorian Arnold %A Henri Casanova %A Jack Dongarra %K netsolve %B Concurrency: Practice and Experience %V 14 %P 1457-1479 %8 2002-01 %G eng %0 Journal Article %J Parallel Computing %D 2002 %T Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B Parallel Computing %V 28 %P 1773-1788 %8 2002-08 %G eng %0 Conference Proceedings %B Proceedings of the Third International Workshop on Grid Computing %D 2002 %T Overview of GridRPC: A Remote Procedure Call API for Grid Computing %A Keith Seymour %A Hidemoto Nakada %A Satoshi Matsuoka %A Jack Dongarra %A Craig Lee %A Henri Casanova %E Manish Parashar %B Proceedings of the Third International Workshop on Grid Computing %P 274-278 %8 2002-01 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2002 %T Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments %A Henri Casanova %A Michael G. Thomason %A Jack Dongarra %B Journal of Parallel and Distributed Computing %V 98 %P 68-91 %8 2002-10 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing (submitted) %D 2002 %T The Virtual Instrument: Support for Grid-enabled Scientific Simulations %A Henri Casanova %A Thomas Bartol %A Francine Berman %A Adam Birnbaum %A Jack Dongarra %A Mark Ellisman %A Marcio Faerman %A Erhan Gockay %A Michelle Miller %A Graziano Obertelli %A Stuart Pomerantz %A Terry Sejnowski %A Joel Stiles %A Rich Wolski %B Journal of Parallel and Distributed Computing (submitted) %8 2002-10 %G eng %0 Journal Article %J submitted to SC2001 %D 2001 %T Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B submitted to SC2001 %C Denver, Colorado %8 2001-11 %G eng %0 Conference Proceedings %B 2001 High Performance Computing Symposium (HPC'01), part of the Advance Simulation Technologies Conference %D 2001 %T Network-Enabled Server Systems: Deploying Scientific Simulations on the Grid %A Henri Casanova %A Satoshi Matsuoka %A Jack Dongarra %B 2001 High Performance Computing Symposium (HPC'01), part of the Advance Simulation Technologies Conference %C Seattle, Washington %8 2001-04 %G eng %0 Journal Article %J Future Generation Computer Systems %D 1999 %T Deploying Fault-tolerance and Task Migration with NetSolve %A Henri Casanova %A James Plank %A Micah Beck %A Jack Dongarra %K netsolve %B Future Generation Computer Systems %I Elsevier %V 15 %P 745-755 %8 1999-10 %G eng %0 Journal Article %J Computer Communications %D 1999 %T Logistical Quality of Service in NetSolve %A Micah Beck %A Henri Casanova %A Jack Dongarra %A Terry Moore %A James Plank %A Francine Berman %A Rich Wolski %K netsolve %B Computer Communications %V 22 %P 1034-1044 %8 1999-01 %G eng %0 Journal Article %J SIAM Annual Meeting %D 1999 %T A Numerical Linear Algebra Problem Solving Environment Designer's Perspective (LAPACK Working Note 139) %A Antoine Petitet %A Henri Casanova %A Clint Whaley %A Jack Dongarra %A Yves Robert %B SIAM Annual Meeting %C Atlanta, GA %8 1999-05 %G eng %0 Journal Article %J Handbook on Parallel and Distributed Processing %D 1999 %T Parallel and Distributed Scientific Computing: A Numerical Linear Algebra Problem Solving Environment Designer's Perspective %A Antoine Petitet %A Henri Casanova %A Jack Dongarra %A Yves Robert %A Clint Whaley %B Handbook on Parallel and Distributed Processing %8 1999-01 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 1999 %T Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments %A Henri Casanova %A Myung Ho Kim %A James Plank %A Jack Dongarra %B Journal of Parallel and Distributed Computing %V 98 %P 68-91 %8 1999-01 %G eng