%0 Journal Article
%J Int. J. of Networking and Computing
%D 2021
%T Resilient scheduling heuristics for rigid parallel jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B Int. J. of Networking and Computing
%V 11
%P 2-26
%G eng

%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Conference Paper
%B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020)
%D 2020
%T Reservation and Checkpointing Strategies for Stochastic Jobs
%A Ana Gainaru
%A Brice Goglin
%A Valentin Honoré
%A Guillaume Pallez
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Conference Paper
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%D 2019
%T Reservation Strategies for Stochastic Jobs
%A Guillaume Aupy
%A Ana Gainaru
%A Valentin Honoré
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%I IEEE Computer Society Press
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Franck Cappello
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K fail-stop errors
%K Fault tolerance
%K High-performance computing
%K Replication
%K silent errors
%X
This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as with a mixture of silent and fail-stop errors, on large-scale platforms. Unlike fail-stop errors, which are detected immediately when they occur, silent errors require a detection mechanism. Many application-specific techniques are available to detect silent errors, based on algorithms (e.g., ABFT), invariant preservation, or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication, or more) for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors, and even to correct some of them, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint; the application also rolls back when a fail-stop error strikes. We provide a detailed analytical study of all these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size.
We also report a set of extensive simulation results that nicely corroborate the analytical model.
%B Journal of Parallel and Distributed Computing
%V 122
%P 209-225
%8 2018-12
%G eng
%R https://doi.org/10.1016/j.jpdc.2018.08.002

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2018
%T Co-Scheduling Amdahl Applications on Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Sicheng Dai
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K cache partitioning
%K co-scheduling
%K complexity results
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. The key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B International Journal of High Performance Computing Applications
%V 32
%P 123-138
%8 2018-01
%G eng
%N 1
%R https://doi.org/10.1177/1094342017710806

%0 Conference Paper
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%D 2017
%T Co-Scheduling Algorithms for Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K Computational modeling
%K Degradation
%K Interference
%K Mathematical model
%K Program processors
%K Supercomputers
%K Throughput
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. The key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%I IEEE Computer Society Press
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPSW.2017.60

%0 Conference Proceedings
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applied Mathematics and Simulation
%D 2000
%T A New Recursive Implementation of Sparse Cholesky Factorization
%A Jack Dongarra
%A Padma Raghavan
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applied Mathematics and Simulation
%C Lausanne, Switzerland
%8 2000-08
%G eng