%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%K cache partitioning
%K chip multiprocessor
%K co-scheduling
%K HPC application
%X With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units accessing a global shared memory is constantly increasing. Co-scheduling techniques are used to improve application throughput on such architectures, but sharing resources often generates critical interferences. In this article, we focus on interferences in the last-level cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) how to precisely model the behavior of these applications on the cache-partitioned platform, and (ii) how many cores and which cache fraction should be assigned to each application to maximize platform efficiency? Here, platform efficiency is defined either as maximizing global performance or as guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms.
%B International Journal of High Performance Computing Applications
%V 33
%P 1221–1239
%8 2019-11
%G eng
%N 6
%R 10.1177/1094342019846956

%0 Conference Paper
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%D 2019
%T Reservation Strategies for Stochastic Jobs
%A Guillaume Aupy
%A Ana Gainaru
%A Valentin Honoré
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%I IEEE Computer Society Press
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2018
%T Co-Scheduling Amdahl Applications on Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Sicheng Dai
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K cache partitioning
%K co-scheduling
%K complexity results
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are as follows: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones use only their smaller private caches). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B International Journal of High Performance Computing Applications
%V 32
%P 123–138
%8 2018-01
%G eng
%N 1
%R 10.1177/1094342017710806

%0 Conference Paper
%B Cluster 2018
%D 2018
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%B Cluster 2018
%I IEEE Computer Society Press
%C Belfast, UK
%8 2018-09
%G eng

%0 Book Section
%B Topics in Parallel and Distributed Computing
%D 2018
%T Scheduling for Fault-Tolerance: An Introduction
%A Guillaume Aupy
%A Yves Robert
%B Topics in Parallel and Distributed Computing
%I Springer International Publishing
%P 143–170
%@ 978-3-319-93108-1
%G eng
%R 10.1007/978-3-319-93109-8

%0 Conference Paper
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%D 2017
%T Assuming Failure Independence: Are We Right to Be Wrong?
%A Guillaume Aupy
%A Yves Robert
%A Frederic Vivien
%X This paper revisits the failure temporal independence hypothesis, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, a few logs do contain cascades, but we show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng

%0 Conference Paper
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%D 2017
%T Co-Scheduling Algorithms for Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K Computational modeling
%K Degradation
%K Interference
%K Mathematical model
%K Program processors
%K Supercomputers
%K Throughput
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones use only their smaller private caches). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%I IEEE Computer Society Press
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPSW.2017.60

%0 Journal Article
%J International Journal of Networking and Computing
%D 2016
%T Scheduling Computational Workflows on Failure-Prone Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Henri Casanova
%A Yves Robert
%K checkpointing
%K fault-tolerance
%K reliability
%K scheduling
%K workflow
%X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery are used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding, for each task, whether to checkpoint it after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete for join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, given a task execution order and a set of tasks to be checkpointed. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
%B International Journal of Networking and Computing
%V 6
%P 2–26
%G eng

%0 Generic
%D 2015
%T Scheduling for Fault-Tolerance: An Introduction
%A Guillaume Aupy
%A Yves Robert
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2015-01
%G eng

%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K high-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period that minimizes the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B University of Tennessee Computer Science Technical Report UT-CS-13-710 (also LAWN 278)
%I University of Tennessee
%8 2013-06
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf

%0 Generic
%D 2013
%T Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC
%A Guillaume Aupy
%A Mathieu Faverge
%A Yves Robert
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X This article introduces a new systolic algorithm for QR factorization and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D array and requires only local communications. The implementation uses threads at the node level and MPI for inter-node communications. The complexity of the implementation is managed with the PaRSEC software, which takes as input a parameterized dependence graph derived from the algorithm and only requires the user to decide, at a high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on the Kraken supercomputer, demonstrating that high-level programming environments such as PaRSEC provide a viable alternative for producing quality software on complex, hierarchical architectures.
%B LAWN 277
%8 2013-05
%G eng

%0 Generic
%D 2013
%T Optimal Checkpointing Period: Time vs. Energy
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 281)
%I University of Tennessee
%8 2013-10
%G eng