%0 Conference Paper
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%D 2021
%T Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure
%A Thomas Herault
%A Yves Robert
%A George Bosilca
%A Robert Harrison
%A Cannada Lewis
%A Edward Valeev
%A Jack Dongarra
%K block-sparse matrix multiplication
%K distributed-memory
%K Electronic structure
%K multi-GPU node
%K parsec
%K tensor contraction
%X Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%I IEEE
%C Portland, OR
%8 2021-05
%G eng
%U https://hal.inria.fr/hal-02970659/document
%0 Generic
%D 2021
%T Interim Report on Benchmarking FFT Libraries on High Performance Systems
%A Ayala, Alan
%A Tomov, Stanimire
%A Luszczek, Piotr
%A Cayrols, Sebastien
%A Ragghianti, Gerald
%A Dongarra, Jack
%X The Fast Fourier Transform (FFT) is used in many applications such as molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and many wireless multimedia applications. FFTs are also heavily used in ECP applications, such as EXAALT, Copa, ExaSky-HACC, ExaWind, WarpX, and many others. As these applications’ accuracy and speed depend on the performance of the FFTs, we designed an FFT benchmark to measure the performance and scalability of currently available FFT packages and present the results from a pre-Exascale platform. Our benchmarking also stresses the overall capacity of the system interconnect; thus, it may be considered an indicator of the bisection bandwidth, communication contention noise, and the software overheads in MPI collectives that are of interest to many other ECP applications and libraries. This FFT benchmarking project aims to show the strengths and weaknesses of multiple FFT libraries and to indicate what can be done to improve their performance. In particular, we believe that the benchmarking results could help design and implement a fast and robust FFT library for 2D and 3D inputs, while targeting large-scale heterogeneous systems with multicore processors and hardware accelerators that are co-designed in tandem with ECP applications. Our work involves studying and analyzing state-of-the-art FFT software, both from vendors and available as open-source codes, to better understand their performance.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2021-07
%G eng
%9 ICL Tech Report
%0 Generic
%D 2020
%T CEED ECP Milestone Report: Improve Performance and Capabilities of CEED-Enabled ECP Applications on Summit/Sierra
%A Kolev, Tzanio
%A Fischer, Paul
%A Abdelfattah, Ahmad
%A Ananthan, Shreyas
%A Barra, Valeria
%A Beams, Natalie
%A Bleile, Ryan
%A Brown, Jed
%A Carson, Robert
%A Camier, Jean-Sylvain
%A Churchfield, Matthew
%A Dobrev, Veselin
%A Dongarra, Jack
%A Dudouit, Yohann
%A Karakus, Ali
%A Kerkemeier, Stefan
%A Lan, YuHsiang
%A Medina, David
%A Merzari, Elia
%A Min, Misun
%A Parker, Scott
%A Ratnayaka, Thilina
%A Smith, Cameron
%A Sprague, Michael
%A Stitt, Thomas
%A Thompson, Jeremy
%A Tomboulides, Ananias
%A Tomov, Stanimire
%A Tomov, Vladimir
%A Vargas, Arturo
%A Warburton, Tim
%A Weiss, Kenneth
%B ECP Milestone Reports
%I Zenodo
%8 2020-05
%G eng
%U https://doi.org/10.5281/zenodo.3860804
%R https://doi.org/10.5281/zenodo.3860804
%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng
%0 Conference Paper
%B 49th International Conference on Parallel Processing (ICPP 2020)
%D 2020
%T Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms
%A Li Han
%A Yiqin Gao
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 49th International Conference on Parallel Processing (ICPP 2020)
%I ACM Press
%C Edmonton, AB, Canada
%G eng
%0 Journal Article
%J Journal of Open Source Software
%D 2020
%T Ginkgo: A High Performance Numerical Linear Algebra Library
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%X Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures, with a high level of performance portability and a focus on software sustainability. The library focuses on solving sparse linear systems and accommodates a large variety of matrix formats, state-of-the-art iterative (Krylov) solvers, and preconditioners, which make it suitable for a variety of scientific applications. Ginkgo supports many architectures such as multi-threaded CPU, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead. Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence there has been a significant amount of effort in this direction, with software libraries such as UMFPACK (Davis, 2004) and CHOLMOD (Chen, Davis, Hager, & Rajamanickam, 2008) for solving linear systems with direct methods, and PETSc (Balay et al., 2020), Trilinos (“The Trilinos Project Website,” 2020), Eigen (Guennebaud, Jacob, & others, 2010), and many more for solving linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide very efficient low-level kernels optimized for different architectures and separate these kernels from the algorithms, thereby ensuring extensibility and ease of use. Ginkgo is also a part of the xSDK effort (Bartlett et al., 2017) and available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for, and interoperability between, a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using High Performance Computing.
Within this effort, we provide interoperability with application libraries such as deal.ii (Arndt et al., 2019) and mfem (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of the features of Ginkgo.
%B Journal of Open Source Software
%V 5
%8 2020-08
%G eng
%N 52
%R https://doi.org/10.21105/joss.02260
%0 Generic
%D 2020
%T Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster)
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng
%0 Book Section
%B Fog Computing: Theory and Practice
%D 2020
%T Harnessing the Computing Continuum for Programming Our World
%A Pete Beckman
%A Jack Dongarra
%A Nicola Ferrier
%A Geoffrey Fox
%A Terry Moore
%A Dan Reed
%A Micah Beck
%X This chapter outlines a vision for how best to harness the computing continuum of interconnected sensors, actuators, instruments, and computing systems, from small numbers of very large devices to large numbers of very small devices. The hypothesis is that only via a continuum perspective can one intentionally specify desired continuum actions and effectively manage outcomes and systemic properties—adaptability and homeostasis, temporal constraints and deadlines—and elevate the discourse from device programming to intellectual goals and outcomes. Development of a framework for harnessing the computing continuum would catalyze new consumer services, business processes, social services, and scientific discovery. Realizing and implementing a continuum programming model requires balancing conflicting constraints and translating the high‐level specification into a form suitable for execution on a unifying abstract machine model. In turn, the abstract machine must implement the mapping of specification demands to end‐to‐end resources.
%B Fog Computing: Theory and Practice
%I John Wiley & Sons, Inc.
%@ 9781119551713
%G eng
%& 7
%R https://doi.org/10.1002/9781119551713.ch7
%0 Conference Paper
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%D 2020
%T Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints
%A Li Han
%A Louis-Claude Canon
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%I IEEE Press
%C York, UK
%8 2020-02
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2020
%T MAGMA Templates for Scalable Linear Algebra on Emerging Architectures
%A Mohammed Al Farhan
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Mark Gates
%A Dalal Sukkari
%A Azzam Haidar
%A Robert Rosenberg
%A Jack Dongarra
%X With the acquisition and widespread use of more resources that rely on accelerator/wide vector–based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems and the need to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
%B The International Journal of High Performance Computing Applications
%V 34
%P 645-658
%8 2020-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342020938421
%0 Conference Paper
%B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020)
%D 2020
%T Reservation and Checkpointing Strategies for Stochastic Jobs
%A Ana Gainaru
%A Brice Goglin
%A Valentin Honoré
%A Guillaume Pallez
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng
%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Revisiting Dynamic DAG Scheduling under Memory Constraints for Shared-Memory Platforms
%A Gabriel Bathie
%A Loris Marchal
%A Yves Robert
%A Samuel Thibault
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng
%0 Conference Paper
%B 49th International Conference on Parallel Processing (ICPP 2020)
%D 2020
%T Robustness of the Young/Daly Formula for Stochastic Iterative Applications
%A Yishu Du
%A Loris Marchal
%A Guillaume Pallez
%A Yves Robert
%B 49th International Conference on Parallel Processing (ICPP 2020)
%I ACM Press
%C Edmonton, AB, Canada
%8 2020-08
%G eng
%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Ulrike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes
%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%D 2019
%T Approximate and Exact Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%X We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for - and exploiting the characteristics of - "pleasant" data distributions. At the same time, as SampleSelect does not operate on the actual values but only on the ranks of the elements, it is robust to the input data and can complete significantly faster for adversarial data distributions. In addition to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R 10.1109/IPDPSW.2019.00088
%0 Generic
%D 2019
%T CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Valeria Barra
%A Natalie Beams
%A Jed Brown
%A Jean-Sylvain Camier
%A Veselin Dobrev
%A Jack Dongarra
%A Yohann Dudouit
%A Paul Fischer
%A Ali Karakus
%A Stefan Kerkemeier
%A Tzanio Kolev
%A YuHsiang Lan
%A Elia Merzari
%A Misun Min
%A Aleks Obabko
%A Scott Parker
%A Thilina Ratnayaka
%A Jeremy Thompson
%A Ananias Tomboulides
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-10
%G eng
%R https://doi.org/10.5281/zenodo.3477618
%0 Generic
%D 2019
%T CEED ECP Milestone Report: Public release of CEED 2.0
%A Jed Brown
%A Ahmad Abdelfattah
%A Valeria Barra
%A Veselin Dobrev
%A Yohann Dudouit
%A Paul Fischer
%A Tzanio Kolev
%A David Medina
%A Misun Min
%A Thilina Ratnayaka
%A Cameron Smith
%A Jeremy Thompson
%A Stanimire Tomov
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-04
%G eng
%U https://doi.org/10.5281/zenodo.2641316
%R 10.5281/zenodo.2641316
%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Checkpointing Strategies for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X Input/output (I/O) operations from various sources often contend for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
%B International Journal of Networking and Computing
%V 9
%P 28–52
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/195
%0 Generic
%D 2019
%T A Collection of Presentations from the BDEC2 Workshop in Kobe, Japan
%A Rosa M. Badia
%A Micah Beck
%A François Bodin
%A Taisuke Boku
%A Franck Cappello
%A Alok Choudhary
%A Carlos Costa
%A Ewa Deelman
%A Nicola Ferrier
%A Katsuki Fujisawa
%A Kohei Fujita
%A Maria Girone
%A Geoffrey Fox
%A Shantenu Jha
%A Yoshinari Kameda
%A Christian Kniep
%A William Kramer
%A James Lin
%A Kengo Nakajima
%A Yiwei Qiu
%A Kishore Ramachandran
%A Glenn Ricart
%A Kim Serradell
%A Dan Stanzione
%A Lin Gan
%A Martin Swany
%A Christine Sweeney
%A Alex Szalay
%A Christine Kirkpatrick
%A Kenton McHenry
%A Alainna White
%A Steve Tuecke
%A Ian Foster
%A Joe Mambretti
%A William. M Tang
%A Michela Taufer
%A Miguel Vázquez
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2019-02
%G eng
%0 Generic
%D 2019
%T A Collection of White Papers from the BDEC2 Workshop in San Diego, CA
%A Ilkay Altintas
%A Kyle Marcus
%A Volkan Vural
%A Shweta Purawat
%A Daniel Crawl
%A Gabriel Antoniu
%A Alexandru Costan
%A Ovidiu Marcu
%A Prasanna Balaprakash
%A Rongqiang Cao
%A Yangang Wang
%A Franck Cappello
%A Robert Underwood
%A Sheng Di
%A Justin M. Wozniak
%A Jon C. Calhoun
%A Cong Xu
%A Antonio Lain
%A Paolo Faraboschi
%A Nic Dube
%A Dejan Milojicic
%A Balazs Gerofi
%A Maria Girone
%A Viktor Khristenko
%A Tony Hey
%A Ezra Kissel
%A Yu Liu
%A Richard Loft
%A Pekka Manninen
%A Sebastian von Alfthan
%A Takemasa Miyoshi
%A Bruno Raffin
%A Olivier Richard
%A Denis Trystram
%A Maryam Rahnemoonfar
%A Robin Murphy
%A Joel Saltz
%A Kentaro Sano
%A Rupak Roy
%A Kento Sato
%A Jian Guo
%A Jens Domke
%A Weikuan Yu
%A Takaki Hatsui
%A Yasumasa Joti
%A Alex Szalay
%A William M. Tang
%A Michael R. Wyatt II
%A Michela Taufer
%A Todd Gamblin
%A Stephen Herbein
%A Adam Moody
%A Dong H. Ahn
%A Rich Wolski
%A Chandra Krintz
%A Fatih Bakir
%A Wei-tsung Lin
%A Gareth George
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-10
%G eng
%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Florina M. Ciorba
%A Valentin Le Fèvre
%A Yves Robert
%K checkpoint
%K fail-stop error
%K silent error
%K HPC
%K linear workflow
%K Replication
%X Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques leads to improved performance.
%B International Journal of Networking and Computing
%V 9
%P 2-27
%8 2019
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/194
%0 Journal Article
%J Parallel Computing
%D 2019
%T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms
%A Valentin Le Fèvre
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Atsushi Hori
%A George Bosilca
%A Jack Dongarra
%B Parallel Computing
%V 85
%P 1–12
%8 2019-07
%G eng
%R https://doi.org/10.1016/j.parco.2019.02.002
%0 Journal Article
%J Algorithmica
%D 2019
%T Computing Dense Tensor Decompositions with Optimal Dimension Trees
%A Oguz Kaya
%A Yves Robert
%K CP decomposition
%K Dimension tree
%K Tensor computations
%K Tucker decomposition
%B Algorithmica
%V 81
%P 2092–2121
%8 2019-05
%G eng
%N 5
%R https://doi.org/10.1007/s00453-018-0525-3
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%K cache partitioning
%K chip multiprocessor
%K co-scheduling
%K HPC application
%X With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units accessing a global shared memory is constantly increasing. Co-scheduling techniques are used to improve application throughput on such architectures, but sharing resources often generates critical interferences. In this article, we focus on the interferences in the last level of cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) How to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) how many cores and cache fractions should be assigned to each application to maximize the platform efficiency? Here, platform efficiency is defined as maximizing the performance either globally, or as guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms.
%B International Journal of High Performance Computing Applications
%V 33
%P 1221-1239
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019846956
%0 Conference Paper
%B 11th International Workshop on Parallel Tools for High Performance Computing
%D 2019
%T Counter Inspection Toolkit: Making Sense out of Hardware Performance Events
%A Anthony Danalis
%A Heike Jagode
%A H Hanumantharayappa
%A Sangamesh Ragate
%A Jack Dongarra
%X Hardware counters play an essential role in understanding the behavior of performance-critical applications, and inform any effort to identify opportunities for performance optimization. However, because modern hardware is becoming increasingly complex, the number of counters that are offered by the vendors increases and, in some cases, so does their complexity. In this paper we present a toolkit that aims to assist application developers invested in performance analysis by automatically categorizing and disambiguating performance counters. We present and discuss the set of microbenchmarks and analyses that we developed as part of our toolkit. We explain why they work and discuss the non-obvious reasons why some of our early benchmarks and analyses did not work in an effort to share with the rest of the community the wisdom we acquired from negative results.
%B 11th International Workshop on Parallel Tools for High Performance Computing
%I Cham, Switzerland: Springer
%C Dresden, Germany
%8 2019-02
%G eng
%R https://doi.org/10.1007/978-3-030-11987-4_2
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%B International Journal of High Performance Computing Applications
%V 33
%P 1255-1274
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019866891
%0 Conference Paper
%B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2019
%T Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC
%A Thomas Herault
%A Yves Robert
%A George Bosilca
%A Jack Dongarra
%B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I IEEE
%C Denver, CO
%8 2019-11
%G eng
%0 Journal Article
%J Parallel Computing
%D 2019
%T Parallel Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%K approximate selection
%K gpu
%K kth order statistics
%K multiselection
%K parallel selection algorithm
%X We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always leveraging the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – “pleasant” data distributions. At the same time, as the proposed SampleSelect algorithm does not operate on the actual element values but only on their ranks, it is robust to the input data and can complete significantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B Parallel Computing
%V 91
%8 2020-03
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819119301796
%! Parallel Computing
%R https://doi.org/10.1016/j.parco.2019.102588
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T ParILUT – A Parallel Threshold ILU for GPUs
%A Hartwig Anzt
%A Tobias Ribizel
%A Goran Flegar
%A Edmond Chow
%A Jack Dongarra
%X In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored towards GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations, and outperform existing multithreaded CPU implementations.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPS.2019.00033
%0 Conference Paper
%B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19)
%D 2019
%T Replication is More Efficient Than You Think
%A Anne Benoit
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19)
%I ACM Press
%C Denver, CO
%8 2019-11
%G eng
%0 Conference Paper
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%D 2019
%T Reservation Strategies for Stochastic Jobs
%A Guillaume Aupy
%A Ana Gainaru
%A Valentin Honoré
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%I IEEE Computer Society Press
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%0 Conference Paper
%B IEEE Cluster 2019
%D 2019
%T Scheduling Independent Stochastic Tasks on Heterogeneous Cloud Platforms
%A Yiqin Gao
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%B IEEE Cluster 2019
%I IEEE Computer Society Press
%C Albuquerque, New Mexico
%8 2019-09
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T Scheduling Independent Stochastic Tasks under Deadline and Budget Constraints
%A Louis-Claude Canon
%A Aurélie Kong Win Chang
%A Yves Robert
%A Frederic Vivien
%X This article discusses scheduling strategies for the problem of maximizing the expected number of tasks that can be executed on a cloud platform within a given budget and under a deadline constraint. The execution times of tasks follow independent and identically distributed probability laws. The main questions are how many processors to enroll and whether and when to interrupt tasks that have been executing for some time. We provide complexity results and an asymptotically optimal strategy for the problem instance with discrete probability distributions and without deadline. We extend the latter strategy for the general case with continuous distributions and a deadline and we design an efficient heuristic which is shown to outperform standard approaches when running simulations for a variety of useful distribution laws.
%B International Journal of High Performance Computing Applications
%V 34
%P 246-264
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1177/1094342019852135
%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. We consider extensions beyond the original BLAS standard that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics.
We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 2018-07
%G eng
%N 4
%R 10.1177/1094342018778123
%0 Conference Paper
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2018
%T Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms
%A Yves Caniou
%A Eddy Caron
%A Aurélie Kong Win Chang
%A Yves Robert
%K budget aware algorithm
%K multi criteria scheduling
%K workflow
%X This paper introduces several budget-aware algorithms to deploy scientific workflows on IaaS cloud platforms, where users can request Virtual Machines (VMs) of different types, each with specific cost and speed parameters. We use a realistic application/platform model with stochastic task weights, and VMs communicating through a datacenter. We extend two well-known algorithms, MinMin and HEFT, and make scheduling decisions based upon machine availability and available budget. During the mapping process, the budget-aware algorithms make conservative assumptions to avoid exceeding the initial budget; we further improve our results with refined versions that aim at re-scheduling some tasks onto faster VMs, thereby spending any budget fraction leftover by the first allocation. These refined variants are much more time-consuming than the former algorithms, so there is a trade-off to find in terms of scalability. We report an extensive set of simulations with workflows from the Pegasus benchmark suite. Most of the time our budget-aware algorithms succeed in achieving efficient makespans while enforcing the given budget, despite (i) the uncertainty in task weights and (ii) the heterogeneity of VMs in both cost and speed values.
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%R 10.1109/IPDPSW.2018.00014
%0 Journal Article
%J IEEE Transactions on Computers
%D 2018
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
%B IEEE Transactions on Computers
%V 67
%P 1105–1120
%8 2018-08
%G eng
%U http://ieeexplore.ieee.org/document/8279499/
%N 8
%0 Generic
%D 2018
%T A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN
%A James Ahrens
%A Christopher M. Biwer
%A Alexandru Costan
%A Gabriel Antoniu
%A Maria S. Pérez
%A Nenad Stojanovic
%A Rosa Badia
%A Oliver Beckstein
%A Geoffrey Fox
%A Shantenu Jha
%A Micah Beck
%A Terry Moore
%A Sunita Chandrasekaran
%A Carlos Costa
%A Thierry Deutsch
%A Luigi Genovese
%A Tarek El-Ghazawi
%A Ian Foster
%A Dennis Gannon
%A Toshihiro Hanawa
%A Tevfik Kosar
%A William Kramer
%A Madhav V. Marathe
%A Christopher L. Barrett
%A Takemasa Miyoshi
%A Alex Pothen
%A Ariful Azad
%A Judy Qiu
%A Bo Peng
%A Ravi Teja
%A Sahil Tyagi
%A Chathura Widanage
%A Jon Koskey
%A Maryam Rahnemoonfar
%A Umakishore Ramachandran
%A Miles Deegan
%A William Tang
%A Osamu Tatebe
%A Michela Taufer
%A Michel Cuende
%A Ewa Deelman
%A Trilce Estrada
%A Rafael Ferreira Da Silva
%A Harrel Weinstein
%A Rodrigo Vargas
%A Miwako Tsuji
%A Kevin G. Yager
%A Wanling Gao
%A Jianfeng Zhan
%A Lei Wang
%A Chunjie Luo
%A Daoyi Zheng
%A Xu Wen
%A Rui Ren
%A Chen Zheng
%A Xiwen He
%A Hainan Ye
%A Haoning Tang
%A Zheng Cao
%A Shujie Zhang
%A Jiahui Dai
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2018-11
%G eng
%0 Journal Article
%J Parallel Computing
%D 2018
%T Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors
%A Henri Casanova
%A Julien Herrmann
%A Yves Robert
%K Expected makespan
%K fault-tolerance
%K scheduling
%K Scientific workflows
%K silent errors
%K Task graphs
%X Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many domains, including popular scientific workflows. DAG scheduling has thus received an enormous amount of attention. Many of the popular DAG scheduling heuristics make scheduling decisions based on path lengths. At large scale, compute platforms are subject to various types of failures with non-negligible probabilities of occurrence. Failures that have recently received increased attention are “silent errors,” which cause data corruption. Tolerating silent errors is done by checking the validity of computed results and possibly re-executing a task from scratch. The execution time of a task then becomes a random variable, and so do path lengths in a DAG. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics in this context is challenging. In this work, we propose an algorithm that computes a first order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We find that our proposed approximation outperforms previously proposed approaches both in terms of approximation error and of speed.
%B Parallel Computing
%V 75
%P 41–60
%8 2018-07
%G eng
%R 10.1016/j.parco.2018.03.004
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Franck Cappello
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K fail-stop errors
%K Fault tolerance
%K High-performance computing
%K Replication
%K silent errors
%X This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms. Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detection mechanism. To detect silent errors, many application-specific techniques are available, either based on algorithms (e.g., ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and even to correct some errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as well as when fail-stop errors have struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size.
We also report a set of extensive simulation results that nicely corroborates the analytical model.
%B Journal of Parallel and Distributed Computing
%V 122
%P 209–225
%8 2018-12
%G eng
%R 10.1016/j.jpdc.2018.08.002
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2018
%T Co-Scheduling Amdahl Applications on Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Sicheng Dai
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K cache partitioning
%K co-scheduling
%K complexity results
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are as follows: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B International Journal of High Performance Computing Applications
%V 32
%P 123–138
%8 2018-01
%G eng
%N 1
%R 10.1177/1094342017710806
%0 Conference Paper
%B Cluster 2018
%D 2018
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%B Cluster 2018
%I IEEE Computer Society Press
%C Belfast, UK
%8 2018-09
%G eng
%0 Generic
%D 2018
%T Distributed Termination Detection for HPC Task-Based Environments
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%A Jack Dongarra
%X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng
%0 Conference Paper
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%D 2018
%T Do moldable applications perform better on failure-prone HPC platforms?
%A Valentin Le Fèvre
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Atsushi Hori
%A Yves Robert
%A Jack Dongarra
%X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage.
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%S LNCS
%I Springer Verlag
%C Turin, Italy
%8 2018-08
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T A Failure Detector for HPC Platforms
%A George Bosilca
%A Aurelien Bouteiller
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Pierre Sens
%A Jack Dongarra
%K failure detection
%K Fault tolerance
%K MPI
%X Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage uses a nonuniform variant of a reliable broadcast over a circulant graph overlay network and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan Oak Ridge National Laboratory supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm.
%B The International Journal of High Performance Computing Applications
%V 32
%P 139–158
%8 2018-01
%G eng
%N 1
%R 10.1177/1094342017711505
%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%X This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, contrary to fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is an overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CkptAll and CkptNone, for a wide variety of workflows.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng
%0 Generic
%D 2018
%T Initial Integration and Evaluation of SLATE Parallel BLAS in LATTE
%A Asim YarKhan
%A Gerald Ragghianti
%A Jack Dongarra
%A Marc Cawkwell
%A Danny Perez
%A Arthur Voter
%B Innovative Computing Laboratory Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng
%9 SLATE Working Notes
%1 08
%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Multi-Level Checkpointing and Silent Error Detection for Linear Workflows
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%X We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.
%B Journal of Computational Science
%V 28
%P 398–415
%8 2018-09
%G eng
%0 Conference Paper
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award
%D 2018
%T Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%R 10.1109/IPDPSW.2018.00127
%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures
%A Anne Benoit
%A Swann Perarnau
%A Loïc Pottier
%A Yves Robert
%X This work presents a realistic performance model to execute scientific workflows on high-bandwidth memory architectures such as the Intel Knights Landing. We provide a detailed analysis of the execution time on such platforms, taking into account transfers from both fast and slow memory and their overlap with computations. We discuss several scheduling and mapping strategies: not only must tasks be assigned to computing resources, but one also has to decide which fraction of input and output data will reside in fast memory, and which will have to stay in slow memory. Extensive simulations allow us to assess the impact of the mapping strategies on performance. We also conduct actual experiments for a simple 1D Gauss-Seidel kernel, which assess the accuracy of the model and further demonstrate the importance of a tuned memory management. Altogether, our model and results lay the foundations for further studies and experiments on dual-memory systems.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng
%0 Book Section
%B Topics in Parallel and Distributed Computing
%D 2018
%T Scheduling for Fault-Tolerance: An Introduction
%A Guillaume Aupy
%A Yves Robert
%B Topics in Parallel and Distributed Computing
%I Springer International Publishing
%P 143–170
%@ 978-3-319-93108-1
%G eng
%R 10.1007/978-3-319-93109-8
%0 Conference Paper
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%D 2017
%T Assuming failure independence: are we right to be wrong?
%A Guillaume Aupy
%A Yves Robert
%A Frederic Vivien
%X This paper revisits the failure temporal independence hypothesis which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, there are a few logs that contain cascades, but we show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2017
%T Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
%A Mathieu Faverge
%A Julien Langou
%A Yves Robert
%A Jack Dongarra
%K Algorithm design and analysis
%K Approximation algorithms
%K Kernel
%K Multicore processing
%K Shape
%K Software algorithms
%K Transforms
%X We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY-based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPS.2017.46
%0 Conference Paper
%B IEEE Cluster
%D 2017
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS). It turns out that many real-world workflow applications are naturally structured as M-SPGS. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter.
%B IEEE Cluster
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng
%0 Conference Paper
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%D 2017
%T Co-Scheduling Algorithms for Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K Computational modeling
%K Degradation
%K Interference
%K Mathematical model
%K Program processors
%K Supercomputers
%K Throughput
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%I IEEE Computer Society Press
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPSW.2017.60
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Pedro Valero-Lara
%A Mawussi Zounon
%K Batched BLAS
%K BLAS
%K High-performance computing
%K Memory management
%K Parallel processing
%K Scientific computing
%X A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications our code outperforms both Intel MKL and NVIDIA cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double-precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
%B International Conference on Computational Science (ICCS 2017)
%I Elsevier
%C Zürich, Switzerland
%8 2017-06
%G eng
%R 10.1016/j.procs.2017.05.138
%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%1 03
%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale
%A Anne Benoit
%A Franck Cappello
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%X This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes with two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborates the analytical model.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I ACM
%C Washington, DC
%8 2017-06
%G eng
%R 10.1145/3086157.3086162
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as a third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library and are attractive for adoption in other packages. This report is not intended as a static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report
%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Optimal Checkpointing Period with replicated execution on heterogeneous platforms
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%X In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try to execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I IEEE Computer Society Press
%C Washington, DC
%8 2017-06
%G eng
%R 10.1145/3086157.3086165
%0 Conference Paper
%B Euro-Par 2017
%D 2017
%T Optimized Batched Linear Algebra for Modern Architectures
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Mawussi Zounon
%X Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, 2 × 2 for example, this design exhibits two main issues: it fails to fully utilize the vector units, and it fails to exploit the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from 2 × 2 to 32 × 32. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
%B Euro-Par 2017
%I Springer
%C Santiago de Compostela, Spain
%8 2017-08
%G eng
%R 10.1007/978-3-319-64203-1_37
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Conference Paper
%B International Conference on Parallel Processing (ICPP)
%D 2017
%T Resilience for Stencil Computations with Latent Errors
%A Aiman Fang
%A Aurelien Cavelan
%A Yves Robert
%A Andrew Chien
%X Projections and measurements of error rates in near-exascale and exascale systems suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such growth makes resilience a critical concern, and may increase the incidence of errors that "escape," silently corrupting application state. Such errors can often be revealed by application software tests but with long latencies, and thus are known as latent errors. We explore how to efficiently recover from latent errors, with an approach called application-based focused recovery (ABFR). Specifically we present a case study of stencil computations, a widely useful computational structure, showing how ABFR focuses recovery effort where needed, using intelligent testing and pruning to reduce recovery effort, and enables recovery effort to be overlapped with application computation. We analyze and characterize the ABFR approach on stencils, creating a performance model parameterized by error rate and detection interval (latency). We compare projections from the model to experimental results with the Chombo stencil application, validating the model and showing that ABFR on stencils can achieve significant reductions in error recovery cost (up to 400x) and recovery latency (up to 4x). Such reductions enable efficient execution at scale with high latent error rates.
%B International Conference on Parallel Processing (ICPP)
%I IEEE Computer Society Press
%C Bristol, UK
%8 2017-08
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (IJHPCA)
%D 2017
%T Resilient Co-Scheduling of Malleable Applications
%A Anne Benoit
%A Loïc Pottier
%A Yves Robert
%K co-scheduling
%K complexity results
%K heuristics
%K Redistribution
%K resilience
%K simulations
%X Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications and significantly degrade performance. In this article, we aim at minimizing the expected completion time of a set of co-scheduled applications. We propose to redistribute the resources assigned to each application upon the occurrence of failures, and upon the completion of some applications, in order to achieve this goal. First, we introduce a formal model and establish complexity results. The problem is NP-complete for malleable applications, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.
%B International Journal of High Performance Computing Applications (IJHPCA)
%8 2017-05
%G eng
%R 10.1177/1094342017704979
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2017
%T Solving Dense Symmetric Indefinite Systems using GPUs
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%X This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent synchronizations and irregular data accesses. Until recently, there has not been any implementation of these algorithms on a hybrid CPU/GPU architecture. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive data transfer and synchronization between the CPU and GPU, or on the GPU (e.g., factorizing the matrix entirely on the GPU or in a communication-avoiding fashion). We also study the performance of the solver using iterative refinement along with the factorization without pivoting combined with a preprocessing technique based on random butterfly transformations, or with the mixed-precision algorithm where the matrix is factorized in single precision. The randomization algorithm only has a probabilistic proof of numerical stability, and for this paper we focused on the mixed-precision algorithm without pivoting. These approaches demonstrate that good performance can be obtained on the GPU by avoiding the pivoting and by using lower-precision arithmetic, respectively. As illustrated by the application in acoustics studied in this paper, in many practical cases the matrices can be factorized without pivoting. Because the componentwise backward error computed in the iterative refinement signals when the algorithm fails to obtain the desired accuracy, the user can rely on these potentially unstable but efficient algorithms in most cases and fall back to a more stable algorithm with pivoting only in the case of failure.
%B Concurrency and Computation: Practice and Experience
%V 29
%8 2017-03
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.4055/full
%N 9
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.4055
%0 Journal Article
%J IEEE Transactions on Computers
%D 2017
%T Towards Optimal Multi-Level Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K Dynamic programming
%K Error analysis
%K Heuristic algorithms
%K Optimized production technology
%K protocols
%K Shape
%B IEEE Transactions on Computers
%V 66
%P 1212–1226
%8 2017-07
%G eng
%N 7
%R 10.1109/TC.2016.2643660
%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2016
%T Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K checkpoint
%K fail-stop error
%K failure
%K HPC
%K resilience
%K silent data corruption
%K silent error
%K verification
%X In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). We determine in this scenario the optimal checkpointing and verification locations, as well as the optimal speed pairs. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
%B ACM Transactions on Parallel Computing
%8 2016-08
%G eng
%R 10.1145/2897189
%0 Journal Article
%J Parallel Computing
%D 2016
%T Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results
%A Julien Herrmann
%A George Bosilca
%A Thomas Herault
%A Loris Marchal
%A Yves Robert
%A Jack Dongarra
%K Data partition
%K linear algebra
%K parsec
%K QR factorization
%K Redistribution
%K Stencil
%X The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimize some objective for the algorithmic kernel under study (good computational balance or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the choice of a distribution minimizing the target objective is not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal while the operation remains as efficient. This paper studies the complexity of this generalized problem. We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show the NP-hardness of the problem to find the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms is conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine.
%B Parallel Computing
%V 52
%P 22-41
%8 2016-02
%G eng
%R 10.1016/j.parco.2015.09.005
%0 Book Section
%B Lecture Notes in Computer Science
%D 2016
%T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%E Roman Wyrzykowski
%E Ewa Deelman
%E Konrad Karczewski
%E Jacek Kitowski
%E Kazimierz Wiatr
%K Communication-avoiding
%K Dense symmetric indefinite factorization
%K gpu computation
%K randomization
%X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen’s algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with a preprocessing technique based on Random Butterfly Transformations. Though such transformations offer only probabilistic guarantees of numerical stability, they avoid the pivoting and achieve high performance on the GPU.
%B Lecture Notes in Computer Science
%S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I
%I Springer International Publishing
%V 9573
%P 86-95
%8 2015-09
%@ 978-3-319-32149-3
%G eng
%& Parallel Processing and Applied Mathematics
%R 10.1007/978-3-319-32149-3_9
%0 Conference Proceedings
%B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
%D 2016
%T Failure Detection and Propagation in HPC Systems
%A George Bosilca
%A Aurelien Bouteiller
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Pierre Sens
%A Jack Dongarra
%K failure detection
%K fault-tolerance
%K MPI
%B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
%I IEEE Press
%C Salt Lake City, Utah
%P 27:1-27:11
%8 2016-11
%@ 978-1-4673-8815-3
%G eng
%U http://dl.acm.org/citation.cfm?id=3014904.3014941
%0 Conference Paper
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2016
%T Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K fail-stop errors
%K multilevel checkpoint
%K optimal pattern
%K resilience
%K silent errors
%K verification
%X This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%R 10.1109/IPDPS.2016.39
%0 Journal Article
%J International Journal of Networking and Computing
%D 2016
%T Scheduling Computational Workflows on Failure-prone Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Henri Casanova
%A Yves Robert
%K checkpointing
%K fault-tolerance
%K reliability
%K scheduling
%K workflow
%X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
%B International Journal of Networking and Computing
%V 6
%P 2-26
%G eng
%0 Generic
%D 2016
%T A Standard for Batched BLAS Routines
%A Pedro Valero-Lara
%A Jack Dongarra
%A Azzam Haidar
%A Samuel D. Relton
%A Stanimire Tomov
%A Mawussi Zounon
%I 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16)
%C Paris, France
%8 2016-04
%G eng
%0 Journal Article
%J International Journal of Networking and Computing
%D 2015
%T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K model
%K performance evaluation
%K resilience
%X Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol, that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
%B International Journal of Networking and Computing
%V 5
%P 2-15
%8 2015-01
%G eng
%0 Journal Article
%J International Journal on High Performance Computing Applications
%D 2015
%T Efficient Checkpoint/Verification Patterns
%A Anne Benoit
%A Saurabh K. Raina
%A Yves Robert
%K checkpointing
%K Fault tolerance
%K High Performance Computing
%K silent data corruption
%K silent error
%K verification
%X Errors have become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large relative to the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B International Journal on High Performance Computing Applications
%8 2015-07
%G eng
%R 10.1177/1094342015594531
%0 Generic
%D 2015
%T Exascale Computing and Big Data
%A Dan Reed
%A Jack Dongarra
%X Scientific discovery and engineering innovation require unifying traditionally separated high-performance computing and big data analytics.
%B Communications of the ACM
%I ACM
%V 58
%P 56-68
%8 2015-07
%G eng
%9 Magazine Article
%R 10.1145/2699414
%0 Generic
%D 2015
%T Fault Tolerance Techniques for High-performance Computing
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%X This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de-facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption).
%B University of Tennessee Computer Science Technical Report (also LAWN 289)
%I University of Tennessee
%8 2015-05
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn289.pdf
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2015
%T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%K lu factorization
%K Numerical algorithms
%K QR factorization
%K Stability
%K Performance
%X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap, in terms of floating-point operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performance.
%B Journal of Parallel and Distributed Computing
%V 85
%P 32-46
%8 2015-11
%G eng
%R 10.1016/j.jpdc.2015.06.007
%0 Generic
%D 2015
%T Scheduling for fault-tolerance: an introduction
%A Guillaume Aupy
%A Yves Robert
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2015-01
%G eng
%0 Conference Proceedings
%B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
%D 2015
%T UCX: An Open Source Framework for HPC Network APIs and Beyond
%A P. Shamis
%A Manjunath Gorentla Venkata
%A M. Graham Lopez
%A M. B. Baker
%A O. Hernandez
%A Y. Itigin
%A M. Dubman
%A G. Shainer
%A R. L. Graham
%A L. Liss
%A Y. Shahar
%A S. Potluri
%A D. Rossetti
%A D. Becker
%A D. Poole
%A C. Lamb
%A S. Kumar
%A C. Stunkel
%A George Bosilca
%A Aurelien Bouteiller
%K application program interfaces
%K Bandwidth
%K Electronics packaging
%K Hardware
%K high throughput computing
%K highly-scalable network stack
%K HPC
%K HPC network APIs
%K I/O bound applications
%K Infiniband
%K input-output programs
%K Libraries
%K Memory management
%K message passing
%K message passing interface
%K Middleware
%K MPI
%K open source framework
%K OpenSHMEM
%K parallel programming
%K parallel programming models
%K partitioned global address space languages
%K PGAS
%K PGAS languages
%K Programming
%K protocols
%K public domain software
%K RDMA
%K system libraries
%K task-based paradigms
%K UCX
%K Unified Communication X
%X This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 us, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any publicly known network stack on this hardware.
%B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
%I IEEE
%C Santa Clara, CA, USA
%P 40-43
%8 2015-08
%@ 978-1-4673-9160-3
%G eng
%M 15573048
%R 10.1109/HOTI.2015.13
%0 Conference Paper
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%D 2015
%T Visualizing Execution Traces with Task Dependencies
%A Blake Haugen
%A Stephen Richmond
%A Jakub Kurzak
%A Chad A. Steed
%A Jack Dongarra
%X Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule them in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads. This paper presents a brief description of task-based schedulers and the performance data they produce. It then describes an interactive extension to current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler.
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Conference Paper
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
%D 2014
%T Assessing the Impact of ABFT and Checkpoint Composite Strategies
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K resilience
%X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data that needs to be checkpointed.
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Designing LU-QR Hybrid Solvers for Performance and Stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap, in terms of operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%@ 978-1-4799-3800-1
%G eng
%R 10.1109/IPDPS.2014.108
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%D 2014
%T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Sivasankaran Rajamanickam
%A Eric G. Boman
%A Mark Hoemmen
%A Michael A. Heroux
%A Stanimire Tomov
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%0 Generic
%D 2014
%T Efficient checkpoint/verification patterns for silent error detection
%A Anne Benoit
%A Yves Robert
%A Saurabh K. Raina
%X Resilience has become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large compared to the other parameters (checkpointing, verification, and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2014-05
%G eng
%9 LAWN 287
%0 Journal Article
%J International Journal of Networking and Computing
%D 2014
%T Performance and Reliability Trade-offs for the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%K communication contention
%K in-memory checkpoint
%K performance
%K resilience
%K risk
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [23] with the non-blocking algorithm of Ni, Meneses and Kalé [15] in terms of both performance and risk. We also extend the model proposed in [23, 15] to assess the impact of the overhead associated with non-blocking communications. In addition, we deal with arbitrary failure distributions (as opposed to uniform distributions in [23]). We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work without additional memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B International Journal of Networking and Computing
%V 4
%P 32-41
%8 2014
%G eng
%& 32
%0 Conference Paper
%B 2014 IEEE International Conference on Cluster Computing
%D 2014
%T Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models
%A Heike McCraw
%A James Ralph
%A Anthony Danalis
%A Jack Dongarra
%X For more than a decade, the PAPI performance-monitoring library has provided a clear, portable interface to the hardware performance counters available on all modern CPUs and other components of interest (e.g., GPUs, network, and I/O systems). Most major end-user tools that application developers use to analyze the performance of their applications rely on PAPI to gain access to these performance counters. Energy efficiency constraints have been widely identified as one of the critical roadblocks on the way to larger, more complex high performance systems. With modern extreme scale machines having hundreds of thousands of cores, the ability to reduce power consumption for each CPU at the software level becomes critically important, both for economic and environmental reasons. In order for PAPI to continue playing its well established role in HPC, it is pressing to provide valuable performance data that not only originates from within the processing cores but also delivers insight into the power consumption of the system as a whole. An extensive effort has been made to extend the Performance API to support power monitoring capabilities for various platforms. This paper provides detailed information about three components that allow power monitoring on the Intel Xeon Phi and Blue Gene/Q. Furthermore, we discuss the integration of PAPI in PaRSEC – a task-based dataflow-driven execution engine – enabling hardware performance counter and power monitoring at true task granularity.
%B 2014 IEEE International Conference on Cluster Computing
%I IEEE
%C Madrid, Spain
%8 2014-09
%G eng
%R 10.1109/CLUSTER.2014.6968672
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K CUDA
%K FEM
%K Finite element method
%K linear algebra
%K nVidia
%K Tesla K20
%X Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs [10, 1]. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor’s library routines substantially in performance. We proposed an autotuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5x overall speedup and 1.42x energy efficiency (greenup) using 4th order (Q4) finite elements, and a 1.9x speedup and 1.27x greenup using 2nd order (Q2) finite elements.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%D 2014
%T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes
%A Xavier Lacoste
%A Mathieu Faverge
%A Pierre Ramet
%A Samuel Thibault
%A George Bosilca
%K DAG based runtime
%K gpu
%K Multicore
%K Sparse linear solver
%X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces the application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers – the native PaStiX, StarPU, and PaRSEC schedulers – on different execution contexts is performed. The analysis highlights the similarities from a performance point of view between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments, and are, therefore, a sustainable solution for hybrid environments.
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Generic
%D 2013
%T Assessing the impact of ABFT and Checkpoint composite strategies
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K resilience
%X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data that needs to be checkpointed.
%B University of Tennessee Computer Science Technical Report
%G eng
%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf
%0 Generic
%D 2013
%T Designing LU-QR hybrid solvers for performance and stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 282)
%I University of Tennessee
%8 2013-10
%G eng
%0 Journal Article
%J Parallel Computing
%D 2013
%T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Mathias Jacquelin
%A Julien Langou
%A Yves Robert
%K Cluster
%K Distributed memory
%K Hierarchical architecture
%K multi-core
%K numerical linear algebra
%K QR factorization
%X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, ‘‘communication-avoiding’’), it is natural to consider hierarchical trees composed of an ‘‘inter-node’’ tree which acts on top of ‘‘intra-node’’ trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) ‘‘TS level’’ for cache-friendliness, (1) ‘‘low-level’’ for decoupled highly parallel inter-node reductions, (2) ‘‘domino level’’ to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to building up performance and (ii) build insights on how these levels influence performance and interact with each other.
Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
%B Parallel Computing
%V 39
%P 212-232
%8 2013-05
%G eng
%N 4-5
%0 Generic
%D 2013
%T Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%X The explosion of parallelism and heterogeneity in today's computer architectures has created opportunities as well as challenges for redesigning legacy numerical software to harness the power of new hardware. In this paper we address the main challenges in redesigning BLAST – a numerical library that solves the equations of compressible hydrodynamics using high order finite element methods (FEM) in a moving Lagrangian frame – to support CPU-GPU clusters. We use a hybrid MPI + OpenMP + CUDA programming model that includes two layers: domain decomposed MPI parallelization and OpenMP + CUDA acceleration in a given domain. To optimize the code, we implemented custom linear algebra kernels and introduced an auto-tuning technique to deal with heterogeneity and load balancing at runtime. Our tests show that 12 Intel Xeon cores and two M2050 GPUs deliver a 24x speedup compared to a single core, and a 2.5x speedup compared to 12 MPI tasks in one node. Further, we achieve perfect weak scaling, demonstrated on a cluster with up to 64 GPUs in 32 nodes. Our choice of programming model and proposed solutions, as related to parallelism and load balancing, specifically targets high order FEM discretizations, and can be used equally successfully for applications beyond hydrodynamics. A major accomplishment is that we further establish the appeal of high order FEMs, which despite their better approximation properties, are often avoided due to their high computational cost. GPUs, as we show, have the potential to make them the method of choice, as the increased computational cost is also localized, e.g., cast as Level 3 BLAS, and thus can be done very efficiently (close to "free" relative to the usual overheads inherent in sparse computations).
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng
%0 Generic
%D 2013
%T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC
%A Guillaume Aupy
%A Mathieu Faverge
%A Yves Robert
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph derived from the algorithm, and only requires the user to decide, at a high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
%B Lawn 277
%8 2013-05
%G eng
%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T Keeneland: Computational Science Using Heterogeneous GPU Computing
%A Jeffrey Vetter
%A Richard Glassbrook
%A Karsten Schwan
%A Sudha Yalamanchili
%A Mitch Horton
%A Ada Gavrilovska
%A Magda Slawinska
%A Jack Dongarra
%A Jeremy Meredith
%A Philip Roth
%A Kyle Spafford
%A Stanimire Tomov
%A John Wynkoop
%X The Keeneland Project is a five year Track 2D grant awarded by the National Science Foundation (NSF) under solicitation NSF 08-573 in August 2009 for the development and deployment of an innovative high performance computing system. The Keeneland project is led by the Georgia Institute of Technology (Georgia Tech) in collaboration with the University of Tennessee at Knoxville, National Institute of Computational Sciences, and Oak Ridge National Laboratory.
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%S CRC Computational Science Series
%I Taylor and Francis
%C Boca Raton, FL
%G eng
%& 7
%0 Generic
%D 2013
%T Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization
%A Aurelien Bouteiller
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, per-application completion time under uncoordinated checkpointing is unchanged, while it delivers near-perfect platform efficiency.
%B University of Tennessee Computer Science Technical Report
%8 2013-02
%G eng
%0 Conference Paper
%B Euro-Par 2013
%D 2013
%T Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization
%A Aurelien Bouteiller
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, per-application completion time under uncoordinated checkpointing is unchanged, while it delivers near-perfect platform efficiency.
%B Euro-Par 2013
%I Springer
%C Aachen, Germany
%8 2013-08
%G eng
%0 Journal Article
%J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%D 2013
%T Multithreading in the PLASMA Library
%A Jakub Kurzak
%A Piotr Luszczek
%A Asim YarKhan
%A Mathieu Faverge
%A Julien Langou
%A Henricus Bouwmeester
%A Jack Dongarra
%E Mohamed Ahmed
%E Reda Ammar
%E Sanguthevar Rajasekaran
%B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%I Taylor & Francis
%8 2013
%G eng
%0 Generic
%D 2013
%T Optimal Checkpointing Period: Time vs. Energy
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 281)
%I University of Tennessee
%8 2013-10
%G eng
%0 Generic
%D 2013
%T PAPI 5: Measuring Power, Energy, and the Cloud
%A Vincent Weaver
%A Dan Terpstra
%A Heike McCraw
%A Matt Johnson
%A Kiran Kasichayanula
%A James Ralph
%A John Nelson
%A Phil Mucci
%A Tushar Mohan
%A Shirley Moore
%I 2013 IEEE International Symposium on Performance Analysis of Systems and Software
%C Austin, TX
%8 2013-04
%G eng
%0 Generic
%D 2013
%T Revisiting the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%K checkpoint algorithm
%K communication overlap
%K fault-tolerance
%K performance model
%K resilience
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model that they have proposed to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B University of Tennessee Computer Science Technical Report (LAWN 274)
%8 2013-01
%G eng
%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium
%D 2013
%T Revisiting the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [1] with the non-blocking algorithm of Ni, Meneses and Kalé [2] in terms of both performance and risk. We also extend the model proposed in [1], [2] to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium
%C Boston, MA
%8 2013-05
%G eng
%0 Conference Paper
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%D 2013
%T Standards for Graph Algorithm Primitives
%A Tim Mattson
%A David Bader
%A Jon Berry
%A Aydin Buluc
%A Jack Dongarra
%A Christos Faloutsos
%A John Feo
%A John Gilbert
%A Joseph Gonzalez
%A Bruce Hendrickson
%A Jeremy Kepner
%A Charles Leiserson
%A Andrew Lumsdaine
%A David Padua
%A Steve W. Poole
%A Steve Reinhardt
%A Mike Stonebraker
%A Steve Wallach
%A Andrew Yoo
%K algorithms
%K graphs
%K linear algebra
%K software standards
%X It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%I IEEE
%C Waltham, MA
%8 2013-09
%G eng
%R 10.1109/HPEC.2013.6670338
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
%B Concurrency and Computation: Practice and Experience
%8 2013-11
%G eng
%R 10.1002/cpe.3173
%0 Generic
%D 2012
%T Acceleration of the BLAST Hydro Code on GPU
%A Tingxing Dong
%A Tzanio Kolev
%A Robert Rieben
%A Veselin Dobrev
%A Stanimire Tomov
%A Jack Dongarra
%B Supercomputing '12 (poster)
%I SC12
%C Salt Lake City, Utah
%8 2012-11
%G eng
%0 Conference Proceedings
%B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012
%D 2012
%T Algorithm-Based Fault Tolerance for Dense Matrix Factorization
%A Peng Du
%A Aurelien Bouteiller
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%E J. Ramanujam
%E P. Sadayappan
%K ft-la
%K ftmpi
%X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage leftover from the right-factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases with the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
%B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012
%I ACM
%C New Orleans, LA, USA
%P 225-234
%8 2012-02
%G eng
%R 10.1145/2145816.2145845
%0 Conference Proceedings
%B Proc. of the International Conference on Computational Science (ICCS)
%D 2012
%T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines
%A Marc Baboulin
%A Simplice Donfack
%A Jack Dongarra
%A Laura Grigori
%A Adrien Rémy
%A Stanimire Tomov
%K magma
%B Proc. of the International Conference on Computational Science (ICCS)
%V 9
%P 17-26
%8 2012-06
%G eng
%0 Conference Proceedings
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%D 2012
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%I IEEE Computer Society Press
%C Shanghai, China
%8 2012-05
%G eng
%0 Journal Article
%J Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%D 2012
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E Viktor K. Prasanna
%E Yves Robert
%E Per Stenström
%B Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%8 2012-00
%G eng
%0 Conference Proceedings
%B International Workshop on Power-Aware Systems and Architectures
%D 2012
%T Measuring Energy and Power with PAPI
%A Vincent M Weaver
%A Matt Johnson
%A Kiran Kasichayanula
%A James Ralph
%A Piotr Luszczek
%A Dan Terpstra
%A Shirley Moore
%K papi
%X Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure.
%B International Workshop on Power-Aware Systems and Architectures
%C Pittsburgh, PA
%8 2012-09
%G eng
%R 10.1109/ICPPW.2012.39
%0 Generic
%D 2012
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%B University of Tennessee Computer Science Technical Report (also LAWN 269)
%8 2012-06
%G eng
%0 Conference Proceedings
%B The Twentieth International Conference on Domain Decomposition Methods
%D 2011
%T Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method.
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Stephane Lanteri
%A Jean Roman
%B The Twentieth International Conference on Domain Decomposition Methods
%C La Jolla, California
%8 2011-02
%G eng
%U http://hal.inria.fr/inria-00577639
%0 Journal Article
%J TeraGrid'11
%D 2011
%T Autotuned Parallel I/O for Highly Scalable Biosequence Analysis
%A Haihang You
%A Bhanu Rekapalli
%A Qing Liu
%A Shirley Moore
%B TeraGrid'11
%C Salt Lake City, Utah
%8 2011-07
%G eng
%0 Conference Proceedings
%B Proceedings of 17th International Conference, Euro-Par 2011, Part II
%D 2011
%T Correlated Set Coordination in Fault Tolerant Message Logging Protocols
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%E Emmanuel Jeannot
%E Raymond Namyst
%E Jean Roman
%K ftmpi
%B Proceedings of 17th International Conference, Euro-Par 2011, Part II
%I Springer
%C Bordeaux, France
%V 6853
%P 51-64
%8 2011-08
%G eng
%0 Generic
%D 2011
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report (also Lawn 257)
%8 2011-10
%G eng
%0 Journal Article
%J International Journal of High Performance Computing
%D 2011
%T The International Exascale Software Project Roadmap
%A Jack Dongarra
%A Pete Beckman
%A Terry Moore
%A Patrick Aerts
%A Giovanni Aloisio
%A Jean-Claude Andre
%A David Barkai
%A Jean-Yves Berthou
%A Taisuke Boku
%A Bertrand Braunschweig
%A Franck Cappello
%A Barbara Chapman
%A Xuebin Chi
%A Alok Choudhary
%A Sudip Dosanjh
%A Thom Dunning
%A Sandro Fiore
%A Al Geist
%A Bill Gropp
%A Robert Harrison
%A Mark Hereld
%A Michael Heroux
%A Adolfy Hoisie
%A Koh Hotta
%A Zhong Jin
%A Yutaka Ishikawa
%A Fred Johnson
%A Sanjay Kale
%A Richard Kenway
%A David Keyes
%A Bill Kramer
%A Jesus Labarta
%A Alain Lichnewsky
%A Thomas Lippert
%A Bob Lucas
%A Barney MacCabe
%A Satoshi Matsuoka
%A Paul Messina
%A Peter Michielse
%A Bernd Mohr
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%A Hiroshi Nakashima
%A Michael E. Papka
%A Dan Reed
%A Mitsuhisa Sato
%A Ed Seidel
%A John Shalf
%A David Skinner
%A Marc Snir
%A Thomas Sterling
%A Rick Stevens
%A Fred Streitz
%A Bob Sugar
%A Shinji Sumimoto
%A William Tang
%A John Taylor
%A Rajeev Thakur
%A Anne Trefethen
%A Mateo Valero
%A Aad van der Steen
%A Jeffrey Vetter
%A Peg Williams
%A Robert Wisniewski
%A Kathy Yelick
%X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
%B International Journal of High Performance Computing
%V 25
%P 3-60
%8 2011-01
%G eng
%R 10.1177/1094342010391989
%0 Journal Article
%J IEEE Computing in Science & Engineering
%D 2011
%T Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community
%A Jeffrey Vetter
%A Richard Glassbrook
%A Jack Dongarra
%A Karsten Schwan
%A Bruce Loftis
%A Stephen McNally
%A Jeremy Meredith
%A James Rogers
%A Philip Roth
%A Kyle Spafford
%A Sudhakar Yalamanchili
%K Benchmark testing
%K Computational modeling
%K Computer architecture
%K Graphics processing unit
%K Hardware
%K Random access memory
%K Scientific computing
%X The Keeneland project's goal is to develop and deploy an innovative, GPU-based high-performance computing system for the NSF computational science community.
%B IEEE Computing in Science & Engineering
%V 13
%P 90-95
%8 2011-08
%G eng
%N 5
%R 10.1109/MCSE.2011.83
%0 Journal Article
%J Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April
%D 2011
%T Parallel algebraic domain decomposition solver for the solution of augmented systems.
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%B Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April
%8 2011-00
%G eng
%0 Journal Article
%J Future Generation Computer Systems
%D 2011
%T QCG-OMPI: MPI Applications on Grids.
%A Emmanuel Agullo
%A Camille Coti
%A Thomas Herault
%A Julien Langou
%A Sylvain Peyronnet
%A A. Rezmerita
%A Franck Cappello
%A Jack Dongarra
%B Future Generation Computer Systems
%V 27
%P 357-369
%8 2011-01
%G eng
%0 Generic
%D 2011
%T On Scalability for MPI Runtime Systems
%A George Bosilca
%A Thomas Herault
%A A. Rezmerita
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%C Knoxville, TN
%8 2011-05
%G eng
%0 Conference Proceedings
%B International Conference on Cluster Computing (CLUSTER)
%D 2011
%T On Scalability for MPI Runtime Systems
%A George Bosilca
%A Thomas Herault
%A A. Rezmerita
%A Jack Dongarra
%K harness
%B International Conference on Cluster Computing (CLUSTER)
%I IEEE
%C Austin, TX, USA
%P 187-195
%8 2011-09
%G eng
%0 Conference Proceedings
%B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011
%D 2011
%T Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure
%A George Bosilca
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%A A. Rezmerita
%E Yiannis Cotronis
%E Anthony Danalis
%E Dimitrios S. Nikolopoulos
%E Jack Dongarra
%K ftmpi
%B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011
%I Springer
%C Santorini, Greece
%V 6960
%P 342-344
%8 2011-09
%G eng
%0 Journal Article
%J Procedia Computer Science
%D 2011
%T User-Defined Events for Hardware Performance Monitoring
%A Shirley Moore
%A James Ralph
%K mumi
%K papi
%X PAPI is a widely used cross-platform interface to hardware performance counters. PAPI currently supports native events, which are those provided by a given platform, and preset events, which are pre-defined events thought to be common across platforms. Presets are currently mapped and defined at the time that PAPI is compiled and installed. The idea of user-defined events is to allow users to define their own metrics and to have those metrics mapped to events on a platform without the need to re-install PAPI. User-defined events can be defined in terms of native, preset, and previously defined user-defined events. The user can combine events and constants in an arbitrary expression to define a new metric and give a name to the new metric. This name can then be specified as a PAPI event in a PAPI library call the same way as native and preset events. End-user tools such as TAU and Scalasca that use PAPI can also use the user-defined metrics. Users can publish their metric definitions so that other users can use them as well. We present several examples of how user-defined events can be used for performance analysis and modeling.
%B Procedia Computer Science
%I Elsevier
%V 4
%P 2096-2104
%8 2011-05
%G eng
%R 10.1016/j.procs.2011.04.229
%0 Conference Proceedings
%B Proceedings of EuroMPI 2010
%D 2010
%T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%E Jack Dongarra
%E Michael Resch
%E Rainer Keller
%E Edgar Gabriel
%K ftmpi
%B Proceedings of EuroMPI 2010
%I Springer
%C Stuttgart, Germany
%8 2010-09
%G eng
%0 Generic
%D 2010
%T EZTrace: a generic framework for performance analysis
%A Jack Dongarra
%A Mathieu Faverge
%A Yutaka Ishikawa
%A Raymond Namyst
%A François Rue
%A Francois Trahay
%B ICL Technical Report
%8 2010-12
%G eng
%0 Journal Article
%J Sparse Days 2010 Meeting at CERFACS
%D 2010
%T MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%A Yohan Lee-Tin-Yien
%B Sparse Days 2010 Meeting at CERFACS
%C Toulouse, France
%8 2010-06
%G eng
%0 Journal Article
%J Future Generation Computer Systems
%D 2010
%T QCG-OMPI: MPI Applications on Grids
%A Emmanuel Agullo
%A Camille Coti
%A Thomas Herault
%A Julien Langou
%A Sylvain Peyronnet
%A A. Rezmerita
%A Franck Cappello
%A Jack Dongarra
%B Future Generation Computer Systems
%V 27
%P 357-369
%8 2010-03
%G eng
%0 Conference Proceedings
%B EuroMPI 2010 Proceedings
%D 2010
%T Recent Advances in the Message Passing Interface, Lecture Notes in Computer Science (LNCS)
%E Rainer Keller
%E Edgar Gabriel
%E Michael Resch
%E Jack Dongarra
%B EuroMPI 2010 Proceedings
%I Springer
%C Stuttgart, Germany
%V 6305
%8 2010-09
%G eng
%0 Generic
%D 2010
%T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaief
%A Raymond Namyst
%A Rajib Nath
%A Jean Roman
%A Samuel Thibault
%A Stanimire Tomov
%I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster
%C Knoxville, TN
%8 2010-07
%G eng
%0 Journal Article
%J PARA 2010
%D 2010
%T Towards a Complexity Analysis of Sparse Hybrid Linear Solvers
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%B PARA 2010
%C Reykjavik, Iceland
%8 2010-06
%G eng
%0 Conference Proceedings
%B SciDAC 2009, Journal of Physics: Conference Series
%D 2009
%T Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team
%A Bronis R. de Supinski
%A Sadaf Alam
%A David Bailey
%A Laura Carrington
%A Chris Daley
%A Anshu Dubey
%A Todd Gamblin
%A Dan Gunter
%A Paul D. Hovland
%A Heike Jagode
%A Karen Karavanic
%A Gabriel Marin
%A John Mellor-Crummey
%A Shirley Moore
%A Boyana Norris
%A Leonid Oliker
%A Catherine Olschanowsky
%A Philip C. Roth
%A Martin Schulz
%A Sameer Shende
%A Allan Snavely
%K test
%B SciDAC 2009, Journal of Physics: Conference Series
%I IOP Publishing
%C San Diego, California
%V 180(2009)012039
%8 2009-07
%G eng
%0 Conference Paper
%B CLUSTER '09
%D 2009
%T Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery
%A Aurelien Bouteiller
%A Thomas Ropars
%A George Bosilca
%A Christine Morin
%A Jack Dongarra
%K fault tolerant computing
%K libraries message passing
%K parallel machines
%K protocols
%X With the growing scale of high performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proved to tolerate a higher failure rate. However, this advantage comes at the expense of a higher overhead on communications, due to latency-intrusive logging of events to a stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable.
%B CLUSTER '09
%I IEEE
%C New Orleans
%8 2009-08
%G eng
%R 10.1109/CLUSTR.2009.5289157
%0 Journal Article
%J Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting
%D 2009
%T Towards Efficient MapReduce Using MPI
%A Torsten Hoefler
%A Yuan-Shun Dai
%A Jack Dongarra
%E M. Ropo
%E J Westerholm
%E Jack Dongarra
%B Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting
%I Springer Berlin / Heidelberg
%C Espoo, Finland
%V 5759
%P 240-249
%8 2009-00
%G eng
%0 Conference Proceedings
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%D 2009
%T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
%A Lavanya Ramakrishnan
%A Daniel Nurmi
%A Anirban Mandal
%A Charles Koelbel
%A Dennis Gannon
%A Mark Huang
%A Yang-Suk Kee
%A Graziano Obertelli
%A Kiran Thyagaraja
%A Rich Wolski
%A Asim YarKhan
%A Dmitrii Zagorodnov
%K grads
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%C Portland, OR
%8 2009-00
%G eng
%0 Conference Proceedings
%B Proceedings of the DoD HPCMP User Group Conference
%D 2008
%T Exploring New Architectures in Accelerating CFD for Air Force Applications
%A Jack Dongarra
%A Shirley Moore
%A Gregory D. Peterson
%A Stanimire Tomov
%A Jeff Allred
%A Vincent Natoli
%A David Richie
%K magma
%B Proceedings of the DoD HPCMP User Group Conference
%C Seattle, Washington
%8 2008-01
%G eng
%0 Journal Article
%J Computing and Informatics
%D 2008
%T Interactive Grid-Access Using Gridsolve and Giggle
%A Marcus Hardt
%A Keith Seymour
%A Jack Dongarra
%A Michael Zapf
%A Nicole Ruiter
%K netsolve
%B Computing and Informatics
%V 27
%P 233-248, ISSN 1335-9150
%8 2008-00
%G eng
%0 Conference Proceedings
%B 2008 PPoPP Conference
%D 2008
%T Matrix Product on Heterogeneous Master Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Frederic Vivien
%B 2008 PPoPP Conference
%C Salt Lake City, Utah
%8 2008-01
%G eng
%0 Journal Article
%J International Journal of Foundations of Computer Science (IJFCS)
%D 2008
%T Revisiting Matrix Product on Master-Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Zhiao Shi
%A Frederic Vivien
%B International Journal of Foundations of Computer Science (IJFCS)
%V 19
%P 1317-1336
%8 2008-12
%G eng
%0 Conference Proceedings
%B Proceedings of the 2nd International Workshop on Tools for High Performance Computing
%D 2008
%T Usage of the Scalasca Toolset for Scalable Performance Analysis of Large-scale Parallel Applications
%A Felix Wolf
%A Brian Wylie
%A Erika Abraham
%A Wolfgang Frings
%A Karl Fürlinger
%A Markus Geimer
%A Marc-Andre Hermanns
%A Bernd Mohr
%A Shirley Moore
%A Matthias Pfeifer
%E Michael Resch
%E Rainer Keller
%E Valentin Himmler
%E Bettina Krammer
%E A Schulz
%K point
%B Proceedings of the 2nd International Workshop on Tools for High Performance Computing
%I Springer
%C Stuttgart, Germany
%P 157-167
%8 2008-01
%G eng
%0 Book Section
%B Distributed and Parallel Systems
%D 2007
%T A New Approach to MPI Collective Communication Implementations
%A Torsten Hoefler
%A Jeffrey M. Squyres
%A Graham Fagg
%A George Bosilca
%A Wolfgang Rehm
%A Andrew Lumsdaine
%K Automatic Selection
%K Collective Operation
%K Framework
%K Message Passing (MPI)
%K Open MPI
%X Recent research into the optimization of collective MPI operations has resulted in a wide variety of algorithms and corresponding implementations, each typically only applicable in a relatively narrow scope: on a specific architecture, on a specific network, with a specific number of processes, with a specific data size and/or data-type – or any combination of these (or other) factors. This situation presents an enormous challenge to portable MPI implementations which are expected to provide optimized collective operation performance on all platforms. Many portable implementations have attempted to provide a token number of algorithms that are intended to realize good performance on most systems. However, many platform configurations are still left without well-tuned collective operations. This paper presents a proposal for a framework that will allow a wide variety of collective algorithm implementations and a flexible, multi-tiered selection process for choosing which implementation to use when an application invokes an MPI collective function.
%B Distributed and Parallel Systems
%I Springer US
%P 45-54
%@ 978-0-387-69857-1
%G eng
%R 10.1007/978-0-387-69858-8_5
%0 Journal Article
%J International Journal of Foundations of Computer Science (IJFCS) (accepted)
%D 2007
%T Revisiting Matrix Product on Master-Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Zhiao Shi
%A Frederic Vivien
%B International Journal of Foundations of Computer Science (IJFCS) (accepted)
%8 2007-00
%G eng
%0 Conference Proceedings
%B SC06 Conference Tutorial
%D 2006
%T The HPC Challenge (HPCC) Benchmark Suite
%A Piotr Luszczek
%A David Bailey
%A Jack Dongarra
%A Jeremy Kepner
%A Robert Lucas
%A Rolf Rabenseifner
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%B SC06 Conference Tutorial
%I IEEE
%C Tampa, Florida
%8 2006-11
%G eng
%0 Journal Article
%J Euro PVM/MPI 2006
%D 2006
%T Implementation and Usage of the PERUSE-Interface in Open MPI
%A Rainer Keller
%A George Bosilca
%A Graham Fagg
%A Michael Resch
%A Jack Dongarra
%B Euro PVM/MPI 2006
%C Bonn, Germany
%8 2006-09
%G eng
%0 Journal Article
%J PARA 2006
%D 2006
%T Prospectus for the Next LAPACK and ScaLAPACK Libraries
%A James Demmel
%A Jack Dongarra
%A B. Parlett
%A William Kahan
%A Ming Gu
%A David Bindel
%A Yozo Hida
%A Xiaoye Li
%A Osni Marques
%A Jason E. Riedy
%A Christof Voemel
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Julie Langou
%A Stanimire Tomov
%B PARA 2006
%C Umea, Sweden
%8 2006-06
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms)
%D 2006
%T Recent Developments in GridSolve
%A Asim YarKhan
%A Keith Seymour
%A Kiran Sagi
%A Zhiao Shi
%A Jack Dongarra
%E Yves Robert
%K netsolve
%B International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms)
%I Sage Science Press
%V 20
%8 2006-00
%G eng
%0 Journal Article
%D 2005
%T Introduction to the HPC Challenge Benchmark Suite
%A Piotr Luszczek
%A Jack Dongarra
%A David Koester
%A Rolf Rabenseifner
%A Bob Lucas
%A Jeremy Kepner
%A John McCalpin
%A David Bailey
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%8 2005-03
%G eng
%0 Journal Article
%J International Journal of Parallel Programming
%D 2005
%T New Grid Scheduling and Rescheduling Methods in the GrADS Project
%A Francine Berman
%A Henri Casanova
%A Andrew Chien
%A Keith Cooper
%A Holly Dail
%A Anshuman Dasgupta
%A Wei Deng
%A Jack Dongarra
%A Lennart Johnsson
%A Ken Kennedy
%A Charles Koelbel
%A Bo Liu
%A Xu Liu
%A Anirban Mandal
%A Gabriel Marin
%A Mark Mazina
%A John Mellor-Crummey
%A Celso Mendes
%A A. Olugbile
%A Jignesh M. Patel
%A Dan Reed
%A Zhiao Shi
%A Otto Sievert
%A H. Xia
%A Asim YarKhan
%K grads
%B International Journal of Parallel Programming
%I Springer
%V 33
%P 209-229
%8 2005-06
%G eng
%0 Conference Proceedings
%B In Proceedings of the 2005 SciDAC Conference
%D 2005
%T Performance Analysis of GYRO: A Tool Evaluation
%A Patrick H. Worley
%A Jeff Candy
%A Laura Carrington
%A Kevin Huck
%A Timothy Kaiser
%A Kumar Mahinthakumar
%A Allen D. Malony
%A Shirley Moore
%A Dan Reed
%A Philip C. Roth
%A H. Shan
%A Sameer Shende
%A Allan Snavely
%A S. Sreepathi
%A Felix Wolf
%A Y. Zhang
%K kojak
%B In Proceedings of the 2005 SciDAC Conference
%C San Francisco, CA
%8 2005-06
%G eng
%0 Journal Article
%J Numerische Mathematik
%D 2005
%T Rounding Error Analysis of the Classical Gram-Schmidt Orthogonalization Process
%A Luc Giraud
%A Julien Langou
%A Miroslav Rozložník
%A Jasper van den Eshof
%B Numerische Mathematik
%V 101
%P 87-100
%8 2005-01
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance
%D 2005
%T Self Adaptivity in Grid Computing
%A Sathish Vadhiyar
%A Jack Dongarra
%E John Gurd
%E Anthony Hey
%E Juri Papay
%E Graham Riley
%K netsolve
%K sans
%B Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance
%V 17
%P 235-257
%8 2005-00
%G eng
%0 Conference Proceedings
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04)
%D 2004
%T LAPACK for Clusters Project: An Example of Self Adapting Numerical Software
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04)
%C Big Island, Hawaii
%V 9
%P 90282
%8 2004-01
%G eng
%0 Journal Article
%J Parallel Computing
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%K sans
%B Parallel Computing
%V 29
%P 1723-1743
%8 2003-11
%G eng
%0 Generic
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160)
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%B University of Tennessee Computer Science Technical Report, UT-CS-03-499
%8 2003-01
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T VisPerf: Monitoring Tool for Grid Computing
%A DongWoo Lee
%A Jack Dongarra
%E R. S. Ramakrishna
%K netsolve
%B Lecture Notes in Computer Science
%I Springer Verlag, Heidelberg
%V 2659
%P 233-243
%8 2003
%G eng
%0 Journal Article
%J EuroPar 2002
%D 2002
%T Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load
%A Javier Cuenca
%A Domingo Giménez
%A José González
%A Jack Dongarra
%A Kenneth Roche
%B EuroPar 2002
%C Paderborn, Germany
%8 2002-08
%G eng
%0 Conference Proceedings
%B Parallel Computing: Advances and Current Issues: Proceedings of the International Conference ParCo2001
%D 2002
%T Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion
%A Kenneth Roche
%A Jack Dongarra
%E Gerhard R. Joubert
%E Almerico Murli
%E Frans Peters
%E Marco Vanneschi
%K lfc
%K sans
%B Parallel Computing: Advances and Current Issues: Proceedings of the International Conference ParCo2001
%I Imperial College Press
%C London, England
%8 2002-01
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2002
%T An Updated Set of Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B ACM Transactions on Mathematical Software
%V 28
%P 135-151
%8 2002-12
%G eng
%R 10.1145/567806.567807
%0 Journal Article
%J Submitted to ACM Transactions on Mathematical Software (an update)
%D 2001
%T Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B Submitted to ACM Transactions on Mathematical Software (an update)
%8 2001-02
%G eng
%0 Conference Proceedings
%B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science
%D 2001
%T Fault Tolerant MPI for the HARNESS Meta-Computing System
%A Graham Fagg
%A Antonin Bukovsky
%A Jack Dongarra
%E Benjoe A. Juliano
%E R. Renner
%E K. Tan
%K ftmpi
%K harness
%B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science
%I Springer Verlag
%C Berlin
%V 2073
%P 355-366
%8 2001
%G eng
%R 10.1007/3-540-45545-0_44
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2001
%T The GrADS Project: Software Support for High-Level Grid Application Development
%A Francine Berman
%A Andrew Chien
%A Keith Cooper
%A Jack Dongarra
%A Ian Foster
%A Dennis Gannon
%A Lennart Johnsson
%A Ken Kennedy
%A Carl Kesselman
%A John Mellor-Crummey
%A Dan Reed
%A Linda Torczon
%A Rich Wolski
%K grads
%B International Journal of High Performance Computing Applications
%V 15
%P 327-344
%8 2001-01
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2001
%T Numerical Libraries and The Grid
%A Antoine Petitet
%A Susan Blackford
%A Jack Dongarra
%A Brett Ellis
%A Graham Fagg
%A Kenneth Roche
%A Sathish Vadhiyar
%K grads
%B International Journal of High Performance Computing Applications
%V 15
%P 359-374
%8 2001-01
%G eng
%0 Generic
%D 2001
%T Numerical Libraries and The Grid: The GrADS Experiments with ScaLAPACK
%A Antoine Petitet
%A Susan Blackford
%A Jack Dongarra
%A Brett Ellis
%A Graham Fagg
%A Kenneth Roche
%A Sathish Vadhiyar
%K grads
%K scalapack
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng
%0 Journal Article
%J Handbook of Massive Data Sets
%D 2001
%T Overview of High Performance Computers
%A Aad J. van der Steen
%A Jack Dongarra
%E James Abello
%E Panos Pardalos
%E Mauricio Resende
%B Handbook of Massive Data Sets
%I Kluwer Academic Publishers
%P 791-852
%8 2001-01
%G eng
%0 Journal Article
%J 8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science
%D 2001
%T Parallel I/O Support for Meta-Computing Applications: MPI_Connect I/O Applied to PACX-MPI
%A Graham Fagg
%A Edgar Gabriel
%A Michael Resch
%K ftmpi
%B 8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science
%I Springer Verlag, Berlin
%C Greece
%V 2131
%8 2001-09
%G eng
%0 Generic
%D 2000
%T Design and Implementation of NetSolve using DCOM as the Remoting Layer
%A Ganapathy Raman
%A Jack Dongarra
%K netsolve
%B University of Tennessee Computer Science Department Technical Report
%8 2000-05
%G eng
%0 Generic
%D 2000
%T The GrADS Project: Software Support for High-Level Grid Application Development
%A Francine Berman
%A Andrew Chien
%A Keith Cooper
%A Jack Dongarra
%A Ian Foster
%A Dennis Gannon
%A Lennart Johnsson
%A Ken Kennedy
%A Carl Kesselman
%A Dan Reed
%A Linda Torczon
%A Rich Wolski
%K grads
%B Technical Report
%8 2000-02
%G eng
%0 Conference Proceedings
%B Proceedings of the 16th IMACS World Congress 2000 on Scientific Computing, Applied Mathematics and Simulation
%D 2000
%T A New Recursive Implementation of Sparse Cholesky Factorization
%A Jack Dongarra
%A Padma Raghavan
%B Proceedings of the 16th IMACS World Congress 2000 on Scientific Computing, Applied Mathematics and Simulation
%C Lausanne, Switzerland
%8 2000-08
%G eng
%0 Journal Article
%J Parallel Processing Letters
%D 1999
%T Algorithmic Issues on Heterogeneous Computing Platforms
%A Pierre Boulet
%A Jack Dongarra
%A Fabrice Rastello
%A Yves Robert
%A Frederic Vivien
%B Parallel Processing Letters
%V 9
%P 197-213
%8 1999-01
%G eng
%0 Journal Article
%J SIAM Annual Meeting
%D 1999
%T A Numerical Linear Algebra Problem Solving Environment Designer's Perspective (LAPACK Working Note 139)
%A Antoine Petitet
%A Henri Casanova
%A Clint Whaley
%A Jack Dongarra
%A Yves Robert
%B SIAM Annual Meeting
%C Atlanta, GA
%8 1999-05
%G eng
%0 Journal Article
%J Handbook on Parallel and Distributed Processing
%D 1999
%T Parallel and Distributed Scientific Computing: A Numerical Linear Algebra Problem Solving Environment Designer's Perspective
%A Antoine Petitet
%A Henri Casanova
%A Jack Dongarra
%A Yves Robert
%A Clint Whaley
%B Handbook on Parallel and Distributed Processing
%8 1999-01
%G eng
%0 Journal Article
%J Parallel Computing
%D 1999
%T Static Tiling for Heterogeneous Computing Platforms
%A Pierre Boulet
%A Jack Dongarra
%A Yves Robert
%A Frederic Vivien
%B Parallel Computing
%V 25
%P 547-568
%8 1999-01
%G eng
%0 Journal Article
%J Concurrency: Practice and Experience
%D 1999
%T Tiling on Systems with Communication/Computation Overlap
%A Pierre-Yves Calland
%A Jack Dongarra
%A Yves Robert
%B Concurrency: Practice and Experience
%V 11
%P 139-153
%8 1999-01
%G eng