%0 Generic %D 2024 %T CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT) %A Maksim Melnichenko %A Oleg Balabanov %A Riley Murray %A James Demmel %A Michael W. Mahoney %A Piotr Luszczek %X This paper develops and analyzes a new algorithm for QR decomposition with column pivoting (QRCP) of rectangular matrices with large row counts. The algorithm combines methods from randomized numerical linear algebra in a particularly careful way in order to accelerate both pivot decisions for the input matrix and the process of decomposing the pivoted matrix into the QR form. The source of the latter acceleration is a use of randomized preconditioning and CholeskyQR. Comprehensive analysis is provided in both exact and finite-precision arithmetic to characterize the algorithm's rank-revealing properties and its numerical stability granted probabilistic assumptions of the sketching operator. An implementation of the proposed algorithm is described and made available inside the open-source RandLAPACK library, which itself relies on RandBLAS - also available in open-source format. Experiments with this implementation on an Intel Xeon Gold 6248R CPU demonstrate order-of-magnitude speedups relative to LAPACK's standard function for QRCP, and comparable performance to a specialized algorithm for unpivoted QR of tall matrices, which lacks the strong rank-revealing properties of the proposed method. %I arXiv %8 2024-02 %G eng %U https://arxiv.org/abs/2311.08316 %0 Generic %D 2024 %T XaaS: Acceleration as a Service to Enable Productive High-Performance Cloud Computing %A Torsten Hoefler %A Marcin Copik %A Pete Beckman %A Andrew Jones %A Ian Foster %A Manish Parashar %A Daniel Reed %A Matthias Troyer %A Thomas Schulthess %A Dan Ernst %A Jack Dongarra %X HPC and Cloud have evolved independently, specializing their innovations into performance or productivity. Acceleration as a Service (XaaS) is a recipe to empower both fields with a shared execution platform that provides transparent access to computing resources, regardless of the underlying cloud or HPC service provider. Bridging HPC and cloud advancements, XaaS presents a unified architecture built on performance-portable containers. Our converged model concentrates on low-overhead, high-performance communication and computing, targeting resource-intensive workloads from climate simulations to machine learning. XaaS lifts the restricted allocation model of Function-as-a-Service (FaaS), allowing users to benefit from the flexibility and efficient resource utilization of serverless while supporting long-running and performance-sensitive workloads from HPC. %I arXiv %8 2024-01 %G eng %U https://arxiv.org/abs/2401.04552 %0 Conference Paper %B Lecture Notes in Computer Science %D 2023 %T AI Benchmarking for Science: Efforts from the MLCommons Science Working Group %A Thiyagalingam, Jeyan %A von Laszewski, Gregor %A Yin, Junqi %A Emani, Murali %A Papay, Juri %A Barrett, Gregg %A Luszczek, Piotr %A Tsaris, Aristeidis %A Kirkpatrick, Christine %A Wang, Feiyi %A Gibbs, Tom %A Vishwanath, Venkatram %A Shankar, Mallikarjun %A Fox, Geoffrey %A Hey, Tony %E Anzt, Hartwig %E Bienz, Amanda %E Luszczek, Piotr %E Baboulin, Marc %X With machine learning (ML) becoming a transformative tool for science, the scientific community needs a clear catalogue of ML techniques, and their relative benefits on various scientific problems, if they were to make significant advances in science using AI. 
Although this comes under the purview of benchmarking, conventional benchmarking initiatives are focused on performance, and as such, science often becomes a secondary criterion. In this paper, we describe a community effort from a working group, namely the MLCommons Science Working Group, in developing science-specific AI benchmarking for the international scientific community. Since the inception of the working group in 2020, the group has worked very collaboratively with a number of national laboratories, academic institutions, and industries across the world, and has developed four science-specific AI benchmarks. We describe the overall process and the resulting benchmarks, along with some initial results. We foresee that this initiative is likely to be very transformative for the AI-for-Science and performance-focused communities. %B Lecture Notes in Computer Science %I Springer International Publishing %V 13387 %P 47 - 64 %8 2023-01 %@ 978-3-031-23219-0 %G eng %U https://link.springer.com/chapter/10.1007/978-3-031-23220-6_4 %R 10.1007/978-3-031-23220-6_4 %0 Journal Article %J ACM Transactions on Mathematical Software %D 2023 %T Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors %A Deshmukh, Sameer %A Yokota, Rio %A Bosilca, George %X Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low-rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers, and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on three CPUs using diverse ISAs – the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and the AMD EPYC 7502 using AVX2 – and show that our new batching methodology is able to obtain more than twice the throughput of vendor-optimized libraries for all CPU architectures and problem sizes. %B ACM Transactions on Mathematical Software %V 49 %P 1 - 29 %8 2023-09 %G eng %U https://dl.acm.org/doi/10.1145/3595178 %N 3 %! ACM Trans. Math. Softw. %R 10.1145/3595178 %0 Conference Paper %B 52nd International Conference on Parallel Processing (ICPP 2023) %D 2023 %T O(N) distributed direct factorization of structured dense matrices using runtime systems %A Sameer Deshmukh %A Rio Yokota %A George Bosilca %A Qinxiang Ma %B 52nd International Conference on Parallel Processing (ICPP 2023) %I ACM %C Salt Lake City, Utah %8 2023-08 %@ 9798400708435 %G eng %U https://dl.acm.org/doi/proceedings/10.1145/3605573 %R 10.1145/3605573.3605606 %0 Generic %D 2023 %T Earth Virtualization Engines - A Technical Perspective %A Torsten Hoefler %A Bjorn Stevens %A Andreas F. Prein %A Johanna Baehr %A Thomas Schulthess %A Thomas F. Stocker %A John Taylor %A Daniel Klocke %A Pekka Manninen %A Piers M.
Forster %A Tobias Kölling %A Nicolas Gruber %A Hartwig Anzt %A Claudia Frauen %A Florian Ziemen %A Milan Klöwer %A Karthik Kashinath %A Christoph Schär %A Oliver Fuhrer %A Bryan N. Lawrence %X Participants of the Berlin Summit on Earth Virtualization Engines (EVEs) discussed ideas and concepts to improve our ability to cope with climate change. EVEs aim to provide interactive and accessible climate simulations and data for a wide range of users. They combine high-resolution physics-based models with machine learning techniques to improve the fidelity, efficiency, and interpretability of climate projections. At their core, EVEs offer a federated data layer that enables simple and fast access to exabyte-sized climate data through simple interfaces. In this article, we summarize the technical challenges and opportunities for developing EVEs, and argue that they are essential for addressing the consequences of climate change. %8 2023-09 %G eng %U https://arxiv.org/abs/2309.09002 %0 Conference Paper %B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis %D 2023 %T Elastic deep learning through resilient collective operations %A Li, Jiali %A Bosilca, George %A Bouteiller, Aurélien %A Nicolae, Bogdan %X We present a robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI's resilience capabilities, a.k.a. User-Level Failure Mitigation (ULFM), this novel approach promotes efficient and lightweight failure management and encourages smooth scaling in volatile computational settings. The proposed ULFM MPI-centered mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the capability of the MPI extension to deal with resiliency, and promote ULFM as an effective technique for fault management, minimizing downtime, and thereby enhancing the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications. %B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis %I ACM %C Denver, CO %8 2023-11 %@ 9798400707858 %G eng %U https://dl.acm.org/doi/abs/10.1145/3624062.3626080 %R 10.1145/3624062.3626080 %0 Conference Paper %B 52nd International Conference on Parallel Processing (ICPP 2023) %D 2023 %T Improving the Scaling of an Asynchronous Many-Task Runtime with a Lightweight Communication Engine %A Omri Mor %A George Bosilca %A Marc Snir %K asynchronous many-task %K dynamic runtime %K lightweight communication %K low-rank Cholesky %K message-passing %K MPI %K strong scaling %X There is a growing interest in Asynchronous Many-Task (AMT) runtimes as an efficient way to map irregular and dynamic parallel applications onto heterogeneous computing resources. In this work, we show that AMTs nonetheless struggle with communication bottlenecks when scaling computations strongly and that the design of commonly-used communication libraries such as MPI contributes to these bottlenecks. We replace MPI with LCI, a Lightweight Communication Interface that is designed for dynamic, asynchronous frameworks, as the communication layer for the PaRSEC runtime.
The result is a significant reduction of end-to-end latency in communication microbenchmarks and a reduction of overall time-to-solution by up to 12% in HiCMA, a tile-based low-rank Cholesky factorization package. %B 52nd International Conference on Parallel Processing (ICPP 2023) %I ACM %C Salt Lake City, Utah %8 2023-09 %G eng %U http://snir.cs.illinois.edu/listed/icpp2023-69.pdf %R 10.1145/3605573.3605642 %0 Conference Proceedings %B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2023 %T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements %A Daniel Barry %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K GPU power %K High Performance Computing %K network traffic %K papi %K performance analysis %K Performance Counters %X Some of the most important categories of performance events count the data traffic between the processing cores and the main memory. However, since these counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP); however, doing so adds an indirection layer that involves querying the PCP daemon. This paper performs a quantitative study of the accuracy of the measurements obtained through this component on the Summit supercomputer. We use two linear algebra kernels---a generalized matrix multiply and a modified matrix-vector multiply---as benchmarks and a distributed, GPU-accelerated 3D-FFT mini-app (using cuFFT) to compare the measurements obtained through the PAPI PCP component against the expected values across different problem sizes. We also compare our measurements against an in-house machine with a very similar architecture to Summit, where elevated privileges allow PAPI to access the hardware counters directly (without using PCP), to show that measurements taken via PCP are as accurate as those taken directly. Finally, using both QMCPACK and the 3D-FFT, we demonstrate the diverse hardware activities that can be monitored simultaneously via PAPI hardware components. %B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C St. Petersburg, Florida %8 2023-08 %G eng %U https://ieeexplore.ieee.org/document/10196656 %R 10.1109/IPDPSW59300.2023.00070 %0 Generic %D 2023 %T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements %A Daniel Barry %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %I 28th HIPS Workshop %C St. Petersburg, FL %8 2023-05 %G eng %0 Conference Paper %B Parallel Processing and Applied Mathematics (PPAM 2022) %D 2023 %T Mixed Precision Algebraic Multigrid on GPUs %A Tsai, Yu-Hsiang Mike %A Natalie Beams %A Anzt, Hartwig %E Wyrzykowski, Roman %E Dongarra, Jack %E Deelman, Ewa %E Karczewski, Konrad %K Algebraic multigrid %K GPUs %K mixed precision %K Portability %X In this paper, we present the first GPU-native platform-portable algebraic multigrid (AMG) implementation that allows the user to use different precision formats for the distinct multigrid levels. The AMG we present uses an aggregation size 2 parallel graph match as the AMG coarsening strategy. The implementation provides a high level of flexibility in terms of configuring the bottom-level solver and the precision format for the distinct levels.
We present convergence and performance results on GPUs from AMD, Intel, and NVIDIA, and compare against corresponding functionality available in other libraries. %B Parallel Processing and Applied Mathematics (PPAM 2022) %I Springer International Publishing %C Cham %V 13826 %8 2023-04 %@ 978-3-031-30441-5 %G eng %U https://link.springer.com/10.1007/978-3-031-30442-2 %R 10.1007/978-3-031-30442-2_9 %0 Conference Paper %B Sustained Simulation Performance 2021 %D 2023 %T MPI Continuations And How To Invoke Them %A Schuchart, Joseph %A George Bosilca %X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting the asynchrony of separate but dependent parts of an application coupled by the start and completion of a communication operation. Correctly and efficiently handling MPI communication in different APM models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we discuss different variations of the callback signature and how to best pass data from the code starting the communication operation to the code reacting to its completion. We establish three requirements (efficiency, usability, safety) and evaluate different variations against them. Finally, we find that the current choice is not the best design in terms of both efficiency and safety and propose a simpler, possibly more efficient and safe interface. We also show how the transfer of information into the continuation callback can be largely automated using C++ lambda captures. %B Sustained Simulation Performance 2021 %I Springer International Publishing %C Cham %P 67 - 83 %8 2023-02 %@ 978-3-031-18045-3 %G eng %U https://link.springer.com/10.1007/978-3-031-18046-0 %R 10.1007/978-3-031-18046-0_5 %0 Conference Paper %B 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2023 %T PAQR: Pivoting Avoiding QR factorization %A Sid-Lakhdar, Wissam %A Cayrols, Sebastien %A Bielich, Daniel %A Abdelfattah, Ahmad %A Luszczek, Piotr %A Gates, Mark %A Tomov, Stanimire %A Johansen, Hans %A Williams-Young, David %A Davis, Timothy %A Dongarra, Jack %A Anzt, Hartwig %B 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C St. Petersburg, FL, USA %G eng %U https://ieeexplore.ieee.org/document/10177407/ %R 10.1109/IPDPS54959.2023.00040 %0 Conference Paper %B 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops) %D 2023 %T Performance Insights into Device-initiated RMA Using Kokkos Remote Spaces %A Mishler, Daniel %A Ciesko, Jan %A Olivier, Stephen %A Bosilca, George %X Achieving scalable performance on supercomputers requires careful coordination of communication and computation. Often, MPI applications rely on buffering, packing, and sorting techniques to accommodate a two-sided API, minimize communication overhead, and achieve performance goals. As interconnects between accelerators become more performant and scalable, programming models such as SHMEM may have the opportunity to enable bandwidth maximization along with ease of programming.
In this work, we take a closer look at device-initiated PGAS programming models using NVIDIA Corp’s NVSHMEM communication library and our interface through the Kokkos Remote Spaces project. We show that benchmarks can benefit from this programming model in terms of performance and programmability. We anticipate similar results for miniapplications. %B 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops) %I IEEE %C Santa Fe, NM, USA %8 2023-11 %G eng %U https://ieeexplore.ieee.org/document/10321871/ %R 10.1109/CLUSTERWorkshops61457.2023.00028 %0 Conference Paper %B 2023 IEEE International Conference on Cluster Computing (CLUSTER) %D 2023 %T Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion %A Cao, Qinglei %A Abdulah, Sameh %A Ltaief, Hatem %A Genton, Marc G. %A Keyes, David %A Bosilca, George %X The burgeoning interest in large-scale geospatial modeling, particularly within the domains of climate and weather prediction, underscores the concomitant critical importance of accuracy, scalability, and computational speed. Harnessing these complex simulations’ potential, however, necessitates innovative computational strategies, especially considering the increasing volume of data involved. Recent advancements in Graphics Processing Units (GPUs) have opened up new avenues for accelerating these modeling processes. In particular, their efficient utilization necessitates new strategies, such as mixed-precision arithmetic, that can balance the trade-off between computational speed and model accuracy. This paper leverages the PaRSEC runtime system and delves into the opportunities provided by mixed-precision arithmetic to expedite large-scale geospatial modeling in heterogeneous environments. By using an automated conversion strategy, our mixed-precision approach significantly improves computational performance (up to 3X) on the Summit supercomputer and reduces the associated energy consumption on various Nvidia GPU generations. Importantly, this implementation ensures the requisite accuracy in environmental applications, a critical factor in their operational viability. The findings of this study bear significant implications for future research and development in high-performance computing, underscoring the transformative potential of mixed-precision arithmetic on GPUs in addressing the computational demands of large-scale geospatial modeling and making a stride toward a more sustainable, efficient, and accurate future in large-scale environmental applications. %B 2023 IEEE International Conference on Cluster Computing (CLUSTER) %I IEEE %C Santa Fe, NM, USA %8 2023-11 %G eng %U https://ieeexplore.ieee.org/document/10319946/ %R 10.1109/CLUSTER52292.2023.00035 %0 Report %D 2023 %T Revisiting I/O bandwidth-sharing strategies for HPC applications %A Anne Benoit %A Thomas Herault %A Lucas Perotin %A Yves Robert %A Frederic Vivien %K bandwidth sharing %K HPC applications %K I/O %K scheduling strategy %X This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (First-Come First-Served) or fair-sharing the bandwidth across them (FairShare). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations.
We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms and some more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely steady-state windows, which enables us to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that our simple and low-complexity greedy strategies significantly outperform First-Come First-Served, FairShare, and I/O-Sets, and we recommend that the I/O community implement them for further assessment. %B INRIA Research Report %I INRIA %8 2023-03 %G eng %U https://hal.inria.fr/hal-04038011 %0 Conference Proceedings %B EUROMPI '23: 30th European MPI Users' Group Meeting %D 2023 %T Synchronizing MPI Processes in Space and Time %A Schuchart, Joseph %A Hunold, Sascha %A Bosilca, George %X Performance benchmarks are an integral part of the development and evaluation of parallel algorithms, both in distributed applications as well as in MPI implementations themselves. The initial step of the benchmark process is to obtain a common timestamp to mark the start of an operation across all involved processes, and the state of the art in many applications and widely used MPI benchmark suites is the use of MPI barriers. In this paper, we show that the synchronization in space provided by an MPI_Barrier is insufficient for proper benchmark results of parallel distributed algorithms, using MPI collective operations as examples. The resulting lack of a global start timestamp for an operation leads to skewed results, with a significant impact from the barrier algorithm used. In order to mitigate these issues, we propose and discuss the implementation of MPIX_Harmonize, which extends the synchronization in space provided by MPI_Barrier with a time synchronization to guarantee a common starting timestamp across all involved processes. By replacing the use of MPI_Barrier with MPIX_Harmonize, benchmark implementors can eliminate skews resulting from barrier algorithms and achieve stable performance benchmark results. We show that proper time synchronization can have a significant impact on the benchmark results for various implementations of MPI_Allreduce, MPI_Reduce, and MPI_Bcast. %B EUROMPI '23: 30th European MPI Users' Group Meeting %I ACM %C Bristol, United Kingdom %8 2023-09 %@ 9798400709135 %G eng %U https://dl.acm.org/doi/proceedings/10.1145/3615318 %R 10.1145/3615318.3615325 %0 Journal Article %J Future Generation Computer Systems %D 2023 %T Three-precision algebraic multigrid on GPUs %A Tsai, Yu-Hsiang Mike %A Beams, Natalie %A Anzt, Hartwig %K Algebraic multigrid %K GPUs %K mixed precision %K Portability %X Recent research has demonstrated that using low precision inside some levels of an algebraic multigrid (AMG) solver can improve performance without negatively impacting the AMG quality.
In this paper, we build upon previous research and implement an AMG that can use double, single, and half precision for the distinct multigrid levels. The implementation is platform-portable across GPU architectures from AMD, Intel, and NVIDIA. In an experimental analysis, we demonstrate that the use of half precision can be a viable option in multigrid. We evaluate the performance of different AMG configurations and demonstrate that mixed precision AMG can provide runtime savings compared to a double precision AMG. %B Future Generation Computer Systems %8 2023-07 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167739X23002741 %R 10.1016/j.future.2023.07.024 %0 Conference Paper %B Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop %D 2023 %T When to checkpoint at the end of a fixed-length reservation? %A Quentin Barbut %A Anne Benoit %A Thomas Herault %A Yves Robert %A Frederic Vivien %X This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a stochastic random variable that obeys some well-known probability distribution law. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet (to the best of our knowledge). We provide the optimal solution for a variety of probability distribution laws modeling checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute the optimal number of tasks before the application checkpoints at the beginning of the execution. Then, we design a dynamic strategy that decides whether to checkpoint or to continue executing at the end of each task. We instantiate this second scenario with several examples of probability distribution laws for task durations. %B Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop %C Denver, United States %8 2023-08 %G eng %U https://inria.hal.science/hal-04215554 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2022 %T Accelerating Geostatistical Modeling and Prediction With Mixed-Precision Computations: A High-Productivity Approach With PaRSEC %A Abdulah, Sameh %A Qinglei Cao %A Pei, Yu %A George Bosilca %A Jack Dongarra %A Genton, Marc G. %A Keyes, David E. %A Ltaief, Hatem %A Sun, Ying %K Computational modeling %K Covariance matrices %K Data models %K Maximum likelihood estimation %K Predictive models %K runtime %K Task analysis %X Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data are assumed to possess properties of stationarity or non-stationarity via a kernel fitted to a covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum log-likelihood estimation (MLE), whose central data structure is a dense, symmetric positive definite covariance matrix of the dimension of the number of correlated observations. 
Two essential operations in MLE are the application of the inverse and evaluation of the determinant of the covariance matrix. These can be rendered through the Cholesky decomposition and triangular solution. In this contribution, we reduce the precision of weakly correlated locations to single- or half-precision based on distance. We thus exploit mathematical structure to migrate MLE to a three-precision approximation that takes advantage of contemporary architectures offering BLAS3-like operations in a single instruction that are extremely fast for reduced precision. We illustrate application-expected accuracy worthy of double-precision from a majority half-precision computation, in a context where uniform single-precision is by itself insufficient. In tackling the complexity and imbalance caused by the mixing of three precisions, we deploy the PaRSEC runtime system. PaRSEC delivers on-demand casting of precisions while orchestrating tasks and data movement in a multi-GPU distributed-memory environment within a tile-based Cholesky factorization. Application-expected accuracy is maintained while achieving up to 1.59X by mixing FP64/FP32 operations on 1536 nodes of HAWK or 4096 nodes of Shaheen II, and up to 2.64X by mixing FP64/FP32/FP16 operations on 128 nodes of Summit, relative to FP64-only operations. This translates into up to 4.5, 4.7, ... %B IEEE Transactions on Parallel and Distributed Systems %V 33 %P 964 - 976 %8 2022-04 %G eng %U https://ieeexplore.ieee.org/document/9442267/ %N 4 %! IEEE Trans. Parallel Distrib. Syst. %R 10.1109/TPDS.2021.3084071 %0 Conference Proceedings %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %D 2022 %T Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers %A Ahmad Abdelfattah %A Pieter Ghysels %A Wajih Boukaram %A Stanimire Tomov %A Xiaoye Sherry Li %A Jack Dongarra %K GPU computing %K irregular computational workloads %K lu factorization %K multifrontal solvers %K sparse direct solvers %X Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs. This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem.
%B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %I IEEE Computer Society %C Dallas, TX %P 354-367 %8 2022-11 %G eng %U https://dl.acm.org/doi/abs/10.5555/3571885.3571919 %0 Book Section %B Approximate Computing Techniques %D 2022 %T Approximate Computing for Scientific Applications %A Anzt, Hartwig %A Casas, Marc %A Malossi,  Cristiano I. %A Quintana-Ortí, Enrique S %A Scheidegger, Florian %A Zhuang, Sicong %E Bosio, Alberto %E Ménard, Daniel %E Sentieys, Olivier %X This chapter reviews the performance benefits that result from applying (software) approximate computing to scientific applications. For this purpose, we target two particular areas, linear algebra and deep learning, with the first one selected for being ubiquitous in scientific problems and the second one for its considerable and growing number of important applications both in industry and science. The review of linear algebra in scientific computing is focused on the iterative solution of sparse linear systems, exposing the prevalent costs of memory accesses in these methods, and demonstrating how approximate computing can help to reduce these overheads, for example, in the case of stationary solvers themselves or the application of preconditioners for the solution of sparse linear systems via Krylov subspace methods. The discussion of deep learning is focused on the use of approximate data transfer for cutting costs of host-to-device operations, as well as the use of adaptive precision for accelerating training of classical CNN architectures. Additionally we discuss model optimization and architecture search in presence of constraints for edge devices applications. %B Approximate Computing Techniques %7 322 %I Springer International Publishing %P 415 - 465 %8 2022-01 %@ 978-3-030-94704-0 %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-94705-7_14 %R 10.1007/978-3-030-94705-7_14 %0 Conference Proceedings %B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing %D 2022 %T Checkpointing à la Young/Daly: An Overview %A Anne Benoit %A Yishu Du %A Thomas Herault %A Loris Marchal %A Guillaume Pallez %A Lucas Perotin %A Yves Robert %A Hongyang Sun %A Frederic Vivien %X The Young/Daly formula provides an approximation of the optimal checkpoint period for a parallel application executing on a supercomputing platform. The Young/Daly formula was originally designed for preemptible tightly-coupled applications. We provide some background and survey various application scenarios to assess the usefulness and limitations of the formula. 
%B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing %I ACM Press %C Noida, India %P 701-710 %8 2022-08 %@ 9781450396752 %G eng %U https://dl.acm.org/doi/fullHtml/10.1145/3549206.3549328 %R 10.1145/3549206.3549328 %0 Generic %D 2022 %T Communication Avoiding LU with Tournament Pivoting in SLATE %A Rabab Alomairy %A Mark Gates %A Sebastien Cayrols %A Dalal Sukkari %A Kadir Akbudak %A Asim YarKhan %A Paul Bagwell %A Jack Dongarra %B SLATE Working Notes %8 2022-01 %G eng %0 Journal Article %J International Journal of Networking and Computing %D 2022 %T Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms %A George Bosilca %A Bouteiller, Aurélien %A Herault, Thomas %A Le Fèvre, Valentin %A Robert, Yves %A Dongarra, Jack %X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementation of these algorithms over a task-based runtime system, PaRSEC, and show the advantages and limitations of each approach in a real implementation. %B International Journal of Networking and Computing %V 12 %P 26 - 46 %8 2022-01 %G eng %U https://www.jstage.jst.go.jp/article/ijnc/12/1/12_26/_article %N 1 %! IJNC %R 10.15803/ijnc.12.1_26 %0 Conference Paper %B 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM) %D 2022 %T Composition of Algorithmic Building Blocks in Template Task Graphs %A Herault, Thomas %A Schuchart, Joseph %A Valeev, Edward F. %A George Bosilca %B 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM) %I IEEE %C Dallas, TX, USA %8 2023-01 %G eng %U https://ieeexplore.ieee.org/document/10024647/ %R 10.1109/PAW-ATM56565.2022.00008 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2022 %T Evaluating Data Redistribution in PaRSEC %A Qinglei Cao %A George Bosilca %A Losada, Nuria %A Wu, Wei %A Zhong, Dong %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %V 33 %P 1856-1872 %8 2022-08 %G eng %R 10.1109/TPDS.2021.3131657 %0 Journal Article %J Journal of Radioanalytical and Nuclear Chemistry %D 2022 %T Evaluations of molecular modeling and machine learning for predictive capabilities in binding of lanthanum and actinium with carboxylic acids %A Penchoff, Deborah A. %A Peterson, Charles C. %A Wrancher, Eleigha M. %A George Bosilca %A Harrison, Robert J. %A Valeev, Edward F. %A Benny, Paul D. %B Journal of Radioanalytical and Nuclear Chemistry %8 2022-12 %G eng %U https://rdcu.be/c2lGj %!
J Radioanal Nucl Chem %R 10.1007/s10967-022-08620-7 %0 Conference Paper %B IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2022 %T A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization %A Qinglei Cao %A Rabab Alomairy %A Yu Pei %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %B IEEE International Parallel and Distributed Processing Symposium (IPDPS) %8 2022-07 %G eng %U https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9820680&isnumber=9820610 %R 10.1109/IPDPS53621.2022.00047 %0 Conference Paper %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2022 %T Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment %A Schuchart, Joseph %A Nookala, Poornima %A Javanmard, Mohammad Mahdi %A Herault, Thomas %A Valeev, Edward F. %A George Bosilca %A Harrison, Robert J. %X We present and evaluate TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control and data flow-graph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared memory parallel environments; a few support distributed memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed memory execution over two different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to state-of-the-art implementations. %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Lyon, France %8 2022-07 %G eng %U https://ieeexplore.ieee.org/abstract/document/9820613 %R 10.1109/IPDPS53621.2022.00086 %0 Conference Paper %B 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) %D 2022 %T Implicit Actions and Non-blocking Failure Recovery with MPI %A Bouteiller, Aurélien %A George Bosilca %X Scientific applications have long embraced MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap.
This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints). %B 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) %I IEEE %C Dallas, TX, USA %8 2023-01 %G eng %U https://ieeexplore.ieee.org/document/10024038/ %R 10.1109/FTXS56515.2022.00009 %0 Conference Proceedings %B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) %D 2022 %T Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach %A Whitlock, Matthew %A Morales, Nicolas %A George Bosilca %A Bouteiller, Aurélien %A Nicolae, Bogdan %A Teranishi, Keita %A Giem, Elisabeth %A Sarkar, Vivek %K checkpointing %K Fault tolerance %K Fenix %K HPC %K Kokkos %K MPI-ULFM %K resilience %B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) %C Heidelberg, Germany %8 2022-09 %G eng %U https://hal.archives-ouvertes.fr/hal-03772536 %0 Conference Proceedings %B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %D 2022 %T Lossy all-to-all exchange for accelerating parallel 3-D FFTs on hybrid architectures with GPUs %A Cayrols, Sebastien %A Li, Jiali %A George Bosilca %A Stanimire Tomov %A Ayala, Alan %A Dongarra, Jack %X In the context of parallel applications, communication is a critical part of the infrastructure and a potential bottleneck. The traditional approach to tackle communication challenges consists of redesigning algorithms so that the complexity or the communication volume is reduced. However, there are algorithms like the Fast Fourier Transform (FFT) where reducing the volume of communication is very challenging yet can reap large benefits in terms of time-to-completion. In this paper, we revisit the implementation of the MPI all-to-all routine at the core of 3D FFTs by using advanced MPI features, such as One-Sided Communication, and integrate data compression during communication to reduce the volume of data exchanged. Since some compression techniques are ‘lossy’ in the sense that they involve a loss of accuracy, we study the impact of lossy compression in heFFTe, the state-of-the-art FFT library for large-scale 3D FFTs on hybrid architectures with GPUs. Consequently, we design an approximate FFT algorithm that trades off user-controlled accuracy for speed. We show that we speed up the 3D FFTs in proportion to the compression rate. In terms of accuracy, comparing our approach with a reduced precision execution, where both the data and the computation are in reduced precision, we show that when the volume of communication is compressed to the size of the reduced precision data, the approximate FFT algorithm is as fast as the one in reduced precision while the accuracy is one order of magnitude better.
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %P 152-160 %8 2022-09 %G eng %R 10.1109/CLUSTER51413.2022.00029 %0 Generic %D 2022 %T Mixed precision and approximate 3D FFTs: Speed for accuracy trade-off with GPU-aware MPI and run-time data compression %A Sebastien Cayrols %A Jiali Li %A George Bosilca %A Stanimire Tomov %A Alan Ayala %A Jack Dongarra %K All-to-all %K Approximate FFTs %K ECP %K heFFTe %K Lossy compression %K mixed-precision algorithms %K MPI %B ICL Technical Report %8 2022-05 %G eng %0 Journal Article %J Parallel Computing %D 2022 %T OpenMP application experiences: Porting to accelerated nodes %A Bak, Seonmyeong %A Bertoni, Colleen %A Boehm, Swen %A Budiardja, Reuben %A Chapman, Barbara M. %A Doerfert, Johannes %A Eisenbach, Markus %A Finkel, Hal %A Hernandez, Oscar %A Huber, Joseph %A Iwasaki, Shintaro %A Kale, Vivek %A Kent, Paul R.C. %A Kwack, JaeHyuk %A Lin, Meifeng %A Luszczek, Piotr %A Luo, Ye %A Pham, Buu %A Pophale, Swaroop %A Ravikumar, Kiran %A Sarkar, Vivek %A Scogland, Thomas %A Tian, Shilei %A Yeung, P.K. %X As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand the OpenMP implementation’s behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model to use for applications to be run on emerging and future platforms with accelerated nodes. %B Parallel Computing %V 109 %8 2022-03 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167819121001009 %! Parallel Computing %R 10.1016/j.parco.2021.102856 %0 Generic %D 2022 %T PAQR: Pivoting Avoiding QR factorization %A Wissam M. Sid-Lakhdar %A Sebastien Cayrols %A Daniel Bielich %A Ahmad Abdelfattah %A Piotr Luszczek %A Mark Gates %A Stanimire Tomov %A Hans Johansen %A David Williams-Young %A Timothy A. Davis %A Jack Dongarra %X The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is deficient. On the other hand, QR with column pivoting (QRCP) is able to produce smaller forward errors on deficient matrices, but its cost is prohibitive compared to QR. The aim of this paper is to propose PAQR, an alternative solution method with the same cost (or smaller) as QR and as accurate as QRCP in practical cases, for the solution of rank-deficient linear least-squares problems. 
After presenting the algorithm and its implementations on different architectures, we compare its accuracy and performance results on a variety of application problems. %B ICL Technical Report %8 2022-06 %G eng %0 Conference Proceedings %B Euro-Par 2021: Parallel Processing Workshops %D 2022 %T Porting Sparse Linear Algebra to Intel GPUs %A Tsai, Yuhsiang M. %A Cojean, Terry %A Anzt, Hartwig %E Chaves, Ricardo %E B. Heras, Dora %E Ilic, Aleksandar %E Unat, Didem %E Badia, Rosa M. %E Bracciali, Andrea %E Diehl, Patrick %E Dubey, Anshu %E Sangyoon, Oh %E L. Scott, Stephen %E Ricci, Laura %K Ginkgo %K Intel GPUs %K math library %K oneAPI %K SpMV %X With discrete Intel GPUs entering the high performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this paper, we report how we prepare the Ginkgo math library for Intel GPUs by developing a kernel backend based on the DPC++ programming environment. We discuss conceptual differences to the CUDA and HIP programming models and describe workflows for simplified code conversion. We benchmark advanced sparse linear algebra routines utilizing the converted kernels to assess the efficiency of the DPC++ backend in the hardware-specific performance bounds. We compare the performance of basic building blocks against routines providing the same functionality that ship with Intel’s oneMKL vendor library. %B Euro-Par 2021: Parallel Processing Workshops %I Springer International Publishing %C Lisbon, Portugal %V 13098 %P 57 - 68 %8 2022-06 %@ 978-3-031-06155-4 %G eng %U https://link.springer.com/chapter/10.1007/978-3-031-06156-1_5 %R 10.1007/978-3-031-06156-1_5 %0 Conference Paper %B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %D 2022 %T Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG %A Schuchart, Joseph %A Nookala, Poornima %A Herault, Thomas %A Valeev, Edward F. %A George Bosilca %K Dataflow graph %K Hardware %K Instruction sets %K Memory management %K PaRSEC %K parallel programming %K runtime %K scalability %K Task analysis %K task-based programming %K Template Task Graph %K TTG %X Shared memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well-suited to cope with the ubiquitous multi- and many-core systems since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. There is general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss in efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aiming at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and which is able to compete with OpenMP in shared memory.
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %I IEEE %C Heidelberg, Germany %8 2022-09 %G eng %U https://ieeexplore.ieee.org/document/9912704/ %R 10.1109/CLUSTER51413.2022.00026 %0 Conference Proceedings %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %D 2022 %T Reshaping Geostatistical Modeling and Prediction for Extreme-Scale Environmental Applications %A Cao, Qinglei %A Abdulah, Sameh %A Rabab Alomairy %A Pei, Yu %A Pratik Nag %A George Bosilca %A Dongarra, Jack %A Genton, Marc G. %A Keyes, David %A Ltaief, Hatem %A Sun, Ying %K climate/weather prediction %K dynamic runtime systems %K high performance computing %K low-rank matrix approximations %K mixed-precision computations %K space-time geospatial statistics %K Task-based programming models %X We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to a 12X performance speedup against the highly optimized dense Cholesky implementation.
%B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %I IEEE Press %C Dallas, TX %8 2022-11 %@ 9784665454445 %G eng %U https://dl.acm.org/doi/abs/10.5555/3571885.3571888 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2022 %T Resiliency in numerical algorithm design for extreme scale simulations %A Agullo, Emmanuel %A Altenbernd, Mirco %A Anzt, Hartwig %A Bautista-Gomez, Leonardo %A Benacchio, Tommaso %A Bonaventura, Luca %A Bungartz, Hans-Joachim %A Chatterjee, Sanjay %A Ciorba, Florina M %A DeBardeleben, Nathan %A Drzisga, Daniel %A Eibl, Sebastian %A Engelmann, Christian %A Gansterer, Wilfried N %A Giraud, Luc %A Göddeke, Dominik %A Heisig, Marco %A Jézéquel, Fabienne %A Kohl, Nils %A Li, Xiaoye Sherry %A Lion, Romain %A Mehl, Miriam %A Mycek, Paul %A Obersteiner, Michael %A Quintana-Ortí, Enrique S %A Rizzi, Francesco %A Rüde, Ulrich %A Schulz, Martin %A Fung, Fred %A Speck, Robert %A Stals, Linda %A Teranishi, Keita %A Thibault, Samuel %A Thönnes, Dominik %A Wagner, Andreas %A Wohlmuth, Barbara %K Fault tolerance %K Numerical algorithms %K parallel computer architecture %K resilience %B The International Journal of High Performance Computing Applications %V 36 %P 251 - 285 %8 2022-03 %G eng %U http://journals.sagepub.com/doi/10.1177/10943420211055188 %N 2 %! The International Journal of High Performance Computing Applications %R 10.1177/10943420211055188 %0 Conference Paper %B 2022 IEEE High Performance Extreme Computing Conference (HPEC) %D 2022 %T Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance %A Piotr Luszczek %A Cade Brown %K Analytical models %K Benchmark testing %K Cloud computing %K Computational modeling %K Data models %K Measurement %K Satellites %X We present a benchmarking platform for surrogate ML/AI models that enables the essential properties of open science and allows these models to be findable, accessible, interoperable, and reusable. We also present a use case of cloud cover modeling, analysis, and experimental testing based on a large dataset of multi-spectral satellite sensor data. We use this particular evaluation to highlight the plethora of choices that need resolution for the life cycle of supporting the scientific workflows with data-driven models that need to be first trained to satisfactory accuracy and later monitored during field usage for proper feedback into both computational results and future data model improvements. Unlike traditional testing, performance, or analysis efforts, we focus exclusively on science-oriented metrics as the relevant figures of merit. %B 2022 IEEE High Performance Extreme Computing Conference (HPEC) %I IEEE %8 2022-09 %G eng %U https://ieeexplore.ieee.org/document/9926401/ %R 10.1109/HPEC55821.2022.9926401 %0 Journal Article %J Parallel Computing %D 2022 %T Using long vector extensions for MPI reductions %A Zhong, Dong %A Cao, Qinglei %A George Bosilca %A Dongarra, Jack %X The design of modern CPUs, including their deep memory hierarchies and SIMD/vectorization capabilities, has a more significant impact on algorithms’ efficiency than the modest frequency increase observed recently.
The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization a critical software component to increase efficiency and close the gap to peak performance. In this paper, we investigate the impact of the vectorization of MPI reduction operations. We propose an implementation of predefined MPI reduction operations using vector intrinsics (AVX and SVE) to improve the time-to-solution of the predefined MPI reduction operations. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures. Experiments conducted on varied architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX) show that the proposed vector-extension-optimized reduction operations significantly reduce completion time for collective communication reductions. With these optimizations, we achieve higher memory bandwidth and an increased efficiency for local computations, which directly benefit the overall cost of collective reductions and applications based on them. %B Parallel Computing %V 109 %P 102871 %8 2022-03 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167819121001137 %! Parallel Computing %R 10.1016/j.parco.2021.102871 %0 Generic %D 2021 %T Accelerating FFT towards Exascale Computing %A Alan Ayala %A Stanimire Tomov %A Haidar, Azzam %A Stoyanov, M. %A Cayrols, Sebastien %A Li, Jiali %A George Bosilca %A Jack Dongarra %I NVIDIA GPU Technology Conference (GTC2021) %G eng %0 Journal Article %J Parallel Computing %D 2021 %T Callback-based completion notification using MPI Continuations %A Schuchart, Joseph %A Samfass, Philipp %A Niethammer, Christoph %A Gracia, José %A George Bosilca %K MPI %K MPI Continuations %K OmpSs %K OpenMP %K parsec %K TAMPI %K Task-based programming models %X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and nonblocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space. %B Parallel Computing %P 102793 %8 2021-05 %G eng %U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000466?via%3Dihub %!
Parallel Computing %R 10.1016/j.parco.2021.102793 %0 Conference Paper %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %D 2021 %T Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure %A Thomas Herault %A Yves Robert %A George Bosilca %A Robert Harrison %A Cannada Lewis %A Edward Valeev %A Jack Dongarra %K block-sparse matrix multiplication %K distributed-memory %K Electronic structure %K multi-GPU node %K parsec %K tensor contraction %X Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size. %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %I IEEE %C Portland, OR %8 2021-05 %G eng %U https://hal.inria.fr/hal-02970659/document %0 Generic %D 2021 %T DTE: PaRSEC Enabled Libraries and Applications %A George Bosilca %A Thomas Herault %A Jack Dongarra %I 2021 Exascale Computing Project Annual Meeting %8 2021-04 %G eng %0 Journal Article %J Int. J. of Networking and Computing %D 2021 %T Dynamic DAG scheduling under memory constraints for shared-memory platforms %A Gabriel Bathie %A Loris Marchal %A Yves Robert %A Samuel Thibault %B Int. J. of Networking and Computing %V 11 %P 27-49 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2021 %T Efficient exascale discretizations: High-order finite element methods %A Kolev, Tzanio %A Fischer, Paul %A Min, Misun %A Jack Dongarra %A Brown, Jed %A Dobrev, Veselin %A Warburton, Tim %A Stanimire Tomov %A Shephard, Mark S %A Abdelfattah, Ahmad %A others %K co-design %K high-order discretizations %K High-performance computing %K PDEs %K unstructured grids %X Efficient exploitation of exascale architectures requires rethinking of the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra fine-grain parallelism and maximize the ratio of floating point operations to energy intensive data movement. One of the few viable approaches to achieve high efficiency in the area of PDE discretizations on unstructured grids is to use matrix-free/partially assembled high-order finite element methods, since these methods can increase the accuracy and/or lower the computational time due to reduced data motion.
In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps, and discretization libraries, and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, high-order data visualization, and list examples of collaborations with several ECP and external applications. %B The International Journal of High Performance Computing Applications %P 10943420211020803 %G eng %R 10.1177/10943420211020803 %0 Book Section %B Tools for High Performance Computing 2018/2019 %D 2021 %T Effortless Monitoring of Arithmetic Intensity with PAPI’s Counter Analysis Toolkit %A Daniel Barry %A Danalis, Anthony %A Heike Jagode %X With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that heavily utilize numerical kernels. Performance metrics in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it becomes more and more unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity seems particularly true for events related to hardware that resides beyond the compute core, such as events related to memory traffic. Still, traffic to memory is a necessary characteristic for determining arithmetic intensity. To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for reading and writing bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels. %B Tools for High Performance Computing 2018/2019 %I Springer %P 195–218 %@ 978-3-030-66057-4 %G eng %R 10.1007/978-3-030-66057-4_11 %0 Generic %D 2021 %T Ginkgo: A Sparse Linear Algebra Library for HPC %A Hartwig Anzt %A Natalie Beams %A Terry Cojean %A Fritz Göbel %A Thomas Grützmacher %A Aditya Kashi %A Pratik Nayak %A Tobias Ribizel %A Yuhsiang M.
Tsai %I 2021 ECP Annual Meeting %8 2021-04 %G eng %0 Journal Article %J Parallel Computing %D 2021 %T GPU algorithms for Efficient Exascale Discretizations %A Abdelfattah, Ahmad %A Valeria Barra %A Natalie Beams %A Bleile, Ryan %A Brown, Jed %A Camier, Jean-Sylvain %A Carson, Robert %A Chalmers, Noel %A Dobrev, Veselin %A Dudouit, Yohann %A others %K Exascale applications %K Finite element methods %K GPU acceleration %K high-order discretizations %K High-performance computing %X In this paper we describe the research and development activities in the Center for Efficient Exascale Discretizations within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems. %B Parallel Computing %V 108 %P 102841 %G eng %R 10.1016/j.parco.2021.102841 %0 Journal Article %J Parallel Computing %D 2021 %T An international survey on MPI users %A Atsushi Hori %A Emmanuel Jeannot %A George Bosilca %A Takahiro Ogura %A Balazs Gerofi %A Jie Yin %A Yutaka Ishikawa %K message passing interface %K MPI %K survey %X The Message Passing Interface (MPI) plays a crucial part in the parallel computing ecosystem, a driving force behind many of the high-performance computing (HPC) successes. To maintain its relevance to the user community—and in particular to the growing HPC community at large—the MPI standard needs to identify and understand the MPI users’ concerns and expectations, and adapt accordingly to continue to efficiently bridge the gap between users and hardware. This questionnaire survey was conducted using two online questionnaire frameworks and has gathered more than 850 answers from 42 countries since February 2019. Some of the preceding surveys of MPI use are questionnaire surveys like ours, while others are conducted either by analyzing MPI programs to reveal static behavior or by using profiling tools to analyze the dynamic runtime behavior of MPI jobs. Our survey is different from other questionnaire surveys in terms of its larger number of participants and wide geographic spread. As a result, it is possible to illustrate the current status of MPI users more accurately and with a wider geographical distribution. In this report, we will show some interesting findings, compare the results with preceding studies when possible, and provide some recommendations for the MPI Forum based on the findings. %B Parallel Computing %V 108 %8 2021-12 %G eng %U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000983 %R 10.1016/j.parco.2021.102853 %0 Book Section %B Rare Earth Elements and Actinides: Progress in Computational Science Applications %D 2021 %T An Introduction to High Performance Computing and Its Intersection with Advances in Modeling Rare Earth Elements and Actinides %A Deborah A. Penchoff %A Edward Valeev %A Heike Jagode %A Piotr Luszczek %A Anthony Danalis %A George Bosilca %A Robert J. Harrison %A Jack Dongarra %A Theresa L. Windus %K actinide %K Computational modeling %K HPC %K REE %X Computationally driven solutions in nuclear and radiochemistry heavily depend on efficient modeling of Rare Earth Elements (REEs) and actinides.
Accurate modeling of REEs and actinides faces challenges stemming from an imbalanced hardware-software ecosystem and the resulting inefficient use of High Performance Computing (HPC). This chapter provides a historical perspective on the evolution of HPC hardware, its intersectionality with domain sciences, the importance of benchmarks for performance, and an overview of challenges and advances in modeling REEs and actinides. This chapter intends to provide an introduction for researchers at the intersection of scientific computing, software development for HPC, and applied computational modeling of REEs and actinides. The chapter is structured in five sections. First, the Introduction includes subsections focusing on the Importance of REEs and Actinides (1.1), Hardware, Software, and the HPC Ecosystem (1.2), and Electronic Structure Modeling of REEs and Actinides (1.3). Second, a section in High Performance Computing focuses on the TOP500 (2.1), HPC Performance (2.2), HPC Benchmarks: Processing, Bandwidth, and Latency (2.3), and HPC Benchmarks and their Relationship to Chemical Modeling (2.4). Third, the Software Challenges and Advances focus on NWChem/NWChemEx (3.1), MADNESS (3.2), and MPQC (3.3). The fourth section provides a short overview of Artificial Intelligence in HPC applications relevant to nuclear and radiochemistry. The fifth section illustrates A Protocol to Evaluate Complexation Preferences in Separations of REEs and Actinides through Computational Modeling. %B Rare Earth Elements and Actinides: Progress in Computational Science Applications %I American Chemical Society %C Washington, DC %V 1388 %P 3-53 %8 2021-10 %@ ISBN13: 9780841298255 eISBN: 9780841298248 %G eng %U https://pubs.acs.org/doi/10.1021/bk-2021-1388.ch001 %& 1 %R 10.1021/bk-2021-1388.ch001 %0 Conference Paper %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %D 2021 %T Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems %A Qinglei Cao %A Yu Pei %A Kadir Akbudak %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %K asynchronous executions and load balancing %K dynamic runtime system %K environmental applications %K High-performance computing %K low-rank matrix computations %K task-based programming model %K user productivity %X The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications.
In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide a dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorization by up to 7-fold on a large-scale distributed-memory system, while reducing the memory footprint by up to 44-fold. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries. %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %I IEEE %C Portland, OR %8 2021-05 %G eng %0 Journal Article %J Journal of Open Source Software %D 2021 %T libCEED: Fast algebra for high-order element-based discretizations %A Jed Brown %A Ahmad Abdelfattah %A Valeria Barra %A Natalie Beams %A Jean-Sylvain Camier %A Veselin Dobrev %A Yohann Dudouit %A Leila Ghaffari %A Tzanio Kolev %A David Medina %A Will Pazner %A Thilina Ratnayaka %A Jeremy Thompson %A Stanimire Tomov %K finite elements %K high-order methods %K High-performance computing %K matrix-free %K spectral elements %X Finite element methods are widely used to solve partial differential equations (PDE) in science and engineering, but their standard implementation (Arndt et al., 2020; Kirk et al., 2006; Logg et al., 2012) relies on assembling sparse matrices. Sparse matrix multiplication and triangular operations perform a scalar multiply and add for each nonzero entry, just 2 floating point operations (flops) per scalar that must be loaded from memory (Williams et al., 2009). Modern hardware is capable of nearly 100 flops per scalar streamed from memory (Rupp, 2020), so sparse matrix operations cannot achieve more than about 2% utilization of arithmetic units. Matrix assembly becomes even more problematic when the polynomial degree p of the basis functions is increased, resulting in O(p^d) storage and O(p^{2d}) compute per degree of freedom (DoF) in d dimensions. Methods pioneered by the spectral element community (Deville et al., 2002; Orszag, 1980) exploit problem structure to reduce costs to O(1) storage and O(p) compute per DoF, with very high utilization of modern CPUs and GPUs. Unfortunately, high-quality implementations have been relegated to applications and intrusive frameworks that are often difficult to extend to new problems or incorporate into legacy applications, especially when strong preconditioners are required. libCEED, the Code for Efficient Extensible Discretization (Abdelfattah et al., 2021), is a lightweight library that provides a purely algebraic interface for linear and nonlinear operators and preconditioners with element-based discretizations. libCEED provides portable performance via run-time selection of implementations optimized for CPUs and GPUs, including support for just-in-time (JIT) compilation.
It is designed for convenient use in new and legacy software, and offers interfaces in C99 (International Standards Organisation, 1999), Fortran77 (ANSI, 1978), Python (Python, 2021), Julia (Bezanson et al., 2017), and Rust (Rust, 2021). Users and library developers can integrate libCEED at a low level into existing applications in place of existing matrix-vector products without significant refactoring of their own discretization infrastructure. Alternatively, users can utilize integrated libCEED support in MFEM (Anderson et al., 2020; MFEM, 2021). In addition to supporting applications and discretization libraries, libCEED provides a platform for performance engineering and co-design, as well as an algebraic interface for solvers research like adaptive p-multigrid, much like how sparse matrix libraries enable development and deployment of algebraic multigrid solvers. %B Journal of Open Source Software %V 6 %P 2945 %G eng %U https://doi.org/10.21105/joss.02945 %R 10.21105/joss.02945 %0 Conference Proceedings %B IPDPS'2021, the 34th IEEE International Parallel and Distributed Processing Symposium %D 2021 %T Max-Stretch Minimization on an Edge-Cloud Platform %A Anne Benoit %A Redouane Elghazi %A Yves Robert %B IPDPS'2021, the 34th IEEE International Parallel and Distributed Processing Symposium %I IEEE Computer Society Press %G eng %0 Conference Paper %B EuroMPI'21 %D 2021 %T Quo Vadis MPI RMA? Towards a More Efficient Use of MPI One-Sided Communication %A Schuchart, Joseph %A Niethammer, Christoph %A Gracia, José %A George Bosilca %K Memory Handles %K MPI %K MPI-RMA %K RDMA %X The MPI standard has long included one-sided communication abstractions through the MPI Remote Memory Access (RMA) interface. Unfortunately, the MPI RMA chapter in the 4.0 version of the MPI standard still contains both well-known and lesser-known shortcomings for both implementations and users, which lead to potentially non-optimal usage patterns. In this paper, we identify a set of issues and propose ways for applications to better express anticipated usage of RMA routines, allowing the MPI implementation to better adapt to the application's needs. In order to increase the flexibility of the RMA interface, we add the capability to duplicate windows, allowing access to the same resources encapsulated by a window using different configurations. In the same vein, we introduce the concept of MPI memory handles, meant to provide life-time guarantees on memory attached to dynamic windows, removing the overhead currently present in using dynamically exposed memory. We will show that our extensions provide improved accumulate latencies, reduced overheads for multi-threaded flushes, and allow for zero-overhead dynamic memory window usage. %B EuroMPI'21 %C Garching, Munich Germany %G eng %U https://arxiv.org/abs/2111.08142 %0 Journal Article %J Int. J. of Networking and Computing %D 2021 %T Resilient scheduling heuristics for rigid parallel jobs %A Anne Benoit %A Valentin Le Fèvre %A Padma Raghavan %A Yves Robert %A Hongyang Sun %B Int. J.
of Networking and Computing %V 11 %P 2-26 %G eng %0 Conference Proceedings %B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2021 %T Revisiting Credit Distribution Algorithms for Distributed Termination Detection %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Le Fèvre, Valentin %A Robert, Yves %A Jack Dongarra %K control messages %K credit distribution algorithms %K task-based HPC application %K Termination detection %X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. %B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %P 611–620 %G eng %R 10.1109/IPDPSW52791.2021.00095 %0 Generic %D 2021 %T SLATE Performance Improvements: QR and Eigenvalues %A Kadir Akbudak %A Paul Bagwell %A Sebastien Cayrols %A Mark Gates %A Dalal Sukkari %A Asim YarKhan %A Jack Dongarra %B SLATE Working Notes %8 2021-04 %G eng %0 Generic %D 2021 %T SLATE Port to AMD and Intel Platforms %A Ahmad Abdelfattah %A Mohammed Al Farhan %A Cade Brown %A Mark Gates %A Dalal Sukkari %A Asim YarKhan %A Jack Dongarra %B SLATE Working Notes %8 2021-04 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2021 %T A survey of numerical linear algebra methods utilizing mixed-precision arithmetic %A Abdelfattah, Ahmad %A Anzt, Hartwig %A Boman, Erik G %A Carson, Erin %A Cojean, Terry %A Jack Dongarra %A Fox, Alyson %A Mark Gates %A Higham, Nicholas J %A Li, Xiaoye S %A others %K GPUs %K High-performance computing %K linear algebra %K Mixed-precision arithmetic %K numerical mathematics %X The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to scientific computing applications. Especially with the hardware integration of low-precision special-function units designed for machine learning applications, the traditional numerical algorithms community urgently needs to reconsider the floating point formats used in the distinct operations to efficiently leverage the available compute power. In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra problems. 
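To make the central mixed-precision pattern of the survey entry above concrete, the following is a minimal, illustrative Python/NumPy sketch of classic iterative refinement, one of the core techniques the survey covers. It is not code from the survey; the function name mixed_precision_solve and the use of float32 as a stand-in for half precision are assumptions made purely for illustration.

    # Minimal sketch of mixed-precision iterative refinement (illustrative only):
    # factor the matrix once in low precision, then refine the solution with
    # residuals computed in high precision.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, iters=10, tol=1e-12):
        # Low-precision LU factorization (float32 stands in for FP16 here).
        lu, piv = lu_factor(A.astype(np.float32))
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(iters):
            r = b - A @ x  # residual computed in float64
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            # Correction from another low-precision solve, applied in float64.
            d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
            x += d
        return x

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 200)) + 200 * np.eye(200)  # well conditioned
    b = rng.standard_normal(200)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(A @ x - b))  # residual near float64 roundoff

The factorization, the dominant O(n^3) cost, happens once in low precision; each refinement step costs only O(n^2), which is why the surveyed methods can approach low-precision speed at high-precision accuracy.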
%B The International Journal of High Performance Computing Applications %V 35 %P 344–369 %G eng %R 10.1177/10943420211003313 %0 Conference Proceedings %B Proceedings of the ACM International Conference on Supercomputing %D 2021 %T Task-graph scheduling extensions for efficient synchronization and communication %A Bak, Seonmyeong %A Hernandez, Oscar %A Mark Gates %A Piotr Luszczek %A Sarkar, Vivek %K Compilers %K Computing methodologies %K Parallel computing methodologies %K Parallel programming languages %K Runtime environments %K Software and its engineering %K Software notations and tools %X Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs. %B Proceedings of the ACM International Conference on Supercomputing %P 88–101 %G eng %R 10.1145/3447818.3461616 %0 Generic %D 2020 %T ASCR@40: Four Decades of Department of Energy Leadership in Advanced Scientific Computing Research %A Bruce Hendrickson %A Paul Messina %A Buddy Bland %A Jackie Chen %A Phil Colella %A Eli Dart %A Jack Dongarra %A Thom Dunning %A Ian Foster %A Richard Gerber %A Rachel Harken %A Wendy Huntoon %A Bill Johnston %A John Sarrao %A Jeff Vetter %I Advanced Scientific Computing Advisory Committee (ASCAC), US Department of Energy %8 2020-08 %G eng %U https://computing.llnl.gov/misc/ASCR@40-Highlights.pdf %0 Generic %D 2020 %T ASCR@40: Highlights and Impacts of ASCR’s Programs %A Bruce Hendrickson %A Paul Messina %A Buddy Bland %A Jackie Chen %A Phil Colella %A Eli Dart %A Jack Dongarra %A Thom Dunning %A Ian Foster %A Richard Gerber %A Rachel Harken %A Wendy Huntoon %A Bill Johnston %A John Sarrao %A Jeff Vetter %X The Office of Advanced Scientific Computing Research (ASCR) sits within the Office of Science in the Department of Energy (DOE). Per their web pages, “the mission of the ASCR program is to discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to the DOE.” This succinct statement encompasses a wide range of responsibilities for computing and networking facilities; for procuring, deploying, and operating high performance computing, networking, and storage resources; for basic research in mathematics and computer science; for developing and sustaining a large body of software; and for partnering with organizations across the Office of Science and beyond. 
While its mission statement may seem very contemporary, the roots of ASCR are quite deep—long predating the creation of DOE. Applied mathematics and advanced computing were both elements of the Theoretical Division of the Manhattan Project. In the early 1950s, the Manhattan Project scientist and mathematician John von Neumann, then a commissioner for the AEC (Atomic Energy Commission), advocated for the creation of a Mathematics program to support the continued development and applications of digital computing. Los Alamos National Laboratory (LANL) scientist John Pasta created such a program to fund researchers at universities and AEC laboratories. Under several organizational name changes, this program has persisted ever since, and would eventually grow to become ASCR. %I US Department of Energy’s Office of Advanced Scientific Computing Research %8 2020-06 %G eng %U https://www.osti.gov/servlets/purl/1631812 %R https://doi.org/10.2172/1631812 %0 Generic %D 2020 %T CEED ECP Milestone Report: Improve Performance and Capabilities of CEED-Enabled ECP Applications on Summit/Sierra %A Kolev, Tzanio %A Fischer, Paul %A Abdelfattah, Ahmad %A Ananthan, Shreyas %A Valeria Barra %A Natalie Beams %A Bleile, Ryan %A Brown, Jed %A Carson, Robert %A Camier, Jean-Sylvain %A Churchfield, Matthew %A Dobrev, Veselin %A Jack Dongarra %A Dudouit, Yohann %A Karakus, Ali %A Kerkemeier, Stefan %A Lan, YuHsiang %A Medina, David %A Merzari, Elia %A Min, Misun %A Parker, Scott %A Ratnayaka, Thilina %A Smith, Cameron %A Sprague, Michael %A Stitt, Thomas %A Thompson, Jeremy %A Tomboulides, Ananias %A Stanimire Tomov %A Tomov, Vladimir %A Vargas, Arturo %A Warburton, Tim %A Weiss, Kenneth %B ECP Milestone Reports %I Zenodo %8 2020-05 %G eng %U https://doi.org/10.5281/zenodo.3860804 %R https://doi.org/10.5281/zenodo.3860804 %0 Conference Paper %B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2020 %T Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime %A Yu Pei %A Qinglei Cao %A George Bosilca %A Piotr Luszczek %A Victor Eijkhout %A Jack Dongarra %X Stencil computation and the general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance-limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC, which operate directly on a 2D compute grid.
Our results running on two clusters, NaCL and Stampede2, indicate that we can achieve 2× speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over the base-PaRSEC implementation on NaCL and Stampede2, respectively. %B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C New Orleans, LA %8 2020-05 %G eng %R https://doi.org/10.1109/IPDPSW50202.2020.00127 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part IV %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 668 %8 2020-06 %@ 978-3-030-50423-6 %G eng %R https://doi.org/10.1007/978-3-030-50423-6 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VII %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 775 %8 2020-06 %@ 978-3-030-50436-6 %G eng %R https://doi.org/10.1007/978-3-030-50436-6 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VI %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 667 %8 2020-06 %@ 978-3-030-50433-5 %G eng %R https://doi.org/10.1007/978-3-030-50433-5 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part V %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 618 %8 2020-06 %@ 978-3-030-50426-7 %G eng %R https://doi.org/10.1007/978-3-030-50426-7 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part III %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 648 %8 2020-06 %@ 978-3-030-50420-5 %G eng %R https://doi.org/10.1007/978-3-030-50420-5 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part II %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 697 %8 2020-06 %@ 978-3-030-50417-5 %G eng %R https://doi.org/10.1007/978-3-030-50417-5 %0 Book %B Lecture Notes in Computer Science %D 2020 %T Computational Science – ICCS
2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B Lecture Notes in Computer Science %7 1 %I Springer International Publishing %P 707 %8 2020-06 %@ 978-3-030-50371-0 %G eng %R https://doi.org/10.1007/978-3-030-50371-0 %0 Conference Paper %B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) %D 2020 %T DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models %A Bogdan Nicolae %A Jiali Li %A Justin M. Wozniak %A George Bosilca %A Matthieu Dorier %A Franck Cappello %X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead. %B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) %I IEEE %C Melbourne, VIC, Australia %8 2020-05 %G eng %R https://doi.org/10.1109/CCGrid49817.2020.00-76 %0 Conference Paper %B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020) %D 2020 %T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs %A Anne Benoit %A Valentin Le Fèvre %A Padma Raghavan %A Yves Robert %A Hongyang Sun %B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020) %I IEEE Computer Society Press %C New Orleans, LA %8 2020-05 %G eng %0 Generic %D 2020 %T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %K AMD GPUs %K GPU computing %K HIP Runtime %K HPC %K numerical linear algebra %K Portability %X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is critical to the accuracy and performance of so many different types of applications, and because DLA libraries have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs.
We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparisons with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2020-08 %G eng %0 Conference Paper %B 2020 IEEE High Performance Extreme Computing Virtual Conference %D 2020 %T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is critical to the accuracy and performance of so many different types of applications, and because DLA libraries have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparisons with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements. %B 2020 IEEE High Performance Extreme Computing Virtual Conference %I IEEE %8 2020-09 %G eng %0 Generic %D 2020 %T DTE: PaRSEC Enabled Libraries and Applications (Poster) %A George Bosilca %A Thomas Herault %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Generic %D 2020 %T DTE: PaRSEC Systems and Interfaces (Poster) %A George Bosilca %A Thomas Herault %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Conference Paper %B 13th International Workshop on Parallel Tools for High Performance Computing %D 2020 %T Effortless Monitoring of Arithmetic Intensity with PAPI's Counter Analysis Toolkit %A Daniel Barry %A Anthony Danalis %A Heike Jagode %X With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that heavily utilize numerical kernels. Performance metrics in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it becomes more and more unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity seems particularly true for events related to hardware that resides beyond the compute core, such as events related to memory traffic.
Still, traffic to memory is a necessary characteristic for determining arithmetic intensity. To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for reading and writing bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels. %B 13th International Workshop on Parallel Tools for High Performance Computing %I Springer International Publishing %C Dresden, Germany %8 2020-09 %G eng %0 Conference Paper %B Platform for Advanced Scientific Computing Conference (PASC20) %D 2020 %T Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications %A Qinglei Cao %A Yu Pei %A Kadir Akbudak %A Aleksandr Mikhalev %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %X Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, which involves a Cholesky factorization of a symmetric positive-definite covariance matrix—a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications. %B Platform for Advanced Scientific Computing Conference (PASC20) %I ACM %C Geneva, Switzerland %8 2020-06 %G eng %R https://doi.org/10.1145/3394277.3401846 %0 Journal Article %J Future Generation Computer Systems %D 2020 %T Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution %A Nuria Losada %A Patricia González %A María J. Martín %A George Bosilca %A Aurelien Bouteiller %A Keita Teranishi %K Application-level checkpointing %K MPI %K resilience %K ULFM %X The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures.
Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI Forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing overhead and obtaining the level of efficiency required on future exascale platforms. %B Future Generation Computer Systems %V 106 %P 467-481 %8 2020-05 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167739X1930860X %R https://doi.org/10.1016/j.future.2020.01.026 %0 Conference Paper %B 9th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20) %D 2020 %T FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks %A Linnan Wang %A Wei Wu %A Junyu Zhang %A Hang Liu %A George Bosilca %A Maurice Herlihy %A Rodrigo Fonseca %K FFT %K Gradient Compression %K Lossy Gradients %K Machine Learning %K Neural Networks %X The performance and efficiency of distributed training of Deep Neural Networks (DNN) highly depend on the performance of gradient averaging among participating processes, a step bound by communication costs. There are two major approaches to reduce communication overhead: overlap communications with computations (lossless), or reduce communications (lossy). The lossless solution works well for linear neural architectures, e.g. VGG, AlexNet, but more recent networks such as ResNet and Inception limit the opportunity for such overlapping. Therefore, approaches that reduce the amount of data (lossy) become more suitable. In this paper, we present a novel, explainable lossy method that sparsifies gradients in the frequency domain, in addition to a new range-based floating-point representation to quantize and further compress gradients. These dynamic techniques strike a balance between compression ratio, accuracy, and computational overhead, and are optimized to maximize performance in heterogeneous environments. Unlike existing works that strive for a higher compression ratio, we stress the robustness of our methods, and provide guidance to recover accuracy from failures. To achieve this, we prove how the FFT sparsification affects the convergence and accuracy, and show that our method is guaranteed to converge using a diminishing θ in training. Reducing θ can also be used to recover accuracy after a failure. Compared to state-of-the-art lossy methods, e.g., QSGD, TernGrad, and Top-k sparsification, our approach incurs less approximation error and thus performs better in both wall-time and accuracy. On an 8-GPU, InfiniBand-interconnected cluster, our techniques effectively accelerate AlexNet training by up to 2.26x over the no-compression baseline, 1.31x over QSGD, 1.25x over TernGrad, and 1.47x over Top-k sparsification.
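To make the frequency-domain sparsification idea in the abstract above concrete, here is a hedged Python/NumPy sketch: transform the gradient with a real FFT, keep only the largest-magnitude coefficients, and reconstruct. The fft_sparsify name and the fixed keep_fraction parameter are illustrative assumptions; the paper's actual algorithm, convergence schedule, and quantization scheme are more involved.

    # Hedged sketch of FFT-based gradient sparsification (illustrative only).
    import numpy as np

    def fft_sparsify(grad, keep_fraction=0.1):
        coeffs = np.fft.rfft(grad)                      # real-input FFT
        k = max(1, int(keep_fraction * coeffs.size))
        idx = np.argpartition(np.abs(coeffs), -k)[-k:]  # k largest coefficients
        sparse = np.zeros_like(coeffs)
        sparse[idx] = coeffs[idx]                       # drop everything else
        return np.fft.irfft(sparse, n=grad.size)        # lossy reconstruction

    g = np.random.default_rng(1).standard_normal(1024)
    g_hat = fft_sparsify(g, keep_fraction=0.25)
    print(float(np.linalg.norm(g - g_hat) / np.linalg.norm(g)))  # rel. error

Only the k retained (index, coefficient) pairs would need to be communicated, which is the source of the bandwidth savings the abstract describes.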
%B 9th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20) %I ACM %C Stockholm, Sweden %8 2020-06 %G eng %R https://doi.org/10.1145/3369583.3392681 %0 Conference Paper %B IEEE International Conference on Cluster Computing (Cluster 2020) %D 2020 %T Flexible Data Redistribution in a Task-Based Runtime System %A Qinglei Cao %A George Bosilca %A Wei Wu %A Dong Zhong %A Aurelien Bouteiller %A Jack Dongarra %X Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore decreasing the time-to-solution for the algorithm. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Recently, task-based runtime systems have gained popularity as a potential candidate to address the programming complexity on the way to exascale. In addition to an increase in portability against complex hardware and software systems, task-based runtime systems have the potential to cope more easily with less-regular data distributions, providing a more balanced computational load during the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems, which can support all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm, capable of dealing with redistribution problems without constraints on data distribution or data size, and implement it in the PaRSEC task-based runtime system. Performance results show a clear advantage over ScaLAPACK, and applications achieve increased efficiency with little overhead across data distributions and data sizes. %B IEEE International Conference on Cluster Computing (Cluster 2020) %I IEEE %C Kobe, Japan %8 2020-09 %G eng %R https://doi.org/10.1109/CLUSTER49012.2020.00032 %0 Conference Paper %B IEEE Cluster Conference %D 2020 %T HAN: A Hierarchical AutotuNed Collective Communication Framework %A Xi Luo %A Wei Wu %A George Bosilca %A Yu Pei %A Qinglei Cao %A Thananon Patinyasakdikul %A Dong Zhong %A Jack Dongarra %X High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy the increasing computational need, and this brings new challenges to the design of MPI libraries, especially with regard to collective operations. To address these challenges, we present "HAN," a new hierarchical autotuned collective communication framework in Open MPI, which selects suitable homogeneous collective communication modules as submodules for each hardware level, uses collective operations from the submodules as tasks, and organizes these tasks to perform efficient hierarchical collective operations. With a task-based design, HAN can easily swap out submodules, while keeping tasks intact, to adapt to new hardware. This makes HAN suitable for the current platform and provides a strong and flexible support for future HPC systems. To provide a fast and accurate autotuning mechanism, we present a novel cost model based on benchmarking the tasks instead of a whole collective operation. This method drastically reduces tuning time, as the cost of tasks can be reused across different message sizes, and is more accurate than existing cost models.
Our cost analysis suggests the autotuning component can find the optimal configuration in most cases. The evaluation of the HAN framework suggests our design significantly improves the default Open MPI and achieves decent speedups against state-of-the-art MPI implementations on tested applications. %B IEEE Cluster Conference %I Best Paper Award, IEEE Computer Society Press %C Kobe, Japan %8 2020-09 %G eng %0 Book Section %B Fog Computing: Theory and Practice %D 2020 %T Harnessing the Computing Continuum for Programming Our World %A Pete Beckman %A Jack Dongarra %A Nicola Ferrier %A Geoffrey Fox %A Terry Moore %A Dan Reed %A Micah Beck %X This chapter outlines a vision for how best to harness the computing continuum of interconnected sensors, actuators, instruments, and computing systems, from small numbers of very large devices to large numbers of very small devices. The hypothesis is that only via a continuum perspective can one intentionally specify desired continuum actions and effectively manage outcomes and systemic properties—adaptability and homeostasis, temporal constraints and deadlines—and elevate the discourse from device programming to intellectual goals and outcomes. Development of a framework for harnessing the computing continuum would catalyze new consumer services, business processes, social services, and scientific discovery. Realizing and implementing a continuum programming model requires balancing conflicting constraints and translating the high-level specification into a form suitable for execution on a unifying abstract machine model. In turn, the abstract machine must implement the mapping of specification demands to end-to-end resources. %B Fog Computing: Theory and Practice %I John Wiley & Sons, Inc. %@ 9781119551713 %G eng %& 7 %R https://doi.org/10.1002/9781119551713.ch7 %0 Conference Paper %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %D 2020 %T High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs %A Natalie Beams %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %A Tzanio Kolev %A Yohann Dudouit %K Batched linear algebra %K finite elements %K gpu %K high-order methods %K matrix-free FEM %K Tensor contractions %X We present new GPU implementations of the tensor contractions arising from basis-related computations for high-order finite element methods. We consider both tensor and non-tensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% for some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMM implementation is twice as fast as what was previously available for basis function order greater than five and greater than approximately 10^5 degrees of freedom in the mesh; a speedup of up to ten times is seen for eighth-order basis functions.
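At its core, the batched-GEMM formulation in the entry above amounts to applying one small basis matrix to the local data of many elements at once. The sketch below illustrates that pattern in Python/NumPy; the shapes and the names B, U, and n_elem are assumptions made for illustration and do not reflect libCEED's actual API or the paper's fused device-level kernels.

    # Illustrative batched small-GEMM pattern behind high-order FEM basis action.
    import numpy as np

    p, q, n_elem = 8, 10, 4096            # dofs/element, quad points, elements
    rng = np.random.default_rng(2)
    B = rng.standard_normal((q, p))       # small basis-evaluation matrix
    U = rng.standard_normal((n_elem, p))  # local dofs for every element

    # One batched multiply: values at quadrature points for all elements.
    V = U @ B.T
    assert V.shape == (n_elem, q)

On a GPU this single large, regular multiplication is what a batch-BLAS GEMM (or a fused device-level kernel, for tensor bases) performs, which is why grouping elements recovers high arithmetic throughput.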
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %I IEEE %8 2020-11 %G eng %0 Generic %D 2020 %T hipMAGMA v1.0 %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %I Zenodo %8 2020-03 %G eng %U https://doi.org/10.5281/zenodo.3908549 %R 10.5281/zenodo.3908549 %0 Generic %D 2020 %T hipMAGMA v2.0 %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %I Zenodo %8 2020-07 %G eng %U https://doi.org/10.5281/zenodo.3928667 %R 10.5281/zenodo.3928667 %0 Book Section %B Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC) %D 2020 %T Interoperable Convergence of Storage, Networking, and Computation %A Micah Beck %A Terry Moore %A Piotr Luszczek %A Anthony Danalis %E Kohei Arai %E Rahul Bhatia %K active networks %K distributed cloud %K distributed processing %K distributed storage %K edge computing %K network convergence %K network layering %K scalability %X In every form of digital store-and-forward communication, intermediate forwarding nodes are computers, with attendant memory and processing resources. This has inevitably stimulated efforts to create a wide-area infrastructure that goes beyond simple store-and-forward to create a platform that makes more general and varied use of the potential of this collection of increasingly powerful nodes. Historically, these efforts predate the advent of globally routed packet networking. The desire for a converged infrastructure of this kind has only intensified over the last 30 years, as memory, storage, and processing resources have increased in both density and speed while simultaneously decreasing in cost. Although there is a general consensus that it should be possible to define and deploy such a dramatically more capable wide-area platform, a great deal of investment in research prototypes has yet to produce a credible candidate architecture. Drawing on technical analysis, historical examples, and case studies, we present an argument for the hypothesis that in order to realize a distributed system with the kind of convergent generality and deployment scalability that might qualify as "future-defining," we must build it from a small set of simple, generic, and limited abstractions of the low-level resources (processing, storage and network) of its intermediate nodes. %B Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC) %I Springer International Publishing %P 667-690 %@ 978-3-030-12385-7 %G eng %0 Journal Article %J Proceedings of the Royal Society A %D 2020 %T Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems %A Azzam Haidar %A Harun Bayraktar %A Stanimire Tomov %A Jack Dongarra %A Nicholas J. Higham %K GMRES %K LU factorization %K GPU computing %K half precision arithmetic %K iterative refinement %K mixed precision solvers %X Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic.
A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability. %B Proceedings of the Royal Society A %V 476 %8 2020-11 %G eng %N 2243 %R https://doi.org/10.1098/rspa.2020.0110 %0 Generic %D 2020 %T Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing %A Azzam Haidar %A Harun Bayraktar %A Stanimire Tomov %A Jack Dongarra %A Nicholas J. Higham %X Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced- and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. We achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2020-05 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2020 %T Overhead of Using Spare Nodes %A Atsushi Hori %A Kazumi Yoshinaga %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Yutaka Ishikawa %K communication performance %K fault mitigation %K Fault tolerance %K sliding method %K spare node %X With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions.
However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed by around a factor of 10, depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an InfiniBand network connected in a fat-tree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics. %B The International Journal of High Performance Computing Applications %8 2020-02 %G eng %U https://journals.sagepub.com/doi/10.1177/1094342020901885 %! The International Journal of High Performance Computing Applications %R https://doi.org/10.1177%2F1094342020901885 %0 Generic %D 2020 %T Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX) (Poster) %A Jack Dongarra %A Heike Jagode %A Anthony Danalis %A Daniel Barry %A Vince Weaver %I 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting %C Seattle, WA %8 2020-02 %G eng %0 Conference Paper %B 2020 IEEE International Conference on Cluster Computing (CLUSTER) %D 2020 %T Predicting MPI Collective Communication Performance Using Machine Learning %A Sascha Hunold %A Abhinav Bhatele %A George Bosilca %A Peter Knees %K Auto-tuning %K GAM %K KNN %K Machine Learning %K message passing interface %K Performance Prediction %K XGBoost %X The Message Passing Interface (MPI) defines the semantics of data communication operations, while the implementing libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may be dependent on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given communication problem when called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to select the best possible algorithm for a specific case automatically. In this paper, we address the algorithm selection problem for MPI collective communication operations. To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem.
We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin. %B 2020 IEEE International Conference on Cluster Computing (CLUSTER) %I IEEE %C Kobe, Japan %8 2020-09 %G eng %R https://doi.org/10.1109/CLUSTER49012.2020.00036 %0 Generic %D 2020 %T A Report of the MPI International Survey (Poster) %A Atsushi Hori %A Takahiro Ogura %A Balazs Gerofi %A Jie Yin %A Yutaka Ishikawa %A Emmanuel Jeannot %A George Bosilca %I EuroMPI/USA '20: 27th European MPI Users' Group Meeting %C Austin, TX %8 2020-09 %G eng %0 Conference Paper %B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020) %D 2020 %T Revisiting Dynamic DAG Scheduling under Memory Constraints for Shared-Memory Platforms %A Gabriel Bathie %A Loris Marchal %A Yves Robert %A Samuel Thibault %B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020) %I IEEE Computer Society Press %C New Orleans, LA %8 2020-05 %G eng %0 Generic %D 2020 %T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic %A Ahmad Abdelfattah %A Hartwig Anzt %A Erik Boman %A Erin Carson %A Terry Cojean %A Jack Dongarra %A Mark Gates %A Thomas Gruetzmacher %A Nicholas J. Higham %A Sherry Li %A Neil Lindquist %A Yang Liu %A Jennifer Loe %A Piotr Luszczek %A Pratik Nayak %A Sri Pranesh %A Siva Rajamanickam %A Tobias Ribizel %A Barry Smith %A Kasia Swirydowicz %A Stephen Thomas %A Stanimire Tomov %A Yaohung Tsai %A Ichitaro Yamazaki %A Ulrike Meier Yang %B SLATE Working Notes %I University of Tennessee %8 2020-07 %G eng %9 SLATE Working Notes %0 Conference Paper %B International Conference for High Performance Computing Networking, Storage, and Analysis (SC20) %D 2020 %T Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance %A Elliott Slaughter %A Wei Wu %A Yuankun Fu %A Legend Brandenburg %A Nicolai Garcia %A Wilhem Kautz %A Emily Marx %A Kaleb S. Morris %A Qinglei Cao %A George Bosilca %A Seema Mirchandaney %A Wonchan Lee %A Sean Treichler %A Patrick McCormick %A Alex Aiken %X We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100μs-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system's scalability, as well as its ability to hide communication and mitigate load imbalance.
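As an annotation to the METG metric in the Task Bench abstract above: a toy Python sketch of the measurement idea (our own illustration under assumed parameters, not the Task Bench code) shrinks task granularity at fixed total work until efficiency drops below 50%.

```python
# Hypothetical sketch of the METG idea: keep total useful work constant while
# shrinking per-task granularity, and watch when runtime overhead dominates.
import time

def run_tasks(num_tasks, task_seconds):
    """Execute num_tasks busy-wait tasks of task_seconds each; return wall time."""
    start = time.perf_counter()
    for _ in range(num_tasks):
        t0 = time.perf_counter()
        while time.perf_counter() - t0 < task_seconds:
            pass                          # stand-in for real task work
    return time.perf_counter() - start

TOTAL_WORK = 0.5                          # seconds of useful work, held constant
for task_us in (10000, 1000, 100, 10):
    task_s = task_us * 1e-6
    wall = run_tasks(int(TOTAL_WORK / task_s), task_s)
    eff = TOTAL_WORK / wall
    print(f"{task_us:>6} us/task: efficiency {eff:.2f}")
    if eff < 0.5:                         # METG ~ finest granularity above 50%
        break
```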
%B International Conference for High Performance Computing Networking, Storage, and Analysis (SC20) %I ACM %8 2020-11 %G eng %U https://dl.acm.org/doi/10.5555/3433701.3433783 %0 Conference Paper %B 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) %D 2020 %T The Template Task Graph (TTG) - An Emerging Practical Dataflow Programming Paradigm for Scientific Simulation at Extreme Scale %A George Bosilca %A Robert Harrison %A Thomas Herault %A Mohammad Mahdi Javanmard %A Poornima Nookala %A Edward Valeev %K dag %K dataflow %K exascale %K graph %K High-performance computing %K workflow %X We describe TESSE, an emerging general-purpose, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. TESSE builds upon and extends the PaRSEC DAG/dataflow runtime with new Domain Specific Languages (DSLs) and new integration capabilities. Motivating this work is our belief that such a dataflow model, perhaps with applications composed in domain specific languages, can overcome many of the challenges faced by a wide variety of irregular applications that are poorly served by current programming and execution models. Two such applications from many-body physics and applied mathematics are briefly explored. This paper focuses upon the Template Task Graph (TTG), which is TESSE's main C++ API that provides a powerful work/data-flow programming model. Algorithms on spatial trees, block-sparse tensors, and wave fronts are used to illustrate the API and associated concepts, as well as to compare with related approaches. %B 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) %I IEEE %8 2020-11 %G eng %R https://doi.org/10.1109/ESPM251964.2020.00011 %0 Conference Paper %B International Conference on Computational Science (ICCS 2020) %D 2020 %T Twenty Years of Computational Science %A Valeria Krzhizhanovskaya %A Gábor Závodszky %A Michael Lees %A Jack Dongarra %A Peter Sloot %A Sérgio Brissos %A João Teixeira %B International Conference on Computational Science (ICCS 2020) %C Amsterdam, Netherlands %8 2020-06 %G eng %0 Conference Paper %B EuroMPI/USA '20: 27th European MPI Users' Group Meeting %D 2020 %T Using Advanced Vector Extensions AVX-512 for MPI Reduction %A Dong Zhong %A Qinglei Cao %A George Bosilca %A Jack Dongarra %K Instruction level parallelism %K Intel AVX2/AVX-512 %K Long vector extension %K MPI reduction operation %K Single instruction multiple data %K Vector operation %X As the scale of high-performance computing (HPC) systems continues to grow, researchers have devoted themselves to exploring increasing levels of parallelism to achieve optimal performance. The design of modern CPUs, including hierarchical memory and SIMD/vectorization capabilities, governs the efficiency of algorithms. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations.
With these optimizations, we achieve higher efficiency for local computations, which directly reduces the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512 optimized reduction operations achieve a 10X performance benefit over the default Open MPI for local MPI reduction. %B EuroMPI/USA '20: 27th European MPI Users' Group Meeting %C Austin, TX %8 2020-09 %G eng %R https://doi.org/10.1145/3416315.3416316 %0 Generic %D 2020 %T Using Advanced Vector Extensions AVX-512 for MPI Reduction (Poster) %A Dong Zhong %A George Bosilca %A Qinglei Cao %A Jack Dongarra %I EuroMPI/USA '20: 27th European MPI Users' Group Meeting %C Austin, TX %8 2020-09 %G eng %0 Conference Paper %B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020) %D 2020 %T Using Arm Scalable Vector Extension to Optimize Open MPI %A Dong Zhong %A Pavel Shamis %A Qinglei Cao %A George Bosilca %A Jack Dongarra %K ARMIE %K datatype pack and unpack %K local reduction %K non-contiguous accesses %K SVE %K Vector Length Agnostic %X As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be exploited to achieve optimal performance. As recent processors support wide vector extensions, vectorization has become much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE) - an optional separate architectural extension with a new set of A64 instruction encodings, which enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in the Arm SVE Instruction Set Architecture (ISA), and utilize those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node, but also achieve a more efficient message-exchange communication scheme. The resulting efforts have been implemented in the context of Open MPI, providing efficient and scalable capabilities of SVE usage and extending the possible implementations of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios with both a simulator and Fujitsu's A64FX processor demonstrates that the solution is at the same time generic and efficient.
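As an annotation to the two vectorized-reduction abstracts above: a hedged mpi4py/NumPy sketch (our own, not the papers' Open MPI C code; NumPy's vectorized add stands in for the AVX-512/SVE intrinsics, and the float64 dtype is an assumption) of an MPI reduction whose local combine step is a single vector operation.

```python
# Hypothetical sketch: an MPI reduction whose *local* combine is one vectorized
# operation over the whole buffer, instead of an element-by-element loop.
import numpy as np
from mpi4py import MPI

def vec_sum(inbuf, inoutbuf, datatype):
    a = np.frombuffer(inbuf, dtype=np.float64)     # dtype assumed float64
    b = np.frombuffer(inoutbuf, dtype=np.float64)
    b += a                                          # single vectorized add

VSUM = MPI.Op.Create(vec_sum, commute=True)

comm = MPI.COMM_WORLD
local = np.full(1 << 20, comm.Get_rank(), dtype=np.float64)
result = np.empty_like(local)
comm.Allreduce(local, result, op=VSUM)
if comm.Get_rank() == 0:
    print(result[0])                                # sum of ranks 0..size-1
VSUM.Free()
```

The point of the papers is precisely that this local combine, done with wide vector instructions instead of scalar loops, directly lowers the cost of the whole collective.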
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020) %I IEEE/ACM %C Melbourne, Australia %8 2020-05 %G eng %R https://doi.org/10.1109/CCGrid49817.2020.00-71 %0 Generic %D 2020 %T xSDK4ECP: Extreme-scale Scientific Software Development Kit for ECP (Poster) %A Roscoe Bartlett %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Journal Article %J Parallel Computing %D 2019 %T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Marc Baboulin %A Joël Falcou %A Jack Dongarra %K Autotuning %K Batched GEMM %K HPC %K Matrix-matrix product %K optimization %K Small matrices %X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. %B Parallel Computing %V 81 %P 1–21 %8 2019-01 %G eng %R https://doi.org/10.1016/j.parco.2018.10.003 %0 Conference Paper %B Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19) %D 2019 %T Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications %A Nuria Losada %A Aurelien Bouteiller %A George Bosilca %K checkpoint/restart %K Fault tolerance %K Message logging %K MPI %K ULFM %K User Level Fault Mitigation %X With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log.
This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solved both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation (ULFM) fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions. %B Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19) %8 2019-11 %G eng %U https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_ftxs103s2-file1.pdf %0 Generic %D 2019 %T BDEC2 Platform White Paper %A Todd Gamblin %A Pete Beckman %A Kate Keahey %A Kento Sato %A Masaaki Kondo %A Balazs Gerofi %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2019-09 %G eng %0 Generic %D 2019 %T CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps %A Stanimire Tomov %A Ahmad Abdelfattah %A Valeria Barra %A Natalie Beams %A Jed Brown %A Jean-Sylvain Camier %A Veselin Dobrev %A Jack Dongarra %A Yohann Dudouit %A Paul Fischer %A Ali Karakus %A Stefan Kerkemeier %A Tzanio Kolev %A YuHsiang Lan %A Elia Merzari %A Misun Min %A Aleks Obabko %A Scott Parker %A Thilina Ratnayaka %A Jeremy Thompson %A Ananias Tomboulides %A Vladimir Tomov %A Tim Warburton %I Zenodo %8 2019-10 %G eng %R https://doi.org/10.5281/zenodo.3477618 %0 Generic %D 2019 %T CEED ECP Milestone Report: Public release of CEED 2.0 %A Jed Brown %A Ahmad Abdelfattah %A Valeria Barra %A Veselin Dobrev %A Yohann Dudouit %A Paul Fischer %A Tzanio Kolev %A David Medina %A Misun Min %A Thilina Ratnayaka %A Cameron Smith %A Jeremy Thompson %A Stanimire Tomov %A Vladimir Tomov %A Tim Warburton %I Zenodo %8 2019-04 %G eng %U https://doi.org/10.5281/zenodo.2641316 %R 10.5281/zenodo.2641316 %0 Conference Paper %B 2019 International Conference on Parallel Computing (ParCo2019) %D 2019 %T Characterization of Power Usage and Performance in Data-Intensive Applications using MapReduce over MPI %A Joshua Davis %A Tao Gao %A Sunita Chandrasekaran %A Heike Jagode %A Anthony Danalis %A Pavan Balaji %A Jack Dongarra %A Michela Taufer %B 2019 International Conference on Parallel Computing (ParCo2019) %C Prague, Czech Republic %8 2019-09 %G eng %0 Journal Article %J International Journal of Networking and Computing %D 2019 %T Checkpointing Strategies for Shared High-Performance Computing Platforms %A Thomas Herault %A Yves Robert %A Aurelien Bouteiller %A Dorian Arnold %A Kurt Ferreira %A George Bosilca %A Jack Dongarra %X Input/output (I/O) from various sources often contends for scarce bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments.
However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem. %B International Journal of Networking and Computing %V 9 %P 28–52 %G eng %U http://www.ijnc.org/index.php/ijnc/article/view/195 %0 Generic %D 2019 %T A Collection of Presentations from the BDEC2 Workshop in Kobe, Japan %A Rosa M. Badia %A Micah Beck %A François Bodin %A Taisuke Boku %A Franck Cappello %A Alok Choudhary %A Carlos Costa %A Ewa Deelman %A Nicola Ferrier %A Katsuki Fujisawa %A Kohei Fujita %A Maria Girone %A Geoffrey Fox %A Shantenu Jha %A Yoshinari Kameda %A Christian Kniep %A William Kramer %A James Lin %A Kengo Nakajima %A Yiwei Qiu %A Kishore Ramachandran %A Glenn Ricart %A Kim Serradell %A Dan Stanzione %A Lin Gan %A Martin Swany %A Christine Sweeney %A Alex Szalay %A Christine Kirkpatrick %A Kenton McHenry %A Alainna White %A Steve Tuecke %A Ian Foster %A Joe Mambretti %A William M. Tang %A Michela Taufer %A Miguel Vázquez %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2019-02 %G eng %0 Generic %D 2019 %T A Collection of White Papers from the BDEC2 Workshop in Poznan, Poland %A Gabriel Antoniu %A Alexandru Costan %A Ovidiu Marcu %A Maria S. Pérez %A Nenad Stojanovic %A Rosa M. Badia %A Miguel Vázquez %A Sergi Girona %A Micah Beck %A Terry Moore %A Piotr Luszczek %A Ezra Kissel %A Martin Swany %A Geoffrey Fox %A Vibhatha Abeykoon %A Selahattin Akkas %A Kannan Govindarajan %A Gurhan Gunduz %A Supun Kamburugamuve %A Niranda Perera %A Ahmet Uyar %A Pulasthi Wickramasinghe %A Chathura Widanage %A Maria Girone %A Toshihiro Hanawa %A Richard Moreno %A Ariel Oleksiak %A Martin Swany %A Ryousei Takano %A M.P. van Haarlem %A J. van Leeuwen %A J.B.R. Oonk %A T. Shimwell %A L.V.E. Koopmans %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2019-05 %G eng %0 Generic %D 2019 %T A Collection of White Papers from the BDEC2 Workshop in San Diego, CA %A Ilkay Altintas %A Kyle Marcus %A Volkan Vural %A Shweta Purawat %A Daniel Crawl %A Gabriel Antoniu %A Alexandru Costan %A Ovidiu Marcu %A Prasanna Balaprakash %A Rongqiang Cao %A Yangang Wang %A Franck Cappello %A Robert Underwood %A Sheng Di %A Justin M. Wozniak %A Jon C.
Calhoun %A Cong Xu %A Antonio Lain %A Paolo Faraboschi %A Nic Dube %A Dejan Milojicic %A Balazs Gerofi %A Maria Girone %A Viktor Khristenko %A Tony Hey %A Ezra Kissel %A Yu Liu %A Richard Loft %A Pekka Manninen %A Sebastian von Alfthan %A Takemasa Miyoshi %A Bruno Raffin %A Olivier Richard %A Denis Trystram %A Maryam Rahnemoonfar %A Robin Murphy %A Joel Saltz %A Kentaro Sano %A Rupak Roy %A Kento Sato %A Jian Guo %A Jens Domke %A Weikuan Yu %A Takaki Hatsui %A Yasumasa Joti %A Alex Szalay %A William M. Tang %A Michael R. Wyatt II %A Michela Taufer %A Todd Gamblin %A Stephen Herbein %A Adam Moody %A Dong H. Ahn %A Rich Wolski %A Chandra Krintz %A Fatih Bakir %A Wei-tsung Lin %A Gareth George %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2019-10 %G eng %0 Journal Article %J International Journal of Networking and Computing %D 2019 %T Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors %A Anne Benoit %A Aurelien Cavelan %A Florina M. Ciorba %A Valentin Le Fèvre %A Yves Robert %K checkpoint %K fail-stop error %K silent error %K HPC %K linear workflow %K Replication %X Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques, lead to improved performance. %B International Journal of Networking and Computing %V 9 %P 2-27 %8 2019 %G eng %U http://www.ijnc.org/index.php/ijnc/article/view/194 %0 Journal Article %J Parallel Computing %D 2019 %T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms %A Valentin Le Fèvre %A Thomas Herault %A Yves Robert %A Aurelien Bouteiller %A Atsushi Hori %A George Bosilca %A Jack Dongarra %B Parallel Computing %V 85 %P 1–12 %8 2019-07 %G eng %R https://doi.org/10.1016/j.parco.2019.02.002 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2019 %T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms %A Guillaume Aupy %A Anne Benoit %A Brice Goglin %A Loïc Pottier %A Yves Robert %K cache partitioning %K chip multiprocessor %K co-scheduling %K HPC application %X With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units accessing a global shared memory is constantly increasing. Co-scheduling techniques are used to improve application throughput on such architectures, but sharing resources often generates critical interferences.
In this article, we focus on the interferences in the last level of cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) How to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) how many cores and cache fractions should be assigned to each application to maximize the platform efficiency? Here, platform efficiency is defined as maximizing the performance either globally, or as guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms. %B International Journal of High Performance Computing Applications %V 33 %P 1221-1239 %8 2019-11 %G eng %N 6 %R https://doi.org/10.1177/1094342019846956 %0 Conference Paper %B 5th EAI International Conference on Smart Objects and Technologies for Social Good %D 2019 %T Data Logistics: Toolkit and Applications %A Micah Beck %A Terry Moore %A Nancy French %A Ezra Kissel %A Martin Swany %B 5th EAI International Conference on Smart Objects and Technologies for Social Good %C Valencia, Spain %8 2019-09 %G eng %0 Conference Paper %B PAW-ATM Workshop at SC19 %D 2019 %T Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization %A Yu Pei %A George Bosilca %A Ichitaro Yamazaki %A Akihiro Ida %A Jack Dongarra %B PAW-ATM Workshop at SC19 %I ACM %C Denver, CO %8 2019-11 %G eng %0 Conference Paper %B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2019 %T Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC %A Thomas Herault %A Yves Robert %A George Bosilca %A Jack Dongarra %B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I IEEE %C Denver, CO %8 2019-11 %G eng %0 Conference Paper %B IEEE Cluster %D 2019 %T Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs %A Thananon Patinyasakdikul %A David Eberius %A George Bosilca %A Nathan Hjelm %K communication contention %K MPI %K thread %X The Message Passing Interface (MPI) has been one of the most prominent programming paradigms in high-performance computing (HPC) for the past decade. Lately, with changes in modern hardware leading to a drastic increase in the number of processor cores, developers of parallel applications are moving toward more integrated parallel programming paradigms, where MPI is used along with other, possibly node-level, programming paradigms, or MPI+X. MPI+threads has emerged as one of the favorite choices, according to a survey of the HPC community. However, threading support in MPI comes with many compromises to the overall performance delivered, and, therefore, its adoption has been limited. This paper studies in depth the multi-threaded implementation design in one of the leading MPI implementations, Open MPI, and exposes some of the shortcomings of the current design. We propose, implement, and evaluate a new design of the internal handling of communication progress which allows for a significant boost in multi-threading performance, increasing the viability of MPI in the MPI+X programming paradigm.
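As an annotation to the multithreaded-MPI study above: a minimal mpi4py sketch (our own illustration; it assumes an MPI build providing MPI_THREAD_MULTIPLE and an even number of ranks) of the MPI+threads pattern the paper examines, with several threads per rank driving independent point-to-point exchanges disambiguated by tag.

```python
# Hypothetical sketch of MPI+threads: multiple threads per rank issue
# concurrent point-to-point exchanges, separated by message tag.
import threading
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
assert MPI.Query_thread() == MPI.THREAD_MULTIPLE, "need MPI_THREAD_MULTIPLE"

def worker(tid):
    peer = comm.Get_rank() ^ 1          # pair ranks 0<->1, 2<->3, ... (even size assumed)
    sendbuf = np.full(1024, tid, dtype=np.float64)
    recvbuf = np.empty_like(sendbuf)
    comm.Sendrecv(sendbuf, dest=peer, sendtag=tid,
                  recvbuf=recvbuf, source=peer, recvtag=tid)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Contention on the MPI library's internal progress engine under exactly this kind of concurrent access is the bottleneck the paper's redesign targets.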
%B IEEE Cluster %I IEEE %C Albuquerque, NM %8 2019-09 %G eng %0 Conference Paper %B Workshop on Exascale MPI (ExaMPI) at SC19 %D 2019 %T Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation %A Alan Ayala %A Stanimire Tomov %A Xi Luo %A Hejer Shaiek %A Azzam Haidar %A George Bosilca %A Jack Dongarra %K Collective MPI %K Exascale applications %K FFT %K Heterogeneous systems %K scalable %B Workshop on Exascale MPI (ExaMPI) at SC19 %C Denver, CO %8 2019-11 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2019 %T Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging %A Nuria Losada %A George Bosilca %A Aurelien Bouteiller %A Patricia González %A María J. Martín %K Application-level checkpointing %K Local rollback %K Message logging %K MPI %K resilience %X The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface – the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard – enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level—thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements, and, overall, the resilience impact on the applications. %B Future Generation Computer Systems %V 91 %P 450-464 %8 2019-02 %G eng %R https://doi.org/10.1016/j.future.2018.09.041 %0 Conference Paper %B ISC High Performance %D 2019 %T MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing %A Daniel Nichols %A Natalie-Sofia Tomov %A Frank Betancourt %A Stanimire Tomov %A Kwai Wong %A Jack Dongarra %X In this paper, we present work towards the development of a new data analytics and machine learning (ML) framework, called MagmaDNN. Our main goal is to provide scalable, high-performance data analytics and ML solutions for scientific applications running on current and upcoming heterogeneous many-core GPU-accelerated architectures. To this end, since many of the functionalities needed are based on standard linear algebra (LA) routines, we designed MagmaDNN to derive its performance power from the MAGMA library.
The close integration provides the fundamental (scalable high-performance) LA routines available in MAGMA as a backend to MagmaDNN. We present some design issues for performance and scalability that are specific to ML using Deep Neural Networks (DNN), as well as the MagmaDNN designs towards overcoming them. In particular, MagmaDNN uses well-established HPC techniques from the area of dense LA, including task-based parallelization, DAG representations, scheduling, mixed-precision algorithms, asynchronous solvers, and autotuned hyperparameter optimization. We illustrate these techniques and their incorporation and use to outperform other currently available frameworks. %B ISC High Performance %I Springer International Publishing %C Frankfurt, Germany %8 2019-06 %G eng %R https://doi.org/10.1007/978-3-030-34356-9_37 %0 Conference Paper %B International Parallel and Distributed Processing Symposium (IPDPS) %D 2019 %T Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation %A Zhaojun Bai %A Jack Dongarra %A Ding Lu %A Ichitaro Yamazaki %X Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that reduces both the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. Speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively. %B International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Rio de Janeiro, Brazil %8 2019-05 %G eng %0 Generic %D 2019 %T New Robust ScaLAPACK Routine for Computing the QR Factorization with Column Pivoting %A Zvonimir Bujanovic %A Zlatko Drmac %X In this note we describe two modifications of the ScaLAPACK subroutines PxGEQPF for computing the QR factorization with the Businger-Golub column pivoting. First, we resolve a subtle numerical instability in the same way as we did for the LAPACK subroutines xGEQPF, xGEQP3 in 2006 [LAPACK Working Note 176 (2006); ACM Trans. Math. Softw. 2008]. The problem originates in the first release of LINPACK in the 1970s: due to severe cancellations in the down-dating of partial column norms, the pivoting procedure may be completely in the dark about the true norms of the pivot column candidates. This may cause mis-pivoting, and as a result loss of the important rank-revealing structure of the computed triangular factor, with severe consequences for other solvers that rely on rank-revealing pivoting. The instability is so subtle that e.g.
inserting a WRITE statement or changing the process topology can drastically change the result. Secondly, we also correct a programming error in the complex subroutines PCGEQPF, PZGEQPF, which also causes wrong pivoting because of erroneous use of PSCNRM2, PDZNRM2 for the explicit norm computation. %B LAPACK Working Note %I University of Tennessee %8 2019-10 %G eng %0 Conference Paper %B Practice and Experience in Advanced Research Computing (PEARC ’19) %D 2019 %T OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework %A Frank Betancourt %A Kwai Wong %A Efosa Asemota %A Quindell Marshall %A Daniel Nichols %A Stanimire Tomov %B Practice and Experience in Advanced Research Computing (PEARC ’19) %I ACM %C Chicago, IL %8 2019-07 %G eng %0 Conference Paper %B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19 %D 2019 %T Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools %A Qinglei Cao %A Yu Pei %A Thomas Herault %A Kadir Akbudak %A Aleksandr Mikhalev %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19 %I ACM %C Denver, CO %8 2019-11 %G eng %0 Journal Article %J Parallel Computing %D 2019 %T Performance of Asynchronous Optimized Schwarz with One-sided Communication %A Ichitaro Yamazaki %A Edmond Chow %A Aurelien Bouteiller %A Jack Dongarra %X In asynchronous iterative methods on distributed-memory computers, processes update their local solutions using data from other processes without an implicit or explicit global synchronization that corresponds to advancing the global iteration counter. In this work, we test the asynchronous optimized Schwarz domain-decomposition iterative method using various one-sided (remote direct memory access) communication schemes with passive target completion. The results show that when one-sided communication is well-supported, the asynchronous version of optimized Schwarz can outperform the synchronous version even for perfectly balanced partitionings of the problem on a supercomputer with uniform nodes. %B Parallel Computing %V 86 %P 66-81 %8 2019-08 %G eng %U http://www.sciencedirect.com/science/article/pii/S0167819118301261 %R https://doi.org/10.1016/j.parco.2019.05.004 %0 Journal Article %J ACM Transactions on Mathematical Software %D 2019 %T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Maksims Abalenkovs %A Negin Bagherpour %A Sven Hammarling %A Jakub Sistek %B ACM Transactions on Mathematical Software %V 45 %8 2019-06 %G eng %N 2 %R https://doi.org/10.1145/3264491 %0 Conference Paper %B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19) %D 2019 %T Replication is More Efficient Than You Think %A Anne Benoit %A Thomas Herault %A Valentin Le Fèvre %A Yves Robert %B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19) %I ACM Press %C Denver, CO %8 2019-11 %G eng %0 Conference Paper %B European MPI Users' Group Meeting (EuroMPI '19) %D 2019 %T Runtime Level Failure Detection and Propagation in HPC Systems %A Dong Zhong %A Aurelien Bouteiller %A Xi Luo %A George Bosilca %X As the scale of high-performance computing (HPC) systems continues to grow, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease.
In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present here the design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable capabilities of fault management to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is at the same time generic and efficient. %B European MPI Users' Group Meeting (EuroMPI '19) %I ACM %C Zürich, Switzerland %8 2019-09 %@ 978-1-4503-7175-9 %G eng %R https://doi.org/10.1145/3343211.3343225 %0 Book Section %B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project %D 2019 %T System Software for Many-Core and Multi-Core Architectures %A Atsushi Hori %A Tsujita, Yuichi %A Shimada, Akio %A Yoshinaga, Kazumi %A Mitaro, Namiki %A Fukazawa, Go %A Sato, Mikiko %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %E Sato, Mitsuhisa %X In this project, software technologies for post-peta-scale computing were explored. More specifically, OS technologies for heterogeneous architectures, lightweight threads, scalable I/O, and fault mitigation were investigated. As for the OS technologies, a new parallel execution model, Partitioned Virtual Address Space (PVAS), for the many-core CPU was proposed. For heterogeneous architectures, where a multi-core CPU and a many-core CPU are connected via an I/O bus, an extension of PVAS, Multiple-PVAS, was proposed to provide a unified virtual address space across multi-core and many-core CPUs. The proposed PVAS was also enhanced to support multiple processes where process context switches can take place at the user level (named User-Level Process: ULP). As for scalable I/O, EARTH, a set of optimization techniques for MPI collective I/O, was proposed. Lastly, for fault mitigation, User Level Fault Mitigation (ULFM) was improved to have a faster agreement process, and sliding methods to substitute failed nodes with spare nodes were proposed. The funding of this project ended in 2016; however, many of the proposed technologies are still being advanced. %B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project %I Springer Singapore %C Singapore %P 59–75 %@ 978-981-13-1924-2 %G eng %U https://doi.org/10.1007/978-981-13-1924-2_4 %R 10.1007/978-981-13-1924-2_4 %0 Conference Paper %B 2019 European Conference on Parallel Processing (Euro-Par 2019) %D 2019 %T Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring %A Shu-Mei Tseng %A Bogdan Nicolae %A George Bosilca %A Emmanuel Jeannot %A Aparna Chandramowlishwaran %A Franck Cappello %X Stealing network bandwidth helps a variety of HPC runtimes and services to run additional operations in the background without negatively affecting the applications.
A key ingredient to make this possible is an accurate prediction of the future network utilization, enabling the runtime to plan the background operations in advance, so as to avoid competing with the application for network bandwidth. In this paper, we propose a portable deep learning predictor that only uses the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not have an initial training phase; instead, it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state-of-the-art on two representative applications. %B 2019 European Conference on Parallel Processing (Euro-Par 2019) %I Springer %C Göttingen, Germany %8 2019-08 %G eng %R https://doi.org/10.1007/978-3-030-29400-7_4 %0 Generic %D 2019 %T Understanding Native Event Semantics %A Anthony Danalis %A Heike Jagode %A Daniel Barry %A Jack Dongarra %I 9th JLESC Workshop %C Knoxville, TN %8 2019-04 %G eng %0 Conference Paper %B 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) %D 2019 %T Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training %A Jiali Li %A Bogdan Nicolae %A Justin M. Wozniak %A George Bosilca %X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as number of nodes, number of workers, threads per node, and batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction and weight updates mitigates the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.
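As an annotation to the data-parallel training study above: a minimal mpi4py sketch (our own illustration on a least-squares toy model, not Horovod code) of the synchronous all-reduce step that the paper analyzes, where every rank averages gradients before an identical weight update.

```python
# Hypothetical sketch of synchronous data parallelism: each rank computes a
# gradient on its local shard; an all-reduce averages gradients so that every
# rank applies the same weight update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())

w = np.zeros(10)                                   # replicated model weights
X = rng.normal(size=(256, 10))                     # local data shard
y = rng.normal(size=256)

for step in range(100):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)        # local least-squares gradient
    gsum = np.empty_like(grad)
    comm.Allreduce(grad, gsum, op=MPI.SUM)         # the synchronization point
    w -= 0.01 * (gsum / comm.Get_size())           # identical update everywhere
```

The Allreduce is the global synchronization whose interleaving with computation, and sensitivity to stragglers, the paper measures at fine granularity.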
%B 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) %I IEEE %C Denver, CO %8 2019-11 %G eng %R https://doi.org/10.1109/MLHPC49564.2019.00006 %0 Conference Paper %B The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18) %D 2018 %T ADAPT: An Event-Based Adaptive Collective Communication Framework %A Xi Luo %A Wei Wu %A George Bosilca %A Thananon Patinyasakdikul %A Linnan Wang %A Jack Dongarra %X The increase in scale and heterogeneity of high-performance computing (HPC) systems makes the performance of Message Passing Interface (MPI) collective communications susceptible to noise and forces it to adapt to a complex mix of hardware capabilities. The designs of state-of-the-art MPI collectives heavily rely on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. Therefore, such a design philosophy must be reconsidered to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI, using event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations, while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries of different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance against noise compared to other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using ADAPT event-based broadcast and reduce operations. %B The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18) %I ACM Press %C Tempe, Arizona %8 2018-06 %@ 9781450357852 %G eng %R 10.1145/3208040.3208054 %0 Generic %D 2018 %T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Marc Baboulin %A Joël Falcou %A Jack Dongarra %X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest.
We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1,600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. %B Innovative Computing Laboratory Technical Report %I Innovative Computing Laboratory, University of Tennessee %8 2018-09 %G eng %0 Journal Article %J Proceedings of the IEEE %D 2018 %T Autotuning in High-Performance Computing Applications %A Prasanna Balaprakash %A Jack Dongarra %A Todd Gamblin %A Mary Hall %A Jeffrey Hollingsworth %A Boyana Norris %A Richard Vuduc %K High-performance computing %K performance tuning programming systems %X Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications. To date, autotuning has been used primarily in high-performance applications through tunable libraries or previously tuned application code that is integrated directly into the application. This paper draws on the authors' extensive experience applying autotuning to high-performance applications, describing both successes and future challenges. If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures. In particular, tools that configure the application must be integrated into the application build process so that tuning can be reapplied as the application and target architectures evolve. %B Proceedings of the IEEE %V 106 %P 2068–2083 %8 2018-11 %G eng %N 11 %R 10.1109/JPROC.2018.2841200 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry %A Mark Asch %A Terry Moore %A Rosa M. Badia %A Micah Beck %A Pete Beckman %A Thierry Bidot %A François Bodin %A Franck Cappello %A Alok Choudhary %A Bronis R. de Supinski %A Ewa Deelman %A Jack Dongarra %A Anshu Dubey %A Geoffrey Fox %A Haohuan Fu %A Sergi Girona %A Michael Heroux %A Yutaka Ishikawa %A Kate Keahey %A David Keyes %A William T. Kramer %A Jean-François Lavignon %A Yutong Lu %A Satoshi Matsuoka %A Bernd Mohr %A Stéphane Requena %A Joel Saltz %A Thomas Schulthess %A Rick Stevens %A Martin Swany %A Alexander Szalay %A William Tang %A Gaël Varoquaux %A Jean-Pierre Vilotte %A Robert W.
Wisniewski %A Zhiwei Xu %A Igor Zacharov %X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review. %B The International Journal of High Performance Computing Applications %V 32 %P 435–479 %8 2018-07 %G eng %N 4 %R 10.1177/1094342018778123 %0 Generic %D 2018 %T A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN %A James Ahrens %A Christopher M. Biwer %A Alexandru Costan %A Gabriel Antoniu %A Maria S. Pérez %A Nenad Stojanovic %A Rosa Badia %A Oliver Beckstein %A Geoffrey Fox %A Shantenu Jha %A Micah Beck %A Terry Moore %A Sunita Chandrasekaran %A Carlos Costa %A Thierry Deutsch %A Luigi Genovese %A Tarek El-Ghazawi %A Ian Foster %A Dennis Gannon %A Toshihiro Hanawa %A Tevfik Kosar %A William Kramer %A Madhav V. Marathe %A Christopher L. Barrett %A Takemasa Miyoshi %A Alex Pothen %A Ariful Azad %A Judy Qiu %A Bo Peng %A Ravi Teja %A Sahil Tyagi %A Chathura Widanage %A Jon Koskey %A Maryam Rahnemoonfar %A Umakishore Ramachandran %A Miles Deegan %A William Tang %A Osamu Tatebe %A Michela Taufer %A Michel Cuende %A Ewa Deelman %A Trilce Estrada %A Rafael Ferreira Da Silva %A Harel Weinstein %A Rodrigo Vargas %A Miwako Tsuji %A Kevin G.
Yager %A Wanling Gao %A Jianfeng Zhan %A Lei Wang %A Chunjie Luo %A Daoyi Zheng %A Xu Wen %A Rui Ren %A Chen Zheng %A Xiwen He %A Hainan Ye %A Haoning Tang %A Zheng Cao %A Shujie Zhang %A Jiahui Dai %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2018-11 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2018 %T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing %A Anne Benoit %A Aurelien Cavelan %A Franck Cappello %A Padma Raghavan %A Yves Robert %A Hongyang Sun %K checkpointing %K fail-stop errors %K Fault tolerance %K High-performance computing %K Replication %K silent errors %X This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms. Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detection mechanism. To detect silent errors, many application-specific techniques are available, either based on algorithms (e.g., ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors, and even to correct some of them, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as it also does when fail-stop errors have struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that nicely corroborate the analytical model. %B Journal of Parallel and Distributed Computing %V 122 %P 209–225 %8 2018-12 %G eng %R 10.1016/j.jpdc.2018.08.002 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2018 %T Co-Scheduling Amdahl Applications on Cache-Partitioned Systems %A Guillaume Aupy %A Anne Benoit %A Sicheng Dai %A Loïc Pottier %A Padma Raghavan %A Yves Robert %A Manu Shantharam %K cache partitioning %K co-scheduling %K complexity results %X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine.
Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are as follows: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed. %B International Journal of High Performance Computing Applications %V 32 %P 123–138 %8 2018-01 %G eng %N 1 %R 10.1177/1094342017710806 %0 Conference Paper %B Cluster 2018 %D 2018 %T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms %A Guillaume Aupy %A Anne Benoit %A Brice Goglin %A Loïc Pottier %A Yves Robert %B Cluster 2018 %I IEEE Computer Society Press %C Belfast, UK %8 2018-09 %G eng %0 Generic %D 2018 %T Data Movement Interfaces to Support Dataflow Runtimes %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %X This document presents the design study and reports on the implementation of portable hosted accelerator device support in the PaRSEC Dataflow Tasking at Exascale runtime, undertaken as part of the ECP contract 17-SC-20-SC. The document discusses different technological approaches to transfer data to/from hosted accelerators, issues recommendations for technology providers, and presents the design of an OpenMP-based accelerator support in PaRSEC. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-05 %G eng %0 Generic %D 2018 %T Distributed Termination Detection for HPC Task-Based Environments %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Valentin Le Fèvre %A Yves Robert %A Jack Dongarra %X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-06 %G eng %0 Conference Paper %B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids %D 2018 %T Do moldable applications perform better on failure-prone HPC platforms?
%A Valentin Le Fèvre %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Atsushi Hori %A Yves Robert %A Jack Dongarra %X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage. %B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids %S LNCS %I Springer Verlag %C Turin, Italy %8 2018-08 %G eng %0 Conference Proceedings %B OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence %D 2018 %T Evaluating Contexts in OpenSHMEM-X Reference Implementation %A Aurelien Bouteiller %A Pophale, Swaroop %A Swen Boehm %A Baker, Matthew B. %A Manjunath Gorentla Venkata %E Manjunath Gorentla Venkata %E Imam, Neena %E Pophale, Swaroop %X Many-core processors are now ubiquitous in supercomputing. This evolution pushes toward the adoption of mixed models in which cores are exploited with threading models (and related programming abstractions, such as OpenMP), while communication between distributed memory domains employs a communication Application Programming Interface (API). OpenSHMEM is a partitioned global address space communication specification that exposes one-sided and synchronization operations. As the threaded semantics of OpenSHMEM are being fleshed out by its standardization committee, it is important to assess the soundness of the proposed concepts. This paper implements and evaluates the "context" extension in relation to threaded operations. We discuss the implementation challenges of the context and the associated API in OpenSHMEM-X. We then evaluate its performance in threaded situations on the Infiniband network using micro-benchmarks and the Random Access benchmark and see that adding communication contexts significantly improves the message rate achievable by the executing multi-threaded PEs. %B OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence %I Springer International Publishing %C Cham %P 50–62 %@ 978-3-319-73814-7 %G eng %R 10.1007/978-3-319-73814-7_4 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T A Failure Detector for HPC Platforms %A George Bosilca %A Aurelien Bouteiller %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Pierre Sens %A Jack Dongarra %K failure detection %K Fault tolerance %K MPI %X Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds.
The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage uses a nonuniform variant of a reliable broadcast over a circulant graph overlay network and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan supercomputer at Oak Ridge National Laboratory, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm. %B The International Journal of High Performance Computing Applications %V 32 %P 139–158 %8 2018-01 %G eng %N 1 %R 10.1177/1094342017711505 %0 Journal Article %J Parallel Computing %D 2018 %T Incomplete Sparse Approximate Inverses for Parallel Preconditioning %A Hartwig Anzt %A Thomas Huckle %A Jürgen Bräckle %A Jack Dongarra %X In this paper, we propose a new preconditioning method that can be seen as a generalization of block-Jacobi methods, or as a simplification of the sparse approximate inverse (SAI) preconditioners. The “Incomplete Sparse Approximate Inverse” (ISAI) preconditioner is particularly efficient for the solution of sparse triangular linear systems of equations. Those arise, for example, in the context of incomplete factorization preconditioning. ISAI preconditioners can be generated via an algorithm providing fine-grained parallelism, which makes them attractive for hardware with a high concurrency level. In a study covering a large number of matrices, we identify the ISAI preconditioner as an attractive alternative to exact triangular solves in the context of incomplete factorization preconditioning. %B Parallel Computing %V 71 %P 1–22 %8 2018-01 %G eng %U http://www.sciencedirect.com/science/article/pii/S016781911730176X %! Parallel Computing %R 10.1016/j.parco.2017.10.003 %0 Journal Article %J Journal of Computational Science %D 2018 %T Multi-Level Checkpointing and Silent Error Detection for Linear Workflows %A Anne Benoit %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %X We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm. %B Journal of Computational Science %V 28 %P 398–415 %8 2018-09 %G eng %0 Conference Paper %B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award %D 2018 %T Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms %A Thomas Herault %A Yves Robert %A Aurelien Bouteiller %A Dorian Arnold %A Kurt Ferreira %A George Bosilca %A Jack Dongarra %X In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth.
Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance. %B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award %I IEEE %C Vancouver, BC, Canada %8 2018-05 %G eng %R 10.1109/IPDPSW.2018.00127 %0 Conference Paper %B The 47th International Conference on Parallel Processing (ICPP 2018) %D 2018 %T A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures %A Anne Benoit %A Swann Perarnau %A Loïc Pottier %A Yves Robert %X This work presents a realistic performance model to execute scientific workflows on high-bandwidth memory architectures such as the Intel Knights Landing. We provide a detailed analysis of the execution time on such platforms, taking into account transfers from both fast and slow memory and their overlap with computations. We discuss several scheduling and mapping strategies: not only must tasks be assigned to computing resources, but one also has to decide which fraction of input and output data will reside in fast memory, and which will have to stay in slow memory. Extensive simulations allow us to assess the impact of the mapping strategies on performance. We also conduct actual experiments for a simple 1D Gauss-Seidel kernel, which assess the accuracy of the model and further demonstrate the importance of a tuned memory management. Altogether, our model and results lay the foundations for further studies and experiments on dual-memory systems. %B The 47th International Conference on Parallel Processing (ICPP 2018) %I IEEE Computer Society Press %C Eugene, OR %8 2018-08 %G eng %0 Journal Article %J Parallel Computing %D 2018 %T PMIx: Process Management for Exascale Environments %A Ralph Castain %A Joshua Hursey %A Aurelien Bouteiller %A David Solt %B Parallel Computing %V 79 %P 9–29 %8 2018-01 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S0167819118302424 %!
Parallel Computing %R 10.1016/j.parco.2018.08.002 %0 Generic %D 2018 %T Software-Defined Events (SDEs) in MAGMA-Sparse %A Heike Jagode %A Anthony Danalis %A Hartwig Anzt %A Ichitaro Yamazaki %A Mark Hoemmen %A Erik Boman %A Stanimire Tomov %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-12 %G eng %0 Generic %D 2018 %T Solver Interface & Performance on Cori %A Hartwig Anzt %A Ichitaro Yamazaki %A Mark Hoemmen %A Erik Boman %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-06 %G eng %0 Journal Article %J Concurrency Computation: Practice and Experience %D 2018 %T A Survey of MPI Usage in the US Exascale Computing Project %A David E. Bernholdt %A Swen Boehm %A George Bosilca %A Manjunath Gorentla Venkata %A Ryan E. Grant %A Thomas Naughton %A Howard P. Pritchard %A Martin Schulz %A Geoffroy R. Vallee %K exascale %K MPI %X The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing "exascale" levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use, the Message Passing Interface (MPI), and to help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. This paper reports the results of that survey for the benefit of the broader community of MPI developers. %B Concurrency Computation: Practice and Experience %8 2018-09 %G eng %9 Special Issue %R 10.1002/cpe.4851 %0 Generic %D 2018 %T Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System %A George Bosilca %A Damien Genet %A Robert Harrison %A Thomas Herault %A Mohammad Mahdi Javanmard %A Chong Peng %A Edward Valeev %X The need for predictive simulation of electronic structure in chemistry and materials science calls for fast/reduced-scaling formulations of quantum n-body methods that replace the traditional dense tensors with element-, block-, rank-, and block-rank-sparse (data-sparse) tensors. The resulting, highly irregular data structures are a poor match to the imperative, bulk-synchronous parallel programming style due to the dynamic nature of the problem and to the lack of clear domain decomposition to guarantee a fair load-balance. The TESSE runtime and the associated programming model aim to support performance-portable composition of applications involving irregular and dynamically changing data. In this paper we report an implementation of irregular dense tensor contraction in a paradigmatic electronic structure application based on the TESSE extension of PaRSEC, a distributed hybrid task runtime system, and analyze the resulting performance on a distributed memory cluster of multi-GPU nodes. Unprecedented strong scaling and promising efficiency indicate a viable future for task-based programming of complete production-quality reduced scaling models of electronic structure.
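Illustrative sketch for the data-sparse tensors described in the entry above: dense storage is replaced by a collection of nonzero blocks, and work is performed only on blocks that exist. The Python fragment below (not code from the TESSE paper; the names are this sketch's own) contracts two block-sparse matrices stored as dictionaries of tiles, skipping absent blocks; in a dataflow runtime such as PaRSEC each tile update would become an independent task.

    import numpy as np

    def block_sparse_matmul(A, B, nblocks, bs):
        """C = A @ B for block-sparse operands stored as {(i, k): bs-by-bs tile}."""
        C = {}
        for (i, k), a in A.items():          # iterate only over stored tiles of A
            for j in range(nblocks):
                b = B.get((k, j))
                if b is None:
                    continue                 # zero tile: no flops, no storage
                c = C.setdefault((i, j), np.zeros((bs, bs)))
                c += a @ b                   # tile GEMM; one task in a task-based runtime
        return C

For example, with A = {(0, 0): np.eye(4)} and B = {(0, 1): np.ones((4, 4))}, only the single contributing tile product is computed, which is the source of the reduced arithmetic and memory cost the abstract describes.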
%B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-12 %G eng %0 Generic %D 2017 %T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation %C Atlanta, GA %8 2017-03 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2017 %T Argobots: A Lightweight Low-Level Threading and Tasking Framework %A Sangmin Seo %A Abdelhalim Amer %A Pavan Balaji %A Cyril Bordage %A George Bosilca %A Alex Brooks %A Philip Carns %A Adrian Castello %A Damien Genet %A Thomas Herault %A Shintaro Iwasaki %A Prateek Jindal %A Sanjay Kale %A Sriram Krishnamoorthy %A Jonathan Lifflander %A Huiwei Lu %A Esteban Meneses %A Marc Snir %A Yanhua Sun %A Kenjiro Taura %A Pete Beckman %K Argobots %K context switch %K I/O %K interoperability %K lightweight %K MPI %K OpenMP %K stackable scheduler %K tasklet %K user-level thread %X In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures, or are not sufficiently powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by the user or high-level programming model. We describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version. %B IEEE Transactions on Parallel and Distributed Systems %8 2017-10 %G eng %U http://ieeexplore.ieee.org/document/8082139/ %R 10.1109/TPDS.2017.2766062 %0 Conference Paper %B 19th Workshop on Advances in Parallel and Distributed Computational Models %D 2017 %T Co-Scheduling Algorithms for Cache-Partitioned Systems %A Guillaume Aupy %A Anne Benoit %A Loïc Pottier %A Padma Raghavan %A Yves Robert %A Manu Shantharam %K Computational modeling %K Degradation %K Interference %K Mathematical model %K Program processors %K Supercomputers %K Throughput %X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine.
Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed. %B 19th Workshop on Advances in Parallel and Distributed Computational Models %I IEEE Computer Society Press %C Orlando, FL %8 2017-05 %G eng %R 10.1109/IPDPSW.2017.60 %0 Conference Proceedings %B ScalA17 %D 2017 %T Dynamic Task Discovery in PaRSEC - A data-flow task-based Runtime %A Reazul Hoque %A Thomas Herault %A George Bosilca %A Jack Dongarra %K data-flow %K dynamic task-graph %K parsec %K task-based runtime %X Successfully exploiting distributed collections of heterogeneous many-core architectures with complex memory hierarchies through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intranode features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute for both local and distributed data dependency management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD), in terms of capabilities, overhead and potential benefits. %B ScalA17 %I ACM %C Denver %8 2017-09 %@ 978-1-4503-5125-6 %G eng %U https://dl.acm.org/citation.cfm?doid=3148226.3148233 %R 10.1145/3148226.3148233 %0 Conference Paper %B ACM MultiMedia Workshop 2017 %D 2017 %T Efficient Communications in Training Large Scale Neural Networks %A Yiyang Zhao %A Linnan Wang %A Wei Wu %A George Bosilca %A Richard Vuduc %A Jinmian Ye %A Wenqi Tang %A Zenglin Xu %X We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominate overall execution time and limit parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) approach scales like O(log P).
LP also demonstrates up to 2x higher bandwidth than the Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD. %B ACM MultiMedia Workshop 2017 %I ACM %C Mountain View, CA %8 2017-10 %G eng %0 Journal Article %J ISC High Performance 2017 %D 2017 %T A Framework for Out of Memory SVD Algorithms %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Aurelien Bouteiller %A Jack Dongarra %X Many important applications – from big data analytics to information retrieval, gene expression analysis, and numerical weather prediction – require the computation of large dense singular value decompositions (SVD). In many cases the problems are too large to fit into the computer’s main memory, and thus require specialized out-of-core algorithms that use disk storage. In this paper, we analyze the SVD communications, as related to hierarchical memories, and design a class of algorithms that minimizes them. This class includes out-of-core SVDs but can also be applied between other consecutive levels of the memory hierarchy, e.g., GPU SVD using the CPU memory for large problems. We call these out-of-memory (OOM) algorithms. To design OOM SVDs, we first study the communications for both classical one-stage blocked SVD and two-stage tiled SVD. We present the theoretical analysis and strategies to design, as well as implement, these communication avoiding OOM SVD algorithms. We show performance results for multicore architectures that illustrate our theoretical findings and match our performance models. %B ISC High Performance 2017 %P 158–178 %8 2017-06 %G eng %R 10.1007/978-3-319-58667-0_9 %0 Conference Paper %B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale %D 2017 %T Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale %A Anne Benoit %A Franck Cappello %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %X This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size.
We also report a set of extensive simulation results that corroborate the analytical model. %B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale %I ACM %C Washington, DC %8 2017-06 %G eng %R 10.1145/3086157.3086162 %0 Journal Article %J International Journal of High Performance Computing and Networking %D 2017 %T A Look Back on 30 Years of the Gordon Bell Prize %A Gordon Bell %A David Bailey %A Alan H. Karp %A Jack Dongarra %A Kevin Walsh %K benchmarks %K Computational Science %K Gordon Bell Prize %K High Performance Computing %K HPC Cost-Performance %K HPC Progress %K HPC Recognition %K HPC special hardware %K HPC Award %K HPC Prize %K Technical Computing %X The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field. %B International Journal of High Performance Computing and Networking %V 31 %P 469–484 %G eng %U http://journals.sagepub.com/doi/10.1177/1094342017738610 %N 6 %0 Generic %D 2017 %T MAGMA-sparse Interface Design Whitepaper %A Hartwig Anzt %A Erik Boman %A Jack Dongarra %A Goran Flegar %A Mark Gates %A Mike Heroux %A Mark Hoemmen %A Jakub Kurzak %A Piotr Luszczek %A Sivasankaran Rajamanickam %A Stanimire Tomov %A Stephen Wood %A Ichitaro Yamazaki %X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as a third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as a static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project. %B Innovative Computing Laboratory Technical Report %8 2017-09 %G eng %9 Technical Report %0 Conference Paper %B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale %D 2017 %T Optimal Checkpointing Period with replicated execution on heterogeneous platforms %A Anne Benoit %A Aurelien Cavelan %A Valentin Le Fèvre %A Yves Robert %X In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try and execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint.
We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close. %B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale %I IEEE Computer Society Press %C Washington, DC %8 2017-06 %G eng %R 10.1145/3086157.3086165 %0 Book Section %B Exascale Scientific Applications: Scalability and Performance Portability %D 2017 %T Performance Analysis and Debugging Tools at Scale %A Scott Parker %A John Mellor-Crummey %A Dong H. Ahn %A Heike Jagode %A Holger Brunst %A Sameer Shende %A Allen D. Malony %A David DelSignore %A Ronny Tschuter %A Ralph Castain %A Kevin Harms %A Philip Carns %A Ray Loy %A Kalyan Kumaran %X This chapter explores present-day challenges and those likely to arise as new hardware and software technologies are introduced on the path to exascale. It covers some of the underlying hardware, software, and techniques that enable tools and debuggers. Performance tools and debuggers are critical components that enable computational scientists to fully exploit the computing power of high-performance computing systems. Instrumentation is the insertion of code to perform measurement in a program. It is a vital step in performance analysis, especially for parallel programs. The essence of a debugging tool is enabling observation, exploration, and control of program state, such that a developer can, for example, verify that what is currently occurring correlates to what is intended. The increased complexity and volume of performance and debugging data likely to be seen on exascale systems risks overwhelming tool users. Tools and debuggers may need to develop advanced techniques such as automated filtering and analysis to reduce the complexity seen by the user. %B Exascale Scientific Applications: Scalability and Performance Portability %I Chapman & Hall / CRC Press %P 17-50 %8 2017-11 %@ 9781315277400 %G eng %& 2 %R 10.1201/b21930 %0 Generic %D 2017 %T PLASMA 17 Performance Report %A Maksims Abalenkovs %A Negin Bagherpour %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Samuel Relton %A Jakub Sistek %A David Stevens %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Mawussi Zounon %X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
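Illustrative sketch for the PLASMA entries here: libraries in this family express dense factorizations as "tile algorithms" over small square blocks, so that a scheduler can run independent tile updates in parallel. The NumPy fragment below is a schematic stand-in (not PLASMA's actual interface; the helper names are this sketch's own) showing the tile-level structure of a Cholesky factorization with the classic POTRF/TRSM/SYRK/GEMM tile kernels.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tiled_cholesky(A, bs):
        """In-place tiled Cholesky of an SPD matrix; only the lower triangle is used."""
        n = A.shape[0] // bs
        tile = lambda i, j: A[i*bs:(i+1)*bs, j*bs:(j+1)*bs]   # view of tile (i, j)
        for k in range(n):
            tile(k, k)[:] = cholesky(tile(k, k), lower=True)              # POTRF
            for i in range(k + 1, n):                                     # panel solves
                tile(i, k)[:] = solve_triangular(tile(k, k), tile(i, k).T,
                                                 lower=True).T            # TRSM
            for i in range(k + 1, n):                                     # trailing update
                tile(i, i)[:] -= tile(i, k) @ tile(i, k).T                # SYRK
                for j in range(k + 1, i):
                    tile(i, j)[:] -= tile(i, k) @ tile(j, k).T            # GEMM
        return A

Each tile update depends only on a few other tiles, which is what lets a task scheduler overlap the panel and trailing-update work across cores instead of synchronizing after every step.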
%B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-06 %G eng %0 Generic %D 2017 %T PLASMA 17.1 Functionality Report %A Maksims Abalenkovs %A Negin Bagherpour %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Samuel Relton %A Jakub Sistek %A David Stevens %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Mawussi Zounon %X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-06 %G eng %0 Conference Proceedings %B Proceedings of the 24th European MPI Users' Group Meeting %D 2017 %T PMIx: Process Management for Exascale Environments %A Castain, Ralph H. %A David Solt %A Joshua Hursey %A Aurelien Bouteiller %X High-Performance Computing (HPC) applications have historically executed in static resource allocations, using programming models that ran independently from the resident system management stack (SMS). Achieving exascale performance that is both cost-effective and fits within site-level environmental constraints will, however, require that the application and SMS collaboratively orchestrate the flow of work to optimize resource utilization and compensate for on-the-fly faults. The Process Management Interface - Exascale (PMIx) community is committed to establishing scalable workflow orchestration by defining an abstract set of interfaces by which not only applications and tools can interact with the resident SMS, but also the various SMS components can interact with each other. This paper presents a high-level overview of the goals and current state of the PMIx standard, and lays out a roadmap for future directions. %B Proceedings of the 24th European MPI Users' Group Meeting %S EuroMPI '17 %I ACM %C New York, NY, USA %P 14:1–14:10 %@ 978-1-4503-4849-2 %G eng %U http://doi.acm.org/10.1145/3127024.3127027 %R 10.1145/3127024.3127027 %0 Journal Article %J International Journal of High Performance Computing Applications (IJHPCA) %D 2017 %T Resilient Co-Scheduling of Malleable Applications %A Anne Benoit %A Loïc Pottier %A Yves Robert %K co-scheduling %K complexity results %K heuristics %K Redistribution %K resilience %K simulations %X Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications and significantly degrade performance. In this article, we aim at minimizing the expected completion time of a set of co-scheduled applications. We propose to redistribute the resources assigned to each application upon the occurrence of failures, and upon the completion of some applications, in order to achieve this goal. First, we introduce a formal model and establish complexity results. The problem is NP-complete for malleable applications, even in a fault-free context.
Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics. %B International Journal of High Performance Computing Applications (IJHPCA) %8 2017-05 %G eng %R 10.1177/1094342017704979 %0 Generic %D 2017 %T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale %A Ahmad Abdelfattah %A Hartwig Anzt %A Aurelien Bouteiller %A Anthony Danalis %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Stephen Wood %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %9 SLATE Working Notes %1 01 %0 Generic %D 2017 %T Small Tensor Operations on Advanced Architectures for High-Order Applications %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %B University of Tennessee Computer Science Technical Report %I Innovative Computing Laboratory, University of Tennessee %8 2017-04 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2017 %T Solving Dense Symmetric Indefinite Systems using GPUs %A Marc Baboulin %A Jack Dongarra %A Adrien Remy %A Stanimire Tomov %A Ichitaro Yamazaki %X This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent synchronizations and irregular data accesses. Until recently, there has not been any implementation of these algorithms on a hybrid CPU/GPU architecture. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive data transfer and synchronization between the CPU and GPU, or on the GPU (e.g., factorizing the matrix entirely on the GPU or in a communication-avoiding fashion). We also study the performance of the solver using iterative refinement along with the factorization without pivoting combined with the preprocessing technique based on random butterfly transformations, or with the mixed-precision algorithm where the matrix is factorized in single precision. This randomization algorithm only has a probabilistic proof of numerical stability, and for this paper we focused only on the mixed-precision algorithm without pivoting. However, these approaches demonstrate that we can obtain good performance on the GPU by avoiding pivoting and by using lower-precision arithmetic, respectively. As illustrated with the application in acoustics studied in this paper, in many practical cases, the matrices can be factorized without pivoting. Because the componentwise backward error computed in the iterative refinement signals when the algorithm failed to obtain the desired accuracy, the user can use these potentially unstable but efficient algorithms in most cases and fall back to a more stable algorithm with pivoting only in case of failure.
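Illustrative sketch of the fallback logic described in the entry above, as a hypothetical CPU stand-in: SciPy's pivoted LU replaces the paper's no-pivoting GPU factorization, and all function names here are this sketch's own. The idea is to factor once in single precision, refine the solution in double precision, monitor the componentwise backward error, and let the caller fall back to a stable pivoted solver if refinement does not converge.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def solve_mixed_precision(A, b, max_iter=10):
        """Factor in float32, refine in float64; returns (x, converged)."""
        fac = lu_factor(A.astype(np.float32))            # cheap low-precision factorization
        x = lu_solve(fac, b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x                                # residual in full precision
            # componentwise backward error, the failure signal mentioned above
            berr = np.max(np.abs(r) / (np.abs(A) @ np.abs(x) + np.abs(b)))
            if berr <= 4 * np.finfo(np.float64).eps:
                return x, True
            x = x + lu_solve(fac, r.astype(np.float32)).astype(np.float64)
        return x, False

    # Caller-side fallback to a stable, fully pivoted solve on failure:
    # x, ok = solve_mixed_precision(A, b)
    # if not ok:
    #     x = np.linalg.solve(A, b)

The backward-error test is what makes the "potentially unstable but efficient" fast path safe to try first: a failed refinement is detected, not silently accepted.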
%B Concurrency and Computation: Practice and Experience %V 29 %8 2017-03 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/cpe.4055/full %N 9 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.4055 %0 Journal Article %J IEEE Transactions on Computers %D 2017 %T Towards Optimal Multi-Level Checkpointing %A Anne Benoit %A Aurelien Cavelan %A Valentin Le Fèvre %A Yves Robert %A Hongyang Sun %K checkpointing %K Dynamic programming %K Error analysis %K Heuristic algorithms %K Optimized production technology %K protocols %K Shape %B IEEE Transactions on Computers %V 66 %P 1212–1226 %8 2017-07 %G eng %N 7 %R 10.1109/TC.2016.2643660 %0 Conference Paper %B EuroMPI %D 2017 %T Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information %A David Eberius %A Thananon Patinyasakdikul %A George Bosilca %K MPI %K Performance Counters %K Profiling %K Tools %X This paper details the implementation and usage of software-based performance counters to understand the performance of a particular implementation of the MPI standard, Open MPI. Such counters can expose intrinsic features of the software stack that are not available otherwise in a generic and portable way. The PMPI interface is useful for instrumenting MPI applications at the user level; however, it is insufficient for providing meaningful internal MPI performance details. While the Peruse interface provides more detailed information on state changes within Open MPI, it has not seen widespread adoption. We introduce a simple low-level approach that instruments the Open MPI code at key locations to provide fine-grained MPI performance metrics. We evaluate the overhead associated with adding these counters to Open MPI as well as their use in determining bottlenecks and areas for improvement both in user code and the MPI implementation itself. %B EuroMPI %I ACM %C Chicago, IL %8 2017-09 %@ 978-1-4503-4849-2/17/09 %G eng %U https://dl.acm.org/citation.cfm?id=3127024 %R 10.1145/3127024.3127039 %0 Journal Article %J VECPAR %D 2016 %T Accelerating the Conjugate Gradient Algorithm with GPU in CFD Simulations %A Hartwig Anzt %A Marc Baboulin %A Jack Dongarra %A Yvan Fournier %A Frank Hulsemann %A Amal Khabou %A Yushan Wang %X This paper illustrates how GPU computing can be used to accelerate computational fluid dynamics (CFD) simulations. For sparse linear systems arising from finite volume discretization, we evaluate and optimize the performance of Conjugate Gradient (CG) routines designed for manycore accelerators and compare against an industrial CPU-based implementation. We also investigate how the recent advances in preconditioning, such as iterative Incomplete Cholesky (IC, the symmetric case of ILU) preconditioning, match the requirements for solving real world problems. %B VECPAR %G eng %U http://hgpu.org/?p=16264 %0 Journal Article %J ACM Transactions on Parallel Computing %D 2016 %T Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors %A Anne Benoit %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %K checkpoint %K fail-stop error %K failure %K HPC %K resilience %K silent data corruption %K silent error %K verification %X In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption.
For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). We determine in this scenario the optimal checkpointing and verification locations, as well as the optimal speed pairs. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds. %B ACM Transactions on Parallel Computing %8 2016-08 %G eng %R 10.1145/2897189 %0 Journal Article %J Parallel Computing %D 2016 %T Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results %A Julien Herrmann %A George Bosilca %A Thomas Herault %A Loris Marchal %A Yves Robert %A Jack Dongarra %K Data partition %K linear algebra %K parsec %K QR factorization %K Redistribution %K Stencil %X The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimize some objective for the algorithmic kernel under study (good computational balance or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the choice of a distribution minimizing the target objective is not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal, and the operation remains as efficient. This paper studies the complexity of this generalized problem. We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show the NP-hardness of the problem to find the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms is conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine.
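Illustrative sketch for the checkpointing entries above (e.g., "Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors"): to first order, a fail-stop error wastes half a period on average, while a silent error, caught only by the verification at the end of the period, wastes the whole period. Balancing this loss against the per-period cost C + V of a checkpoint plus a verification yields an optimal period near sqrt((C + V) / (1/(2*mu_f) + 1/mu_s)), which with no silent errors and no verification reduces to the Young/Daly formula sqrt(2*C*mu_f). The fragment below encodes only this first-order approximation, not the exact expressions derived in the papers.

    import math

    def optimal_period(mu_fail, mu_silent, C, V):
        """First-order optimal amount of work W between checkpoints.
        mu_fail, mu_silent: MTBF for fail-stop and silent errors (seconds);
        C: checkpoint cost, V: verification cost (seconds)."""
        loss_rate = 0.5 / mu_fail + 1.0 / mu_silent   # expected loss per second of work
        return math.sqrt((C + V) / loss_rate)

    # Young/Daly as the special case without silent errors or verification:
    # optimal_period(mu, float("inf"), C, 0.0) == math.sqrt(2 * C * mu)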
%B Parallel Computing %V 52 %P 22-41 %8 2016-02 %G eng %R 10.1016/j.parco.2015.09.005 %0 Generic %D 2016 %T Context Identifier Allocation in Open MPI %A George Bosilca %A Thomas Herault %A Jack Dongarra %X The concept of communicators is a central notion in the Message Passing Interface, allowing on one side the MPI implementation to specialize its matching and deliver messages in the right context, and on the other side the library developers to contextualize their message exchanges, and scope different algorithms to well-defined groups of processes. More precisely, all communication objects in MPI are derived from a communicator at some point. All MPI functions allowing the creation of new communicators have a collective meaning, either over the group of processes from the parent communicator or those from the target communicator. Thus, the performance of the communicator creation is tied to the collective communication performance, as well as the amount of data needed to be exchanged in order to consistently create this new communicator. We introduce several communicator creation algorithms, and present their implementation in the context of Open MPI. We explore the performance of these new algorithms and compare them with state-of-the-art algorithms available in other MPI implementations. %B University of Tennessee Computer Science Technical Report %I Innovative Computing Laboratory, University of Tennessee %8 2016-01 %G eng %0 Book Section %B Lecture Notes in Computer Science %D 2016 %T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures %A Marc Baboulin %A Jack Dongarra %A Adrien Remy %A Stanimire Tomov %A Ichitaro Yamazaki %E Roman Wyrzykowski %E Ewa Deelman %E Konrad Karczewski %E Jacek Kitowski %E Kazimierz Wiatr %K Communication-avoiding %K Dense symmetric indefinite factorization %K gpu computation %K randomization %X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen’s algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with the preprocessing technique based on Random Butterfly Transformations. Though such transformations only have probabilistic results on the numerical stability, they avoid the pivoting and obtain a great performance on the GPU. %B Lecture Notes in Computer Science %S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015.
Revised Selected Papers, Part I %I Springer International Publishing %V 9573 %P 86-95 %8 2015-09 %@ 978-3-319-32149-3 %G eng %& Parallel Processing and Applied Mathematics %R 10.1007/978-3-319-32149-3_9 %0 Conference Proceedings %B Software for Exascale Computing - SPPEXA %D 2016 %T Domain Overlap for Iterative Sparse Triangular Solves on GPUs %A Hartwig Anzt %A Edmond Chow %A Daniel Szyld %A Jack Dongarra %E Hans-Joachim Bungartz %E Philipp Neumann %E Wolfgang E. Nagel %X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed (i.e., more subdomains and threads than cores), there is a preference for processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution. %B Software for Exascale Computing - SPPEXA %S Lecture Notes in Computer Science and Engineering %I Springer International Publishing %V 113 %P 527–545 %8 2016-09 %G eng %R 10.1007/978-3-319-40528-5_24 %0 Conference Proceedings %B Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16) %D 2016 %T Failure Detection and Propagation in HPC Systems %A George Bosilca %A Aurelien Bouteiller %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Pierre Sens %A Jack Dongarra %K failure detection %K fault-tolerance %K MPI %B Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16) %I IEEE Press %C Salt Lake City, Utah %P 27:1-27:11 %8 2016-11 %@ 978-1-4673-8815-3 %G eng %U http://dl.acm.org/citation.cfm?id=3014904.3014941 %0 Conference Paper %B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16) %D 2016 %T GPU-Aware Non-contiguous Data Movement In Open MPI %A Wei Wu %A George Bosilca %A Rolf vandeVaart %A Sylvain Jeaugey %A Jack Dongarra %K datatype %K gpu %K hybrid architecture %K MPI %K non-contiguous data %X

Due to their better parallel density and power efficiency, GPUs have become increasingly popular in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, which negatively impacts overall application performance.

To address this shortcoming, we present a solution that takes advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture, and GPUDirect capabilities. In this design, the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library, we demonstrate significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
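To make the pack/unpack idea concrete, here is a minimal Python sketch of what a datatype engine does for one simple non-contiguous layout (a strided matrix column); numpy arrays stand in for device buffers, and the function names are illustrative, not Open MPI APIs.

import numpy as np

def pack_column(matrix, col):
    # Gather one strided column into a contiguous buffer for the wire.
    return np.ascontiguousarray(matrix[:, col])

def unpack_column(matrix, col, buffer):
    # Scatter the contiguous receive buffer back into the strided column.
    matrix[:, col] = buffer

A = np.arange(16.0).reshape(4, 4)   # sender-side data
B = np.zeros_like(A)                # receiver-side data
wire = pack_column(A, 2)            # contiguous on the wire
unpack_column(B, 2, wire)
assert np.array_equal(B[:, 2], A[:, 2])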

%B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16) %I ACM %C Kyoto, Japan %8 2016-06 %G eng %R 10.1145/2907294.2907317 %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %D 2016 %T Heterogeneous Streaming %A Chris J. Newburn %A Gaurav Bansal %A Michael Wood %A Luis Crivelli %A Judit Planas %A Alejandro Duran %A Paulo Souza %A Leonardo Borges %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %A Hartwig Anzt %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Ichitaro Yamazaki %A Jesus Labarta %K plasma %X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application. %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions, plus application-specific optimizations, we demonstrate close-to-peak performance results. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (compared to not using it), due to the possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs.
Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Generic %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %X We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions, plus application-specific optimizations, we demonstrate close-to-peak performance results. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (compared to not using it), due to the possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-01 %G eng %0 Conference Paper %B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2016 %T Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors %A Anne Benoit %A Aurelien Cavelan %A Yves Robert %A Hongyang Sun %K fail-stop errors %K multilevel checkpoint %K optimal pattern %K resilience %K silent errors %K verification %X This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors, and many others deal with silent errors (or silent data corruptions), but very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.
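The kind of pattern analyzed above is easy to explore numerically. The toy Monte Carlo below (in Python; all costs and rates are illustrative, and this is not the paper's simulator) estimates the wall-clock time to commit one verified, checkpointed segment of work when both fail-stop and silent errors strike with exponential inter-arrival times.

import random

def run_pattern(W, V=5.0, C=10.0, R=10.0, mtbf_fs=1000.0, mtbf_se=1000.0):
    """Time to commit one segment of W work units, with verification V,
    checkpoint C, and recovery R, under fail-stop and silent errors."""
    t = 0.0
    while True:
        fs = random.expovariate(1.0 / mtbf_fs)    # next fail-stop error
        if fs < W + V + C:                        # crash: lose the segment
            t += fs + R
            continue
        hit = random.expovariate(1.0 / mtbf_se) < W
        t += W + V                                # do the work, then verify
        if hit:                                   # verification catches the
            t += R                                # silent error: roll back
            continue
        return t + C                              # checkpoint commits

avg = sum(run_pattern(100.0) for _ in range(10000)) / 10000
print("expected time per committed segment:", round(avg, 1))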
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Chicago, IL %8 2016-05 %G eng %R 10.1109/IPDPS.2016.39 %0 Conference Paper %B 2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) %D 2016 %T Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations %A Azzam Haidar %A Benjamin Brock %A Stanimire Tomov %A Michael Guidry %A Jay Jay Billings %A Daniel Shyles %A Jack Dongarra %X We demonstrate the systematic implementation of recently developed fast explicit kinetic integration algorithms that efficiently solve N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications, we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that, as in many other scientific applications ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, which we call a batched routine, in order to saturate the hardware with enough work. %B 2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) %I IEEE %C Waltham, MA %8 2016-09 %G eng %0 Book Section %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %E Julian M. Kunkel %E Pavan Balaji %E Jack Dongarra %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high-performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher-level LAPACK routines, and in scientific computing applications in general. This paper presents a high-performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
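For readers unfamiliar with the interface shape that a batched GEMM implies, the minimal Python sketch below shows the idea: many independent small matrix products issued as one call rather than a loop of calls. numpy's stacked matmul plays the role of the GPU batched kernel here, and the sizes are illustrative.

import numpy as np

batch, m, n, k = 1000, 8, 8, 8            # many small fixed-size problems
A = np.random.rand(batch, m, k)
B = np.random.rand(batch, k, n)
C = np.matmul(A, B)                       # one "batched" call, no Python loop

# Reference check against the per-matrix loop the batched kernel replaces.
C_ref = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_ref)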
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %I Springer International Publishing %P 21–38 %@ 978-3-319-41321-1 %G eng %U http://dx.doi.org/10.1007/978-3-319-41321-1_2 %R 10.1007/978-3-319-41321-1_2 %0 Journal Article %J International Journal of Networking and Computing %D 2016 %T Scheduling Computational Workflows on Failure-prone Platforms %A Guillaume Aupy %A Anne Benoit %A Henri Casanova %A Yves Robert %K checkpointing %K fault-tolerance %K reliability %K scheduling %K workflow %X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete for join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations. %B International Journal of Networking and Computing %V 6 %P 2-26 %G eng %0 Conference Proceedings %B OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments %D 2016 %T Surviving Errors with OpenSHMEM %A Aurelien Bouteiller %A George Bosilca %A Manjunath Gorentla Venkata %E Manjunath Gorentla Venkata %E Imam, Neena %E Pophale, Swaroop %E Mintz, Tiffany M. %X Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, or program errors. As the scale of HPC systems continues to grow, so does the probability of encountering a condition that causes a failure; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are beginning to feature advanced error management. As a result of these developments, it becomes increasingly desirable to gracefully handle error conditions in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that can (1) notify user code of unexpected erroneous conditions, (2) permit customized user response to errors without incurring overhead on an error-free execution path, (3) propagate the occurrence of an error condition to all Processing Elements, and (4) consistently close the erroneous epoch in order to resume the application. %B OpenSHMEM and Related Technologies.
Enhancing OpenSHMEM for Hybrid Environments %I Springer International Publishing %C Baltimore, MD, USA %P 66–81 %@ 978-3-319-50995-2 %G eng %0 Conference Paper %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %D 2015 %T Accelerating NWChem Coupled Cluster through dataflow-based Execution %A Heike Jagode %A Anthony Danalis %A George Bosilca %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster (CC) methods of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort to convert NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller), a software package designed to enable high performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines integrates seamlessly into the NWChem codebase. Furthermore, we explain how the CC algorithms can be easily decomposed into finer-grained tasks (compared to the original version of NWChem), and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %I Springer International Publishing %C Krakow, Poland %8 2015-09 %G eng %0 Journal Article %J ACM Transactions on Parallel Computing %D 2015 %T Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Peng Du %A Jack Dongarra %E Phillip B. Gibbons %K ABFT %K algorithms %K fault-tolerance %K High Performance Computing %K linear algebra %X Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, for all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection.
The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely with the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. The ability to tolerate multiple failures, and the accuracy after multiple recoveries, are also considered. %B ACM Transactions on Parallel Computing %V 1 %P 10:1-10:28 %8 2015-01 %G eng %N 2 %R 10.1145/2686892 %0 Journal Article %J International Journal of Networking and Computing %D 2015 %T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K model %K performance evaluation %K resilience %X Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol, which alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming algorithm to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint. %B International Journal of Networking and Computing %V 5 %P 2-15 %8 2015-01 %G eng %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A George Bosilca %A Thomas Herault %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms.
The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Journal Article %J International Journal on High Performance Computing Applications %D 2015 %T Efficient Checkpoint/Verification Patterns %A Anne Benoit %A Saurabh K. Raina %A Yves Robert %K checkpointing %K Fault tolerance %K High Performance Computing %K silent data corruption %K silent error %K verification %X Errors have become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large compared to the other parameters (checkpointing, verification, and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%. %B International Journal on High Performance Computing Applications %8 2015-07 %G eng %R 10.1177/1094342015594531 %0 Conference Proceedings %B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies %D 2015 %T From MPI to OpenSHMEM: Porting LAMMPS %A Tang, Chunyan %A Aurelien Bouteiller %A Thomas Herault %A Manjunath Gorentla Venkata %A George Bosilca %E Manjunath Gorentla Venkata %E Shamis, Pavel %E Imam, Neena %E M. Graham Lopez %X This work details the opportunities and challenges of porting a Petascale, MPI-based application, LAMMPS, to OpenSHMEM. We investigate the major programming challenges stemming from the differences in communication semantics, address space organization, and synchronization operations between the two programming models. This work provides several approaches to solve those challenges for representative communication patterns in LAMMPS, e.g., by considering group synchronization, peer buffer status tracking, and unpacked direct transfer of scattered data. The performance of LAMMPS is evaluated on the Titan HPC system at ORNL.
The OpenSHMEM implementations are compared with MPI versions in terms of both strong and weak scaling. The results show that OpenSHMEM provides rich enough semantics to implement scalable scientific applications. In addition, the experiments demonstrate that OpenSHMEM can compete with, and often improve on, the optimized MPI implementation. %B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies %I Springer International Publishing %C Annapolis, MD, USA %P 121–137 %@ 978-3-319-26428-8 %G eng %R 10.1007/978-3-319-26428-8_8 %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Hierarchical DAG scheduling for Hybrid Distributed Systems %A Wei Wu %A Aurelien Bouteiller %A George Bosilca %A Mathieu Faverge %A Jack Dongarra %K dense linear algebra %K gpu %K heterogeneous architecture %K PaRSEC runtime %X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e., Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments. %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Conference Paper %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2015 %T Mixed-precision Block Gram Schmidt Orthogonalization %A Ichitaro Yamazaki %A Stanimire Tomov %A Jakub Kurzak %A Jack Dongarra %A Jesse Barlow %X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR.
In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order. %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra %D 2015 %T Mixed-Precision Orthogonalization Process Performance on Multicore CPUs with GPUs %A Ichitaro Yamazaki %A Jesse Barlow %A Stanimire Tomov %A Jakub Kurzak %A Jack Dongarra %X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches. %B 2015 SIAM Conference on Applied Linear Algebra %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Conference Paper %B 2015 IEEE International Conference on Cluster Computing %D 2015 %T PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution %A Anthony Danalis %A Heike Jagode %A George Bosilca %A Jack Dongarra %K dag %K parsec %K ptg %K tasks %X Task-based execution has been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. The Parallel Runtime Scheduling and Execution Control (PARSEC) framework is a task-based runtime system that we designed to achieve high performance computing at scale. PARSEC offers a programming paradigm that is different from what has traditionally been used to develop large scale parallel scientific applications. In this paper, we discuss the use of PARSEC to convert a part of the Coupled Cluster (CC) component of the Quantum Chemistry package NWCHEM into a task-based form. We explain how we organized the computation of the CC methods into individual tasks with explicitly defined data dependencies between them and re-integrated the modified code into NWCHEM. We present a thorough performance evaluation and demonstrate that the modified code outperforms the original by more than a factor of two. We also compare the performance of different variants of the modified code and explain the different behaviors that lead to the differences in performance. %B 2015 IEEE International Conference on Cluster Computing %I IEEE %C Chicago, IL %8 2015-09 %G eng %0 Conference Paper %B 22nd European MPI Users' Group Meeting %D 2015 %T Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %X Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation, of the Revoke MPI operation. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user.
We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods. %B 22nd European MPI Users' Group Meeting %I ACM %C Bordeaux, France %8 2015-09 %G eng %R 10.1145/2802658.2802668 %0 Generic %D 2015 %T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Marc Gamell %A Keita Teranishi %A Manish Parashar %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %8 2015-04 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Marc Gamell %A Keita Teranishi %A Manish Parashar %A Jack Dongarra %X The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm that targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault-tolerant scientific applications. %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Generic %D 2015 %T Towards a High-Performance Tensor Algebra Package for Accelerators %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %I Smoky Mountains Computational Sciences and Engineering Conference (SMC15) %C Gatlinburg, TN %8 2015-09 %G eng %0 Conference Proceedings %B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects %D 2015 %T UCX: An Open Source Framework for HPC Network APIs and Beyond %A P. Shamis %A Manjunath Gorentla Venkata %A M. Graham Lopez %A M. B. Baker %A O. Hernandez %A Y. Itigin %A M. Dubman %A G. Shainer %A R. L. Graham %A L. Liss %A Y. Shahar %A S. Potluri %A D. Rossetti %A D. Becker %A D. Poole %A C. Lamb %A S. Kumar %A C.
Stunkel %A George Bosilca %A Aurelien Bouteiller %K application program interfaces %K Bandwidth %K Electronics packaging %K Hardware %K high throughput computing %K highly-scalable network stack %K HPC %K HPC network APIs %K I/O bound applications %K Infiniband %K input-output programs %K Libraries %K Memory management %K message passing %K message passing interface %K Middleware %K MPI %K open source framework %K OpenSHMEM %K parallel programming %K parallel programming models %K partitioned global address space languages %K PGAS %K PGAS languages %K Programming %K protocols %K public domain software %K RDMA %K system libraries %K task-based paradigms %K UCX %K Unified Communication X %X This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs satisfying the networking needs of many programming models, such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms, and I/O bound applications. To evaluate the design, we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any (publicly known) network stack on this hardware. %B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects %I IEEE %C Santa Clara, CA, USA %P 40-43 %8 2015-08 %@ 978-1-4673-9160-3 %G eng %M 15573048 %R 10.1109/HOTI.2015.13 %0 Conference Paper %B Euro-Par 2014 %D 2014 %T Assembly Operations for Multicore Architectures using Task-Based Runtime Systems %A Damien Genet %A Abdou Guermouche %A George Bosilca %X Traditionally, numerical simulations based on finite element methods consider the algorithm as being divided into three major steps: the generation of a set of blocks and vectors, the assembly of these blocks into a matrix and a large vector, and the inversion of the matrix. In this paper we tackle the second step, the block assembly, for which no parallel algorithm is widely available. Several strategies are proposed to decompose the assembly problem while relying on a scheduling middleware to maximize the overlap between stages and increase the parallelism, and thus the performance. These strategies are quantified using examples covering two extremes in the field: a large number of non-overlapping small blocks for CFD-like problems, and a smaller number of larger blocks with significant overlap, as found in sparse linear algebra solvers.
%B Euro-Par 2014 %I Springer International Publishing %C Porto, Portugal %8 2014-08 %G eng %0 Conference Paper %B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014 %D 2014 %T Assessing the Impact of ABFT and Checkpoint Composite Strategies %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K resilience %X Algorithm-specific fault-tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing the means to take checkpoints less frequently while simultaneously decreasing the volume of data to be checkpointed. %B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Journal Article %J SIAM Journal on Matrix Analysis and Applications %D 2014 %T Communication-Avoiding Symmetric-Indefinite Factorization %A Grey Ballard %A Dulceneia Becker %A James Demmel %A Jack Dongarra %A Alex Druinsky %A I Peled %A Oded Schwartz %A Sivan Toledo %A Ichitaro Yamazaki %K plasma %X We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen’s triangular tridiagonalization. It factors a dense symmetric matrix A as the product A = P L T L^T P^T, where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.
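To spell out how the factorization above is used (a sketch of the standard solve sequence implied by the factorization, written out here for the reader rather than quoted from the paper): once A = P L T L^T P^T is available, a linear solve reduces to permutations, two triangular solves, and one banded (block tridiagonal) solve,

A x = b, \qquad A = P\,L\,T\,L^{T}P^{T}
\;\Longrightarrow\;
x = P\,L^{-T}\,T^{-1}\,L^{-1}\,P^{T} b ,

read right to left: permute b by P^T, solve with L, solve the banded system with T, solve with L^T, and permute back by P.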
%B SIAM Journal on Matrix Analysis and Applications %V 35 %P 1364-1406 %8 2014-07 %G eng %N 4 %0 Conference Paper %B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS) %D 2014 %T Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems %A Marc Baboulin %A Jack Dongarra %A Remi Lacroix %X This paper presents an efficient computation of least squares conditioning, or estimates of it. We propose performance results using new routines on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix, for which, to our knowledge, there is no implementation in the current public domain libraries LAPACK and ScaLAPACK. %B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS) %C Waterloo, Ontario, CA %8 2014-08 %G eng %0 Generic %D 2014 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B ICL Technical Report %I University of Tennessee %8 2014-11 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14) %D 2014 %T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster %A Ichitaro Yamazaki %A Sivasankaran Rajamanickam %A Eric G. Boman %A Mark Hoemmen %A Michael A. Heroux %A Stanimire Tomov %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14) %I IEEE %C New Orleans, LA %8 2014-11 %G eng %0 Generic %D 2014 %T Efficient checkpoint/verification patterns for silent error detection %A Anne Benoit %A Yves Robert %A Saurabh K. Raina %X Resilience has become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their particularity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput.
We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large compared to the other parameters (checkpointing, verification, and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2014-05 %G eng %9 LAWN 287 %0 Journal Article %J Parallel Computing %D 2014 %T An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems %A Marc Baboulin %A Dulceneia Becker %A George Bosilca %A Anthony Danalis %A Jack Dongarra %K Distributed linear algebra solvers %K LDLT factorization %K PaRSEC runtime %K plasma %K Randomized algorithms %K Symmetric indefinite systems %X Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, while requiring only half the execution time and half the amount of data storage of a general dense solver. %B Parallel Computing %V 40 %P 213-223 %8 2014-07 %G eng %N 7 %R 10.1016/j.parco.2013.12.003 %0 Conference Paper %B 8th International Conference on Partitioned Global Address Space Programming Models (PGAS) %D 2014 %T A Multithreaded Communication Substrate for OpenSHMEM %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %X OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an acceptable cost and can improve efficiency for some operations, compared to serializing communication.
%B 8th International Conference on Partitioned Global Address Space Programming Models (PGAS) %C Eugene, OR %8 2014-10 %G eng %0 Conference Paper %B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %D 2014 %T PTG: An Abstraction for Unhindered Parallelism %A Anthony Danalis %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %K dte %K parsec %K plasma %X

Increased parallelism and use of heterogeneous computing resources are now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice has invariably been derived from Coarse Grain Parallelism with explicit data movements. We argue that message passing has remained the de facto standard in HPC because, until now, the ever-increasing challenges that application developers had to address to create efficient portable applications remained manageable for expert programmers.

Data-flow based programming is an alternative approach with significant potential. In this paper, we discuss the Parameterized Task Graph (PTG) abstraction and present the specialized input language that we use to specify PTGs in our data-flow task-based runtime system, PaRSEC. This language and the corresponding execution model stand in contrast to the execution model of explicit message passing, as well as to that of alternative task-based runtime systems. The Parameterized Task Graph language decouples the expression of the parallelism in the algorithm from the control-flow ordering, load balance, and data distribution. Thus, programs are more adaptable and map more efficiently onto challenging hardware, while maintaining portability across diverse architectures. To support these claims, we discuss the different challenges of HPC programming and how PaRSEC can address them, and we demonstrate that on today’s large-scale supercomputers, PaRSEC can significantly outperform state-of-the-art MPI applications and libraries, a trend that will increase with future architectural evolution.
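The key property of a PTG, that dependencies are computed algebraically from task parameters instead of being stored as an unrolled graph, can be conveyed with a tiny Python sketch. This is a schematic illustration only; it is not PaRSEC's JDF input language, and the two task classes form a deliberately simple chain.

N = 4

def deps(task):
    # Dependencies are derived from the task's parameters on the fly.
    kind, k = task
    if kind == "POTRF":                 # POTRF(k) waits on SYRK(k-1) for k > 0
        return [("SYRK", k - 1)] if k > 0 else []
    if kind == "SYRK":                  # SYRK(k) waits on POTRF(k)
        return [("POTRF", k)]
    return []

def run(task, done):
    for d in deps(task):
        if d not in done:
            run(d, done)                # execute dependencies first
    if task not in done:
        print("executing", task)
        done.add(task)

run(("POTRF", N - 1), set())            # pull the whole chain from one task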

%B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %I IEEE Press %C New Orleans, LA %8 2014-11 %G eng %0 Conference Paper %B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014 %D 2014 %T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes %A Xavier Lacoste %A Mathieu Faverge %A Pierre Ramet %A Samuel Thibault %A George Bosilca %K DAG based runtime %K gpu %K Multicore %K Sparse linear solver %X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm's efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers (native PaStiX, StarPU, and PaRSEC) in different execution contexts is performed. The analysis highlights the similarities, from a performance point of view, between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments, and are, therefore, a sustainable solution for hybrid environments. %B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC) %D 2014 %T Task-Based Programming for Seismic Imaging: Preliminary Results %A Lionel Boillot %A George Bosilca %A Emmanuel Agullo %A Henri Calandra %K plasma %X The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it has yet to be demonstrated that industrial parallel codes, which rely on the classical Message Passing Interface (MPI) standard and embody decades of expertise (and countless lines of code), can be revisited and turned into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for Seismic Imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node, and can more efficiently exploit complex hardware such as a cache coherent Non Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator.
%B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC) %I IEEE %C Paris, France %8 2014-08 %G eng %0 Conference Paper %B 2014 IEEE International Conference on Cluster Computing %D 2014 %T Utilizing Dataflow-based Execution for Coupled Cluster Methods %A Heike McCraw %A Anthony Danalis %A George Bosilca %A Jack Dongarra %A Karol Kowalski %A Theresa Windus %X Computational chemistry is one of the driving forces of High Performance Computing. In particular, many-body methods, such as the Coupled Cluster (CC) methods of the quantum chemistry package NWCHEM, are of particular interest for the applied chemistry community. Harnessing large fractions of the processing power of modern large-scale computing platforms has become increasingly difficult. With the increase in scale, complexity, and heterogeneity of modern platforms, traditional programming models fail to deliver the expected performance scalability. On our way to Exascale, and with these extremely hybrid platforms, dataflow-based programming models may be the only viable way for achieving and maintaining computation at scale. In this paper, we discuss a dataflow-based programming model and its applicability to NWCHEM’s CC methods. Our dataflow version of the CC kernels breaks down the algorithm into fine-grained tasks with explicitly defined data dependencies. As a result, many of the traditional synchronization points can be eliminated, allowing for a dynamic reshaping of the execution based on the ongoing availability of computational resources. We build this experiment using PARSEC, a task-based dataflow-driven execution engine that enables efficient task scheduling on distributed systems and provides a desirable portability layer for application developers. %B 2014 IEEE International Conference on Cluster Computing %I IEEE %C Madrid, Spain %8 2014-09 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (also LAWN 246) %D 2013 %T Accelerating Linear System Solutions Using Randomization Techniques %A Marc Baboulin %A Jack Dongarra %A Julien Herrmann %A Stanimire Tomov %K algorithms %K dense linear algebra %K experimentation %K graphics processing units %K linear systems %K lu factorization %K multiplicative preconditioning %K numerical linear algebra %K performance %K plasma %K randomization %X We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at a very affordable computational price while providing satisfactory accuracy when compared to partial pivoting. This random transformation, called the Partial Random Butterfly Transformation (PRBT), is optimized in terms of data storage and flop count. We propose a solver in which PRBT and the LU factorization with no pivoting take advantage of current hybrid multicore/GPU machines, and we compare its Gflop/s performance with a solver implemented in a current parallel library.
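A depth-1 butterfly makes the PRBT idea tangible. The Python sketch below is a toy, not the paper's tuned implementation (which uses depth-2 recursive butterflies and LU with no pivoting): a butterfly B = (1/sqrt(2)) [[R, S], [R, -S]] with random diagonal blocks R and S is applied on both sides of A, and a generic solve stands in for the pivoting-free LU.

import numpy as np

def butterfly(n, rng):
    # B = (1/sqrt(2)) [[R, S], [R, -S]] with random positive diagonals R, S.
    r = np.diag(np.exp(rng.uniform(-0.05, 0.05, n // 2)))
    s = np.diag(np.exp(rng.uniform(-0.05, 0.05, n // 2)))
    return np.block([[r, s], [r, -s]]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
U, V = butterfly(n, rng), butterfly(n, rng)
Ar = U.T @ A @ V                  # randomized matrix; the real solver would
y = np.linalg.solve(Ar, U.T @ b)  # run LU with no pivoting on Ar here
x = V @ y                         # map the solution back
assert np.allclose(A @ x, b)      # same solution as the original system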
%B ACM Transactions on Mathematical Software (also LAWN 246) %V 39 %8 2013-02 %G eng %U http://dl.acm.org/citation.cfm?id=2427025 %N 2 %R 10.1145/2427023.2427025 %0 Generic %D 2013 %T Assessing the impact of ABFT and Checkpoint composite strategies %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K resilience %X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing the means to take checkpoints less frequently while simultaneously decreasing the volume of data that needs to be checkpointed. %B University of Tennessee Computer Science Technical Report %G eng %0 Generic %D 2013 %T On the Combination of Silent Error Detection and Checkpointing %A Guillaume Aupy %A Anne Benoit %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %K checkpointing %K error recovery %K High-performance computing %K silent data corruption %K verification %X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710 %I University of Tennessee Computer Science Technical Report %8 2013-06 %G eng %U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X With our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes. %B Concurrency and Computation: Practice and Experience %V 25 %P 572-585 %8 2013-03 %G eng %N 4 %R 10.1002/cpe.2859 %0 Conference Proceedings %B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2013 %T CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience %A Yulu Jia %A Piotr Luszczek %A George Bosilca %A Jack Dongarra %X Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both modern processors and memory chips. Soft errors manifest themselves as bit-flips that alter user values, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur.
%B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %C Montpellier, France %8 2013-11 %G eng %0 Journal Article %J Scalable Computing and Communications: Theory and Practice %D 2013 %T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %E Samee Khan %E Lin-Wang Wang %E Albert Zomaya %B Scalable Computing and Communications: Theory and Practice %I John Wiley & Sons %P 699-735 %8 2013-03 %G eng %0 Conference Paper %B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems %D 2013 %T Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures %A Volodymyr Turchenko %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K many-core system %K parallel batch pattern training %K parallelization efficiency %K recirculation neural network %X The experimental research of the parallel batch pattern back propagation training algorithm, using the example of a recirculation neural network, on many-core high performance computing systems is presented in this paper. The choice of a recirculation neural network over multilayer perceptron, recurrent, and radial basis neural networks is justified. The model of a recirculation neural network and the usual sequential batch pattern algorithm for its training are described theoretically. An algorithmic description of the parallel version of the batch pattern training method is presented. The experimental research is carried out using the Open MPI, Mvapich, and Intel MPI message passing libraries. The results obtained on a many-core AMD system and on Intel MIC are compared with the results obtained on a cluster system. Our results show that the parallelization efficiency is about 95% on 12 cores located inside one physical AMD processor for the considered minimum and maximum scenarios. The parallelization efficiency is about 70-75% on 48 AMD cores for the minimum and maximum scenarios. These results are higher by 15-36% (depending on the version of the MPI library) in comparison with the results obtained on 48 cores of a cluster system. The parallelization efficiency obtained on the Intel MIC architecture is surprisingly low, calling for deeper analysis. %B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems %C Berlin, Germany %8 2013-09 %G eng %0 Journal Article %J Computing %D 2013 %T An evaluation of User-Level Failure Mitigation support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %K Fault tolerance %K MPI %K User-level fault mitigation %X As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.
%B Computing %V 95 %P 1171-1184 %8 2013-12 %G eng %N 12 %R 10.1007/s00607-013-0331-3 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: (i) traditional checkpoint-based approaches incur a steep overhead on failure-free operations and (ii) the dominant programming paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI Standard, to enable advanced forward recovery techniques, without incurring the overhead of customary periodic checkpointing. With our approach, when failure strikes, applications regain control to make a checkpoint before quitting execution. This checkpoint is in reaction to the failure occurrence rather than periodic. This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example. Published 2013. This article is a US Government work and is in the public domain in the USA. %B Concurrency and Computation: Practice and Experience %8 2013-07 %G eng %U http://doi.wiley.com/10.1002/cpe.3100 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.3100 %0 Journal Article %J IPDPS 2013 (submitted) %D 2013 %T Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures %A Ichitaro Yamazaki %A Dulceneia Becker %A Jack Dongarra %A Alex Druinsky %A I. Peled %A Sivan Toledo %A Grey Ballard %A James Demmel %A Oded Schwartz %X Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming, such as irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen’s algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
%B IPDPS 2013 (submitted) %C Boston, MA %8 2013-00 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K Cluster %K Collective communication %K Hierarchical %K HPC %K MPI %K Multicore %X Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted topology-aware collective framework, and the mechanisms deployed by this framework to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. We demonstrate experimentally, by considering three of the most used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to modifications of the underlying process-core binding; (2) it outperforms state-of-the-art MPI libraries (Open MPI, MPICH2 and MVAPICH2), demonstrating up to a 30x speedup for synthetic benchmarks, and up to a 3x acceleration for a parallel graph application (ASP); (3) it furthermore demonstrates a linear speedup with the increase of the number of cores per compute node, a paramount requirement for scalability on future many-core hardware. %B Journal of Parallel and Distributed Computing %V 73 %P 1000-1010 %8 2013-07 %G eng %U http://www.sciencedirect.com/science/article/pii/S0743731513000166 %N 7 %R 10.1016/j.jpdc.2013.01.015 %0 Book Section %B Handbook of Linear Algebra %D 2013 %T LAPACK %A Zhaojun Bai %A James Demmel %A Jack Dongarra %A Julien Langou %A Jenny Wang %X With a substantial amount of new material, the Handbook of Linear Algebra, Second Edition provides comprehensive coverage of linear algebra concepts, applications, and computational software packages in an easy-to-use format. It guides you from the very elementary aspects of the subject to the frontiers of current research. Along with revisions and updates throughout, the second edition of this bestseller includes 20 new chapters.
%B Handbook of Linear Algebra %7 Second %I CRC Press %C Boca Raton, FL %@ 9781466507289 %G eng %0 Generic %D 2013 %T Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization %A Aurelien Bouteiller %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, per-application completion time under uncoordinated checkpointing is unchanged, while near-perfect platform efficiency is delivered. %B University of Tennessee Computer Science Technical Report %8 2013-02 %G eng %0 Conference Paper %B Euro-Par 2013 %D 2013 %T Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization %A Aurelien Bouteiller %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, per-application completion time under uncoordinated checkpointing is unchanged, while near-perfect platform efficiency is delivered. %B Euro-Par 2013 %I Springer %C Aachen, Germany %8 2013-08 %G eng %0 Journal Article %J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %D 2013 %T Multithreading in the PLASMA Library %A Jakub Kurzak %A Piotr Luszczek %A Asim YarKhan %A Mathieu Faverge %A Julien Langou %A Henricus Bouwmeester %A Jack Dongarra %E Mohamed Ahmed %E Reda Ammar %E Sanguthevar Rajasekaran %K plasma %B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %I Taylor & Francis %8 2013-00 %G eng %0 Generic %D 2013 %T Optimal Checkpointing Period: Time vs.
Energy %A Guillaume Aupy %A Anne Benoit %A Thomas Herault %A Yves Robert %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 281) %I University of Tennessee %8 2013-10 %G eng %0 Conference Paper %B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013 %D 2013 %T Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance %A Yulu Jia %A George Bosilca %A Piotr Luszczek %A Jack Dongarra %X This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases. %B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013 %C Denver, CO %8 2013-11 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS 2013) %D 2013 %T A Parallel Solver for Incompressible Fluid Flows %A Yushan Wang %A Marc Baboulin %A Joël Falcou %A Yann Fraigneau %A Olivier Le Maître %K ADI %K Navier-Stokes equations %K Parallel computing %K Partial diagonalization %K Prediction-projection %K SIMD %X The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlinearity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply the prediction-projection method, which transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we apply the Alternating Direction Implicit (ADI) method, resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization, which transforms the Laplacian operator into a tridiagonal one. We describe an implementation based on MPI where the computations are performed on each subdomain and information is exchanged on the interfaces, and where the tridiagonal system solutions are accelerated using vectorization techniques. We present performance results on a current multicore system. %B International Conference on Computational Science (ICCS 2013) %I Elsevier B.V.
%C Barcelona, Spain %8 2013-06 %G eng %R 10.1016/j.procs.2013.05.207 %0 Journal Article %J IEEE Computing in Science and Engineering %D 2013 %T PaRSEC: Exploiting Heterogeneity to Enhance Scalability %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Thomas Herault %A Jack Dongarra %X New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution environment with sufficient intelligence and flexibility to rearrange the execution to improve resource utilization. %B IEEE Computing in Science and Engineering %V 15 %P 36-45 %8 2013-11 %G eng %N 6 %R 10.1109/MCSE.2013.98 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2013 %T Post-failure recovery of MPI communication capability: Design and rationale %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery. %B International Journal of High Performance Computing Applications %V 27 %P 244 - 254 %8 2013-01 %G eng %U http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238 %N 3 %! International Journal of High Performance Computing Applications %R 10.1177/1094342013488238 %0 Book Section %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %D 2013 %T Scalable Dense Linear Algebra on Heterogeneous Hardware %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X The design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s have forced a dramatic shift in hardware design. Various physical and engineering constraints have resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. It is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios.
In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores. %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %G eng %0 Conference Paper %B 17th IEEE High Performance Extreme Computing Conference (HPEC '13) %D 2013 %T Standards for Graph Algorithm Primitives %A Tim Mattson %A David Bader %A Jon Berry %A Aydin Buluc %A Jack Dongarra %A Christos Faloutsos %A John Feo %A John Gilbert %A Joseph Gonzalez %A Bruce Hendrickson %A Jeremy Kepner %A Charles Leiserson %A Andrew Lumsdaine %A David Padua %A Steve W. Poole %A Steve Reinhardt %A Mike Stonebraker %A Steve Wallach %A Andrew Yoo %K algorithms %K graphs %K linear algebra %K software standards %X It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard. %B 17th IEEE High Performance Extreme Computing Conference (HPEC '13) %I IEEE %C Waltham, MA %8 2013-09 %G eng %R 10.1109/HPEC.2013.6670338 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale %A George Bosilca %A Aurelien Bouteiller %A Elisabeth Brunet %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. %B Concurrency and Computation: Practice and Experience %8 2013-11 %G eng %R 10.1002/cpe.3173 %0 Conference Proceedings %B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012 %D 2012 %T Algorithm-Based Fault Tolerance for Dense Matrix Factorization %A Peng Du %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %E J.
Ramanujam %E P. Sadayappan %K ft-la %K ftmpi %X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We will present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage leftover from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases with the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. %B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012 %I ACM %C New Orleans, LA, USA %P 225-234 %8 2012-02 %G eng %R 10.1145/2145816.2145845 %0 Conference Proceedings %B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award) %D 2012 %T A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Christos Kaklamanis %E Theodore Papatheodorou %E Paul Spirakis %B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award) %I Springer-Verlag %C Rhodes, Greece %8 2012-08 %G eng %0 Conference Proceedings %B Proc. of the International Conference on Computational Science (ICCS) %D 2012 %T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines %A Marc Baboulin %A Simplice Donfack %A Jack Dongarra %A Laura Grigori %A Adrien Remi %A Stanimire Tomov %K magma %B Proc. of the International Conference on Computational Science (ICCS) %V 9 %P 17-26 %8 2012-06 %G eng %0 Journal Article %J Parallel Computing %D 2012 %T DAGuE: A generic distributed DAG Engine for High Performance Computing
%A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %K parsec %B Parallel Computing %I Elsevier %V 38 %P 27-51 %8 2012-00 %G eng %0 Journal Article %J High Performance Scientific Computing: Algorithms and Applications %D 2012 %T Dense Linear Algebra on Accelerated Multicore Hardware %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %E Michael Berry %E et al., %B High Performance Scientific Computing: Algorithms and Applications %I Springer-Verlag %C London, UK %8 2012-00 %G eng %0 Generic %D 2012 %T An efficient distributed randomized solver with application to large dense linear systems %A Marc Baboulin %A Dulceneia Becker %A George Bosilca %A Anthony Danalis %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B ICL Technical Report %8 2012-07 %G eng %0 Conference Proceedings %B 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing %D 2012 %T Enabling Application Resilience With and Without the MPI Standard %A Wesley Bland %B 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing %C Ottawa, Canada %8 2012-05 %G eng %0 Conference Proceedings %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %D 2012 %T An Evaluation of User-Level Failure Mitigation Support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %I Springer %C Vienna, Austria %8 2012-09 %G eng %0 Generic %D 2012 %T Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %K ftmpi %B University of Tennessee Computer Science Technical Report %8 2012-00 %G eng %0 Conference Paper %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %D 2012 %T From Serial Loops to Parallel Execution on Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jack Dongarra %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %C Rhodes, Greece %8 2012-08 %G eng %0 Journal Article %J IPDPS 2012 (Best Paper) %D 2012 %T HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %B IPDPS 2012 (Best Paper) %C Shanghai, China %8 2012-05 %G eng %0 Journal Article %J Supercomputing '12 (poster) %D 2012 %T Matrices Over Runtime Systems at Exascale %A Emmanuel Agullo %A George Bosilca %A Cedric Castagnède %A Jack Dongarra %A Hatem Ltaief %A Stanimire Tomov %B Supercomputing '12 (poster) %C Salt Lake City, Utah %8 2012-11 %G eng %0 Journal Article %J IPDPS 2012 %D 2012 %T A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures %A Marc Baboulin %A Dulceneia Becker %A Jack Dongarra %B IPDPS 2012 %C Shanghai, China %8 2012-05 %G eng %0 Conference Proceedings %B Third International Conference on Energy-Aware High Performance Computing %D 2012 %T Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems %A George Bosilca %A Jack Dongarra %A Hatem Ltaief %B Third International Conference on Energy-Aware High Performance Computing %C Hamburg, Germany %8
2012-09 %G eng %0 Generic %D 2012 %T A Proposal for User-Level Failure Mitigation in the MPI-3 Standard %A Wesley Bland %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %K ftmpi %B University of Tennessee Electrical Engineering and Computer Science Technical Report %I University of Tennessee %8 2012-02 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2012 %T Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012 %E Jesper Larsson Träff %E Siegfried Benkner %E Jack Dongarra %B Lecture Notes in Computer Science %C Vienna, Austria %V 7490 %8 2012-00 %G eng %0 Journal Article %J Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011) %D 2012 %T Reducing the Amount of Pivoting in Symmetric Indefinite Systems %A Dulceneia Becker %A Marc Baboulin %A Jack Dongarra %E Roman Wyrzykowski %E Jack Dongarra %E Konrad Karczewski %E Jerzy Wasniewski %B Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011) %I Springer-Verlag Berlin Heidelberg %V 7203 %P 133-142 %8 2012-00 %G eng %0 Generic %D 2012 %T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale %A George Bosilca %A Aurelien Bouteiller %A Elisabeth Brunet %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %B University of Tennessee Computer Science Technical Report (also LAWN 269) %8 2012-06 %G eng %0 Conference Proceedings %B Euro-Par 2012: Parallel Processing Workshops %D 2012 %T User Level Failure Mitigation in MPI %A Wesley Bland %E Ioannis Caragiannis %E Michael Alexander %E Rosa M. Badia %E Mario Cannataro %E Alexandru Costan %E Marco Danelutto %E Frederic Desprez %E Bettina Krammer %E J. Sahuquillo %E Stephen L. Scott %E J.
Weidendorfer %K ftmpi %B Euro-Par 2012: Parallel Processing Workshops %I Springer Berlin Heidelberg %C Rhodes Island, Greece %V 7640 %P 499-504 %8 2012-08 %G eng %0 Conference Proceedings %B 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May %D 2011 %T 3-D parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver %A Azzam Haidar %A Luc Giraud %A Hafedh Ben-Hadj-Ali %A Florent Sourbier %A Stéphane Operto %A Jean Virieux %B 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May %8 2011-00 %G eng %0 Journal Article %J INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11) %D 2011 %T Accelerating Linear System Solutions Using Randomization Techniques %A Marc Baboulin %A Jack Dongarra %A Julien Herrmann %A Stanimire Tomov %K magma %B INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11) %C Waterloo, Ontario, Canada %8 2011-07 %G eng %0 Generic %D 2011 %T Algorithm-based Fault Tolerance for Dense Matrix Factorizations %A Peng Du %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %D 2011 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Emmanuel Jeannot %E Raymond Namyst %E Jean Roman %K ftmpi %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %I Springer %C Bordeaux, France %V 6853 %P 51-64 %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T DAGuE: A Generic Distributed DAG Engine for High Performance Computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1151-1158 %8 2011-00 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1432-1441 %8 2011-05 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S.
Nikolopoulos %E Jack Dongarra %K dague %B 18th EuroMPI %I Springer %C Santorini, Greece %P 247-254 %8 2011-09 %G eng %0 Journal Article %J International Journal of High Performance Computing %D 2011 %T The International Exascale Software Project Roadmap %A Jack Dongarra %A Pete Beckman %A Terry Moore %A Patrick Aerts %A Giovanni Aloisio %A Jean-Claude Andre %A David Barkai %A Jean-Yves Berthou %A Taisuke Boku %A Bertrand Braunschweig %A Franck Cappello %A Barbara Chapman %A Xuebin Chi %A Alok Choudhary %A Sudip Dosanjh %A Thom Dunning %A Sandro Fiore %A Al Geist %A Bill Gropp %A Robert Harrison %A Mark Hereld %A Michael Heroux %A Adolfy Hoisie %A Koh Hotta %A Zhong Jin %A Yutaka Ishikawa %A Fred Johnson %A Sanjay Kale %A Richard Kenway %A David Keyes %A Bill Kramer %A Jesus Labarta %A Alain Lichnewsky %A Thomas Lippert %A Bob Lucas %A Barney MacCabe %A Satoshi Matsuoka %A Paul Messina %A Peter Michielse %A Bernd Mohr %A Matthias S. Mueller %A Wolfgang E. Nagel %A Hiroshi Nakashima %A Michael E. Papka %A Dan Reed %A Mitsuhisa Sato %A Ed Seidel %A John Shalf %A David Skinner %A Marc Snir %A Thomas Sterling %A Rick Stevens %A Fred Streitz %A Bob Sugar %A Shinji Sumimoto %A William Tang %A John Taylor %A Rajeev Thakur %A Anne Trefethen %A Mateo Valero %A Aad van der Steen %A Jeffrey Vetter %A Peg Williams %A Robert Wisniewski %A Kathy Yelick %X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project. %B International Journal of High Performance Computing %V 25 %P 3-60 %8 2011-01 %G eng %R https://doi.org/10.1177/1094342010391989 %0 Conference Proceedings %B Int'l Conference on Parallel Processing (ICPP '11) %D 2011 %T Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. Squyres %A Jack Dongarra %B Int'l Conference on Parallel Processing (ICPP '11) %C Taipei, Taiwan %8 2011-09 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T OMPIO: A Modular Software Architecture for MPI I/O %A Mohamad Chaarawi %A Edgar Gabriel %A Rainer Keller %A Richard L. Graham %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S.
Nikolopoulos %E Jack Dongarra %B 18th EuroMPI %I Springer %C Santorini, Greece %P 81-89 %8 2011-09 %G eng %0 Conference Paper %B International Conference on Parallel Processing (ICPP'11) %D 2011 %T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs %A Allen D. Malony %A Scott Biersdorff %A Sameer Shende %A Heike Jagode %A Stanimire Tomov %A Guido Juckeland %A Robert Dietrich %A Duncan Poole %A Christopher Lamb %K magma %K mumi %K papi %X The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support. %B International Conference on Parallel Processing (ICPP'11) %I ACM %C Taipei, Taiwan %8 2011-09 %@ 978-0-7695-4510-3 %G eng %R 10.1109/ICPP.2011.71 %0 Generic %D 2011 %T A parallel tiled solver for dense symmetric indefinite systems on multicore architectures %A Marc Baboulin %A Dulceneia Becker %A Jack Dongarra %K plasma %K quark %B University of Tennessee Computer Science Technical Report %8 2011-10 %G eng %0 Journal Article %J IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %D 2011 %T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %K magma %K parsec %B IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %8 2011-06 %G eng %0 Conference Proceedings %B IEEE Int'l Conference on Cluster Computing (Cluster 2011) %D 2011 %T Process Distance-aware Adaptive MPI Collective Communications %A Teng Ma %A Thomas Herault %A George Bosilca %A Jack Dongarra %B IEEE Int'l Conference on Cluster Computing (Cluster 2011) %C Austin, Texas %8 2011-00 %G eng %0 Generic %D 2011 %T Reducing the Amount of Pivoting in Symmetric Indefinite Systems %A Dulceneia Becker %A Marc Baboulin %A Jack Dongarra %B University of Tennessee Innovative Computing Laboratory Technical Report %I Submitted to PPAM 2011 %C Knoxville, TN %8 2011-05 %G eng %0 Generic %D 2011 %T On Scalability for MPI Runtime Systems %A George Bosilca %A Thomas Herault %A A. Rezmerita %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-05 %G eng %0 Conference Proceedings %B International Conference on Cluster Computing (CLUSTER) %D 2011 %T On Scalability for MPI Runtime Systems %A George Bosilca %A Thomas Herault %A A. 
Rezmerita %A Jack Dongarra %K harness %B International Conference on Cluster Computing (CLUSTER) %I IEEE %C Austin, TX, USA %P 187-195 %8 2011-09 %G eng %0 Conference Proceedings %B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011 %D 2011 %T Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure %A George Bosilca %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %A A. Rezmerita %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. Nikolopoulos %E Jack Dongarra %K ftmpi %B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011 %I Springer %C Santorini, Greece %V 6960 %P 342-344 %8 2011-09 %G eng %0 Journal Article %J To appear in Geophysical Prospecting journal. %D 2011 %T Three-dimensional parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver %A Florent Sourbier %A Azzam Haidar %A Luc Giraud %A Hafedh Ben-Hadj-Ali %A Stéphane Operto %A Jean Virieux %B To appear in Geophysical Prospecting journal. %8 2011-00 %G eng %0 Generic %D 2011 %T Towards a Parallel Tile LDL Factorization for Multicore Architectures %A Dulceneia Becker %A Mathieu Faverge %A Jack Dongarra %K plasma %K quark %B ICL Technical Report %C Seattle, WA %8 2011-04 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Generic %D 2010 %T Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers %A Stanimire Tomov %A George Bosilca %A Cedric Augonnet %I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Tutorial %8 2010-07 %G eng %0 Journal Article %J Parallel Computing (to appear) %D 2010 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %B Parallel Computing (to appear) %8 2010-00 %G eng %0 Journal Article %J Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale %D 2010 %T Constructing Resilient Communication Infrastructure for Runtime Environments in Advances in Parallel Computing %A George Bosilca %A Camille Coti %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %E Barbara Chapman %E Frederic Desprez %E Gerhard R. Joubert %E Alain Lichnewsky %E Frans Peters %E T.
Priol %B Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale %V 19 %P 441-451 %G eng %R 10.3233/978-1-60750-530-3-441 %0 Generic %D 2010 %T DAGuE: A generic distributed DAG engine for high performance computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %B Innovative Computing Laboratory Technical Report %8 2010-04 %G eng %0 Generic %D 2010 %T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-10-660 %8 2010-09 %G eng %0 Generic %D 2010 %T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K plasma %B Innovative Computing Laboratory Technical Report %8 2010-00 %G eng %0 Conference Proceedings %B Proceedings of EuroMPI 2010 %D 2010 %T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %E Jack Dongarra %E Michael Resch %E Rainer Keller %E Edgar Gabriel %K ftmpi %B Proceedings of EuroMPI 2010 %I Springer %C Stuttgart, Germany %8 2010-09 %G eng %0 Journal Article %J in Performance Tuning of Scientific Applications (to appear) %D 2010 %T Empirical Performance Tuning of Dense Linear Algebra Software %A Jack Dongarra %A Shirley Moore %E David Bailey %E Robert Lucas %E Sam Williams %B in Performance Tuning of Scientific Applications (to appear) %8 2010-00 %G eng %0 Conference Proceedings %B Proceedings of International Conference on Computational Science, ICCS 2010 (to appear) %D 2010 %T Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI %A Volodymyr Turchenko %A Lucio Grandinetti %A George Bosilca %A Jack Dongarra %K hpcchallenge %B Proceedings of International Conference on Computational Science, ICCS 2010 (to appear) %I Elsevier %C Amsterdam, The Netherlands %8 2010-06 %G eng %0 Generic %D 2010 %T International Exascale Software Project Roadmap v1.0 %A Jack Dongarra %A Pete Beckman %B University of Tennessee Computer Science Technical Report, UT-CS-10-654 %8 2010-05 %G eng %0 Generic %D 2010 %T Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J.
Squyres %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-10-663 %8 2010-11 %G eng %0 Conference Proceedings %B Proceedings of the 17th EuroMPI conference %D 2010 %T Locality and Topology aware Intra-node Communication Among Multicore CPUs %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Proceedings of the 17th EuroMPI conference %I LNCS %C Stuttgart, Germany %8 2010-09 %G eng %0 Conference Proceedings %B Proceedings of the Cray Users' Group Meeting %D 2010 %T Performance Evaluation for Petascale Quantum Simulation Tools %A Stanimire Tomov %A Wenchang Lu %A Jerzy Bernholc %A Shirley Moore %A Jack Dongarra %B Proceedings of the Cray Users' Group Meeting %C Atlanta, GA %8 2010-05 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience (online version) %D 2010 %T Redesigning the Message Logging Model for High Performance %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Concurrency and Computation: Practice and Experience (online version) %8 2010-06 %G eng %0 Journal Article %J PARA 2010 %D 2010 %T Scalability Study of a Quantum Simulation Code %A Jerzy Bernholc %A Miroslav Hodak %A Wenchang Lu %A Shirley Moore %A Stanimire Tomov %B PARA 2010 %C Reykjavik, Iceland %8 2010-06 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2010 %T Scheduling Dense Linear Algebra Operations on Multicore Processors %A Jakub Kurzak %A Hatem Ltaief %A Jack Dongarra %A Rosa M. Badia %K gridpac %K plasma %B Concurrency and Computation: Practice and Experience %V 22 %P 15-44 %8 2010-01 %G eng %0 Journal Article %J Journal of Scientific Computing %D 2010 %T Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures %A Hatem Ltaief %A Jakub Kurzak %A Jack Dongarra %A Rosa M.
Badia %K plasma %B Journal of Scientific Computing %V 18 %P 33-50 %8 2010-00 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2010 %T Self-Healing Network for Scalable Fault-Tolerant Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac-Grbovic %A Jack Dongarra %B Future Generation Computer Systems %V 26 %P 479-485 %8 2010-03 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience (to appear) %D 2010 %T SmartGridRPC: The new RPC model for high performance Grid Computing and Its Implementation in SmartGridSolve %A Thomas Brady %A Alexey Lastovetsky %A Keith Seymour %A Michele Guidolin %A Jack Dongarra %K netsolve %B Concurrency and Computation: Practice and Experience (to appear) %8 2010-01 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems %A Stanimire Tomov %A Jack Dongarra %A Marc Baboulin %K magma %B Parallel Computing %V 36 %P 232-240 %8 2010-00 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Using multiple levels of parallelism to enhance the performance of domain decomposition solvers %A Luc Giraud %A Azzam Haidar %A Stephane Pralet %E Costas Bekas %E Pasqua D’Ambra %E Ananth Grama %E Yousef Saad %E Petko Yanev %B Parallel Computing %I Elsevier journals %V 36 %P 285-296 %8 2010-00 %G eng %0 Journal Article %J Computer Physics Communications %D 2009 %T Accelerating Scientific Computations with Mixed Precision Algorithms %A Marc Baboulin %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %X On modern architectures, 32-bit operations are often at least twice as fast as 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
%B Computer Physics Communications %V 180 %P 2526-2533 %8 2009-12 %G eng %N 12 %R https://doi.org/10.1016/j.cpc.2008.11.005 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2009 %T Algorithmic Based Fault Tolerance Applied to High Performance Computing %A Jack Dongarra %A George Bosilca %A Remi Delmas %A Julien Langou %B Journal of Parallel and Distributed Computing %V 69 %P 410-416 %8 2009-00 %G eng %0 Journal Article %J Parallel Computing %D 2009 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B Parallel Computing %V 35 %P 38-53 %8 2009-00 %G eng %0 Journal Article %J Numerical Linear Algebra with Applications %D 2009 %T Computing the Conditioning of the Components of a Linear Least-squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B Numerical Linear Algebra with Applications %V 16 %P 517-533 %8 2009-00 %G eng %0 Generic %D 2009 %T Constructing resilient communication infrastructure for runtime environments %A George Bosilca %A Camille Coti %A Thomas Herault %A Pierre Lemarinier %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %8 2009-07 %G eng %0 Journal Article %J ParCo 2009 %D 2009 %T Constructing Resilient Communication Infrastructure for Runtime Environments %A Pierre Lemarinier %A George Bosilca %A Camille Coti %A Thomas Herault %A Jack Dongarra %B ParCo 2009 %C Lyon, France %8 2009-09 %G eng %0 Journal Article %J PPAM 2009 %D 2009 %T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems %A Jakub Kurzak %A Hatem Ltaief %A Jack Dongarra %A Rosa M. Badia %B PPAM 2009 %C Poland %8 2009-09 %G eng %0 Journal Article %J Euro-Par 2009, Lecture Notes in Computer Science %D 2009 %T Impact of Quad-core Cray XT4 System and Software Stack on Scientific Computation %A Sadaf Alam %A Richard F. Barrett %A Heike Jagode %A J. A. Kuehn %A Steve W. Poole %A R. Sankaran %K test %B Euro-Par 2009, Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %C Delft, The Netherlands %V 5704/2009 %P 334-344 %8 2009-08 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2009 %T The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community %A Jack Dongarra %A Pete Beckman %A Patrick Aerts %A Franck Cappello %A Thomas Lippert %A Satoshi Matsuoka %A Paul Messina %A Terry Moore %A Rick Stevens %A Anne Trefethen %A Mateo Valero %B International Journal of High Performance Computing Applications (to appear) %8 2009-07 %G eng %0 Conference Proceedings %B SciDAC 2009, Journal of Physics: Conference Series %D 2009 %T Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team %A Bronis R. de Supinski %A Sadaf Alam %A David Bailey %A Laura Carrington %A Chris Daley %A Anshu Dubey %A Todd Gamblin %A Dan Gunter %A Paul D. Hovland %A Heike Jagode %A Karen Karavanic %A Gabriel Marin %A John Mellor-Crummey %A Shirley Moore %A Boyana Norris %A Leonid Oliker %A Catherine Olschanowsky %A Philip C.
Roth %A Martin Schulz %A Sameer Shende %A Allan Snavely %K test %B SciDAC 2009, Journal of Physics: Conference Series %I IOP Publishing %C San Diego, California %V 180(2009)012039 %8 2009-07 %G eng %0 Journal Article %J in Cyberinfrastructure Technologies and Applications %D 2009 %T Parallel Dense Linear Algebra Software in the Multicore Era %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %E Junwei Cao %K plasma %B in Cyberinfrastructure Technologies and Applications %I Nova Science Publishers, Inc. %P 9-24 %8 2009-00 %G eng %0 Conference Proceedings %B Proceedings of CUG09 %D 2009 %T Performance evaluation for petascale quantum simulation tools %A Stanimire Tomov %A Wenchang Lu %A Jerzy Bernholc %A Shirley Moore %A Jack Dongarra %K doe-nano %B Proceedings of CUG09 %C Atlanta, GA %8 2009-05 %G eng %0 Conference Paper %B CLUSTER '09 %D 2009 %T Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery %A Aurelien Bouteiller %A Thomas Ropars %A George Bosilca %A Christine Morin %A Jack Dongarra %K fault tolerant computing %K libraries message passing %K parallel machines %K protocols %X With the growing scale of high performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proven to tolerate higher failure rates. However, this advantage comes at the expense of higher communication overhead, due to the latency-intrusive logging of events to stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable. %B CLUSTER '09 %I IEEE %C New Orleans %8 2009-08 %G eng %R 10.1109/CLUSTR.2009.5289157 %0 Generic %D 2009 %T Scheduling Linear Algebra Operations on Multicore Processors %A Jakub Kurzak %A Hatem Ltaief %A Jack Dongarra %A Rosa M. Badia %B University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213) %8 2009-00 %G eng %0 Journal Article %J Concurrency Practice and Experience (to appear) %D 2009 %T Scheduling Linear Algebra Operations on Multicore Processors %A Jakub Kurzak %A Hatem Ltaief %A Jack Dongarra %A Rosa M. Badia %K plasma %B Concurrency Practice and Experience (to appear) %8 2009-00 %G eng %0 Conference Proceedings %B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science %D 2008 %E Marian Bubak %E Geert Dick van Albada %E Jack Dongarra %E Peter M.
Sloot %B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science %I Springer Berlin %C Krakow, Poland %V 5101 %8 2008-01 %G eng %0 Generic %D 2008 %T Algorithmic Based Fault Tolerance Applied to High Performance Computing %A George Bosilca %A Remi Delmas %A Jack Dongarra %A Julien Langou %B University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205) %8 2008-01 %G eng %0 Journal Article %J VECPAR '08, High Performance Computing for Computational Science %D 2008 %T Computing the Conditioning of the Components of a Linear Least Squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B VECPAR '08, High Performance Computing for Computational Science %C Toulouse, France %8 2008-01 %G eng %0 Generic %D 2008 %T Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project) %A Marc Baboulin %A James Demmel %A Jack Dongarra %A Stanimire Tomov %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08) %C Austin, TX %8 2008-11 %G eng %0 Journal Article %J in High Performance Computing and Grids in Action %D 2008 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B in High Performance Computing and Grids in Action %I IOS Press %C Amsterdam %8 2008-01 %G eng %0 Conference Proceedings %B 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008) %D 2008 %T Fault Tolerance Management for a Hierarchical GridRPC Middleware %A Aurelien Bouteiller %A Frederic Desprez %B 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008) %C Lyon, France %8 2008-01 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2008 %T Parallel Tiled QR Factorization for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %B Concurrency and Computation: Practice and Experience %V 20 %P 1573-1590 %8 2008-01 %G eng %0 Journal Article %J Proc. SciDAC 2008 %D 2008 %T PERI Auto-tuning %A David Bailey %A Jacqueline Chame %A Chun Chen %A Jack Dongarra %A Mary Hall %A Jeffrey K. Hollingsworth %A Paul D. Hovland %A Shirley Moore %A Keith Seymour %A Jaewook Shin %A Ananta Tiwari %A Sam Williams %A Haihang You %K gco %B Proc.
SciDAC 2008 %I Journal of Physics %C Seattle, Washington %V 125 %8 2008-01 %G eng %0 Journal Article %J Computing in Science and Engineering %D 2008 %T The PlayStation 3 for High Performance Scientific Computing %A Jakub Kurzak %A Alfredo Buttari %A Piotr Luszczek %A Jack Dongarra %B Computing in Science and Engineering %P 80-83 %8 2008-01 %G eng %0 Generic %D 2008 %T The PlayStation 3 for High Performance Scientific Computing %A Jakub Kurzak %A Alfredo Buttari %A Piotr Luszczek %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2008-01 %G eng %0 Conference Proceedings %B International Supercomputer Conference (ISC 2008) %D 2008 %T Redesigning the Message Logging Model for High Performance %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B International Supercomputer Conference (ISC 2008) %C Dresden, Germany %8 2008-01 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2008 %T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %V 19 %P 1-11 %8 2008-01 %G eng %0 Conference Proceedings %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %D 2008 %T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures %A Marc Baboulin %A Stanimire Tomov %A Jack Dongarra %K magma %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %C Trondheim, Norway %8 2008-05 %G eng %0 Generic %D 2008 %T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures %A Marc Baboulin %A Jack Dongarra %A Stanimire Tomov %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200) %8 2008-01 %G eng %0 Generic %D 2008 %T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems %A Stanimire Tomov %A Jack Dongarra %A Marc Baboulin %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210) %8 2008-01 %G eng %0 Generic %D 2008 %T Using dual techniques to derive componentwise and mixed condition numbers for a linear functional of a linear least squares solution %A Marc Baboulin %A Serge Gratton %B University of Tennessee Computer Science Technical Report, UT-CS-08-622 (also LAPACK Working Note 207) %8 2008-01 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software %D 2008 %T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %K plasma %B ACM Transactions on Mathematical Software %V 34 %P 17-22 %8 2008-00 %G eng %0 Conference Proceedings %B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07) %D 2007 %T Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07) %I Springer %C Niagara Falls, Canada %8 2007-08 %G eng %0 Generic %D 2007 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical
Report %8 2007-01 %G eng %0 Journal Article %J Cray User Group, CUG 2007 %D 2007 %T A Comparison of Application Performance Using Open MPI and Cray MPI %A Richard L. Graham %A George Bosilca %A Jelena Pjesivac–Grbovic %B Cray User Group, CUG 2007 %8 2007-05 %G eng %0 Generic %D 2007 %T Computing the Conditioning of the Components of a Linear Least Squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B University of Tennessee Computer Science Technical Report %8 2007-01 %G eng %0 Journal Article %J DOE SciDAC Review (to appear) %D 2007 %T Creating Software Technology to Harness the Power of Leadership-class Computing Systems %A John Mellor-Crummey %A Pete Beckman %A Jack Dongarra %A Barton Miller %A Katherine Yelick %B DOE SciDAC Review (to appear) %8 2007-06 %G eng %0 Journal Article %J Euro-Par 2007 %D 2007 %T Decision Trees and MPI Collective Algorithm Selection Problem %A Jelena Pjesivac–Grbovic %A George Bosilca %A Graham Fagg %A Thara Angskun %A Jack Dongarra %K ftmpi %B Euro-Par 2007 %I Springer %C Rennes, France %P 105-115 %8 2007-08 %G eng %0 Journal Article %J in Petascale Computing: Algorithms and Applications (to appear) %D 2007 %T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach %A Jack Dongarra %A Zizhong Chen %A George Bosilca %A Julien Langou %B in Petascale Computing: Algorithms and Applications (to appear) %I Chapman & Hall - CRC Press %8 2007-00 %G eng %0 Journal Article %J EuroPVM/MPI 2007 %D 2007 %T An Evaluation of Open MPI's Matching Transport Layer on the Cray XT %A Richard L. Graham %A Ron Brightwell %A Brian Barrett %A George Bosilca %A Jelena Pjesivac–Grbovic %B EuroPVM/MPI 2007 %8 2007-09 %G eng %0 Journal Article %J In High Performance Computing and Grids in Action (to appear) %D 2007 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B In High Performance Computing and Grids in Action (to appear) %I IOS Press %C Amsterdam %8 2007-00 %G eng %0 Generic %D 2007 %T Limitations of the PlayStation 3 for High Performance Cluster Computing %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %B University of Tennessee Computer Science Technical Report, UT-CS-07-597 (Also LAPACK Working Note 185) %8 2007-00 %G eng %0 Journal Article %J International Journal of High Performance Computer Applications (to appear) %D 2007 %T Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems %A Alfredo Buttari %A Jack Dongarra %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Jakub Kurzak %B International Journal of High Performance Computer Applications (to appear) %8 2007-08 %G eng %0 Journal Article %J Parallel Computing (Special Edition: EuroPVM/MPI 2006) %D 2007 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A George Bosilca %A Graham Fagg %A Thara Angskun %A Jack Dongarra %K ftmpi %B Parallel Computing (Special Edition: EuroPVM/MPI 2006) %I Elsevier %8 2007-00 %G eng %0 Conference Proceedings %B Journal of Physics: Conference Series, SciDAC 2007 %D 2007 %T Multithreading for synchronization tolerance in matrix factorization %A Alfredo Buttari %A Jack Dongarra %A Parry Husbands %A Jakub Kurzak %A Katherine Yelick %B Journal of Physics: Conference Series, SciDAC 2007 %V 78 %8 2007-01 %G eng %0 Book Section %B Distributed and Parallel Systems %D 2007 %T A New Approach to MPI
Collective Communication Implementations %A Torsten Hoefler %A Jeffrey M. Squyres %A Graham Fagg %A George Bosilca %A Wolfgang Rehm %A Andrew Lumsdaine %K Automatic Selection %K Collective Operation %K Framework %K Message Passing (MPI) %K Open MPI %X Recent research into the optimization of collective MPI operations has resulted in a wide variety of algorithms and corresponding implementations, each typically only applicable in a relatively narrow scope: on a specific architecture, on a specific network, with a specific number of processes, with a specific data size and/or data-type – or any combination of these (or other) factors. This situation presents an enormous challenge to portable MPI implementations which are expected to provide optimized collective operation performance on all platforms. Many portable implementations have attempted to provide a token number of algorithms that are intended to realize good performance on most systems. However, many platform configurations are still left without well-tuned collective operations. This paper presents a proposal for a framework that will allow a wide variety of collective algorithm implementations and a flexible, multi-tiered selection process for choosing which implementation to use when an application invokes an MPI collective function. %B Distributed and Parallel Systems %I Springer US %P 45-54 %@ 978-0-387-69857-1 %G eng %R 10.1007/978-0-387-69858-8_5 %0 Conference Proceedings %B The International Conference on Parallel and Distributed Computing, applications and Technologies (PDCAT) %D 2007 %T Optimal Routing in Binomial Graph Networks %A Thara Angskun %A George Bosilca %A Brad Vander Zanden %A Jack Dongarra %K ftmpi %B The International Conference on Parallel and Distributed Computing, applications and Technologies (PDCAT) %I IEEE Computer Society %C Adelaide, Australia %8 2007-12 %G eng %0 Generic %D 2007 %T Parallel Tiled QR Factorization for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Dept. 
Technical Report, UT-CS-07-598 (also LAPACK Working Note 190) %8 2007-00 %G eng %0 Journal Article %J Cluster computing %D 2007 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B Cluster computing %I Springer Netherlands %V 10 %P 127-143 %8 2007-06 %G eng %0 Journal Article %J SIAM SISC (to appear) %D 2007 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A Julien Langou %A Zizhong Chen %A George Bosilca %A Jack Dongarra %B SIAM SISC (to appear) %8 2007-05 %G eng %0 Conference Proceedings %B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) %D 2007 %T Reliability Analysis of Self-Healing Network using Discrete-Event Simulation %A Thara Angskun %A George Bosilca %A Graham Fagg %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) %I IEEE Computer Society %P 437-444 %8 2007-05 %G eng %0 Journal Article %J Accepted for Euro PVM/MPI 2007 %D 2007 %T Retrospect: Deterministic Relay of MPI Applications for Interactive Distributed Debugging %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %K ftmpi %B Accepted for Euro PVM/MPI 2007 %I Springer %8 2007-09 %G eng %0 Generic %D 2007 %T SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3 %A Alfredo Buttari %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %A George Bosilca %K multi-core %B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595 %8 2007-00 %G eng %0 Conference Proceedings %B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007) %D 2007 %T Self-Healing in Binomial Graph Networks %A Thara Angskun %A George Bosilca %A Jack Dongarra %B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007) %C Vilamoura, Algarve, Portugal %8 2007-11 %G eng %0 Generic %D 2007 %T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %K lapack %B UT Computer Science Technical Report (Also LAPACK Working Note 184) %8 2007-01 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (submitted) %D 2006 %T Application of Machine Learning to the Selection of Sparse Linear Solvers %A Sanjukta Bhowmick %A Victor Eijkhout %A Yoav Freund %A Erika Fuentes %A David Keyes %K salsa %K sans %B International Journal of High Performance Computing Applications (submitted) %8 2006-00 %G eng %0 Journal Article %J University of Tennessee Computer Science Tech Report %D 2006 %T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %K iter-ref %B University of Tennessee Computer Science Tech Report %8 2006-04 %G eng %0 Journal Article %J 2006 Euro PVM/MPI (submitted) %D 2006 %T Flexible collective communication tuning architecture applied to Open MPI %A Graham Fagg %A Jelena Pjesivac–Grbovic %A George Bosilca %A Thara Angskun %A Jack Dongarra %K ftmpi %B 2006 Euro PVM/MPI (submitted) %C Bonn, Germany %8 2006-01 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2006 %T FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study %A David Dewolfs %A Jan Broeckhove %A Vaidy
Sunderam %A Graham Fagg %K ftmpi %B Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %V 4192 %P 133-140 %8 2006-00 %G eng %0 Journal Article %J Euro PVM/MPI 2006 %D 2006 %T High Performance RDMA Protocols in HPC %A Galen M. Shipman %A George Bosilca %A Arthur B. Maccabe %B Euro PVM/MPI 2006 %C Bonn, Germany %8 2006-09 %G eng %0 Journal Article %J HeteroPar 2006 %D 2006 %T A High-Performance, Heterogeneous MPI %A Richard L. Graham %A Galen M. Shipman %A Brian Barrett %A Ralph Castain %A George Bosilca %A Andrew Lumsdaine %B HeteroPar 2006 %C Barcelona, Spain %8 2006-09 %G eng %0 Conference Proceedings %B SC06 Conference Tutorial %D 2006 %T The HPC Challenge (HPCC) Benchmark Suite %A Piotr Luszczek %A David Bailey %A Jack Dongarra %A Jeremy Kepner %A Robert Lucas %A Rolf Rabenseifner %A Daisuke Takahashi %K hpcc %K hpcchallenge %B SC06 Conference Tutorial %I IEEE %C Tampa, Florida %8 2006-11 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T The Impact of Multicore on Math Software %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %K plasma %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J Euro PVM/MPI 2006 %D 2006 %T Implementation and Usage of the PERUSE-Interface in Open MPI %A Rainer Keller %A George Bosilca %A Graham Fagg %A Michael Resch %A Jack Dongarra %B Euro PVM/MPI 2006 %C Bonn, Germany %8 2006-09 %G eng %0 Generic %D 2006 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B ICL Technical Report %8 2006-00 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2006 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %V 4192 %P 40-48 %8 2006-09 %G eng %0 Journal Article %J J. Phys.: Conf. Ser. %D 2006 %T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures %A Alex Zunger %A Alberto Franceschetti %A Gabriel Bester %A Wesley B. Jones %A Kwiseon Kim %A Peter A. Graf %A Lin-Wang Wang %A Andrew Canning %A Osni Marques %A Christof Voemel %A Jack Dongarra %A Julien Langou %A Stanimire Tomov %K DOE_NANO %B J. Phys.: Conf. Ser. %V 46 %P 292-298 %8 2006-01 %G eng %R https://doi.org/10.1088/1742-6596/46/1/040 %0 Journal Article %J PARA 2006 %D 2006 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries %A James Demmel %A Jack Dongarra %A B. Parlett %A William Kahan %A Ming Gu %A David Bindel %A Yozo Hida %A Xiaoye Li %A Osni Marques %A Jason E.
Riedy %A Christof Voemel %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Julie Langou %A Stanimire Tomov %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J 2006 Euro PVM/MPI %D 2006 %T Scalable Fault Tolerant Protocol for Parallel Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %B 2006 Euro PVM/MPI %C Bonn, Germany %8 2006-00 %G eng %0 Journal Article %J IBM Journal of Research and Development %D 2006 %T Self Adapting Numerical Software SANS Effort %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Victor Eijkhout %A Graham Fagg %A Erika Fuentes %A Julien Langou %A Piotr Luszczek %A Jelena Pjesivac–Grbovic %A Keith Seymour %A Haihang You %A Sathish Vadhiyar %K gco %B IBM Journal of Research and Development %V 50 %P 223-238 %8 2006-01 %G eng %0 Conference Proceedings %B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems %D 2006 %T Self-Healing Network for Scalable Fault Tolerant Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems %C Innsbruck, Austria %8 2006-01 %G eng %0 Conference Proceedings %B In Proceedings of the International Conference on Parallel Processing %D 2005 %T Automatic Experimental Analysis of Communication Patterns in Virtual Topologies %A Nikhil Bhatia %A Fengguang Song %A Felix Wolf %A Jack Dongarra %A Bernd Mohr %A Shirley Moore %K kojak %B In Proceedings of the International Conference on Parallel Processing %I IEEE Computer Society %C Oslo, Norway %8 2005-06 %G eng %0 Conference Proceedings %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %D 2005 %T Fault Tolerant High Performance Computing by a Coding Approach %A Zizhong Chen %A Graham Fagg %A Edgar Gabriel %A Julien Langou %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %K grads %K lacsi %K sans %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %C Chicago, Illinois %8 2005-01 %G eng %0 Conference Proceedings %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %D 2005 %T Hash Functions for Datatype Signatures in MPI %A George Bosilca %A Jack Dongarra %A Graham Fagg %A Julien Langou %E Beniamino Di Martino %K ftmpi %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %I Springer-Verlag Berlin %C Sorrento (Naples), Italy %V 3666 %P 76-83 %8 2005-09 %G eng %0 Journal Article %D 2005 %T Introduction to the HPC Challenge Benchmark Suite %A Piotr Luszczek %A Jack Dongarra %A David Koester %A Rolf Rabenseifner %A Bob Lucas %A Jeremy Kepner %A John McCalpin %A David Bailey %A Daisuke Takahashi %K hpcc %K hpcchallenge %8 2005-03 %G eng %0 Journal Article %J Journal of Physics: Conference Series %D 2005 %T NanoPSE: A Nanoscience Problem Solving Environment for Atomistic Electronic Structure of Semiconductor Nanostructures %A Wesley B. Jones %A Gabriel Bester %A Andrew Canning %A Alberto Franceschetti %A Peter A.
Graf %A Kwiseon Kim %A Julien Langou %A Lin-Wang Wang %A Jack Dongarra %A Alex Zunger %X Researchers at the National Renewable Energy Laboratory and their collaborators have, over the past ~10 years, developed a set of algorithms for an atomistic description of the electronic structure of nanostructures, based on plane-wave pseudopotentials and configuration interaction. The present contribution describes the first step in assembling these various codes into a single, portable, integrated set of software packages. This package is part of an ongoing research project and is still in the development stage. Components of NanoPSE include codes for atomistic nanostructure generation and passivation, a valence force field model for atomic relaxation, code for potential field generation, an empirical pseudopotential method solver, a strained linear combination of bulk bands method solver, a configuration interaction solver for excited states, a selection of linear algebra methods, and several inverse band structure solvers. Although the package is not available for general distribution at this time, as it is still being developed and tested, the design goal of the NanoPSE software is to provide a software context for collaboration. The software package is enabled by fcdev, an integrated collection of best-practice GNU software for open source development and distribution, augmented to better support FORTRAN. %B Journal of Physics: Conference Series %P 277-282 %8 2005-06 %G eng %U https://iopscience.iop.org/article/10.1088/1742-6596/16/1/038/meta %N 16 %R https://doi.org/10.1088/1742-6596/16/1/038 %0 Journal Article %J International Journal of Parallel Programming %D 2005 %T New Grid Scheduling and Rescheduling Methods in the GrADS Project %A Francine Berman %A Henri Casanova %A Andrew Chien %A Keith Cooper %A Holly Dail %A Anshuman Dasgupta %A Wei Deng %A Jack Dongarra %A Lennart Johnsson %A Ken Kennedy %A Charles Koelbel %A Bo Liu %A Xu Liu %A Anirban Mandal %A Gabriel Marin %A Mark Mazina %A John Mellor-Crummey %A Celso Mendes %A A. Olugbile %A Jignesh M. Patel %A Dan Reed %A Zhiao Shi %A Otto Sievert %A H.
Xia %A Asim YarKhan %K grads %B International Journal of Parallel Programming %I Springer %V 33 %P 209-229 %8 2005-06 %G eng %0 Conference Proceedings %B Workshop on Patterns in High Performance Computing %D 2005 %T A Pattern-Based Approach to Automated Application Performance Analysis %A Nikhil Bhatia %A Shirley Moore %A Felix Wolf %A Jack Dongarra %A Bernd Mohr %K kojak %B Workshop on Patterns in High Performance Computing %C University of Illinois at Urbana-Champaign %8 2005-05 %G eng %0 Conference Proceedings %B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05) %D 2005 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05) %C Denver, Colorado %8 2005-04 %G eng %0 Journal Article %J Cluster Computing Journal (to appear) %D 2005 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B Cluster Computing Journal (to appear) %8 2005-01 %G eng %0 Generic %D 2005 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Julien Langou %K ft-la %B University of Tennessee Computer Science Department Technical Report, UT-CS-04-538 %8 2005-00 %G eng %0 Conference Proceedings %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %D 2005 %T Scalable Fault Tolerant MPI: Extending the Recovery Algorithm %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %E Beniamino Di Martino %K ftmpi %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %I Springer-Verlag Berlin %C Sorrento (Naples), Italy %V 3666 %P 67 %8 2005-09 %G eng %0 Conference Proceedings %B 4th International Symposium on Cluster Computing and the Grid (CCGrid 2004)(submitted) %D 2004 %T Active Logistical State Management in the GridSolve/L %A Micah Beck %A Jack Dongarra %A Jian Huang %A Terry Moore %A James Plank %K netsolve %B 4th International Symposium on Cluster Computing and the Grid (CCGrid 2004)(submitted) %C Chicago, Illinois %8 2004-01 %G eng %0 Conference Proceedings %B 2004 International Conference on Parallel Processing (ICPP-04) %D 2004 %T An Algebra for Cross-Experiment Performance Analysis %A Fengguang Song %A Felix Wolf %A Nikhil Bhatia %A Jack Dongarra %A Shirley Moore %K kojak %B 2004 International Conference on Parallel Processing (ICPP-04) %C Montreal, Quebec, Canada %8 2004-08 %G eng %0 Journal Article %J Oak Ridge National Laboratory Report %D 2004 %T Cray X1 Evaluation Status Report %A Pratul Agarwal %A R. A. Alexander %A E. Apra %A Satish Balay %A Arthur S. Bland %A James Colgan %A Eduardo D'Azevedo %A Jack Dongarra %A Tom Dunigan %A Mark Fahey %A Al Geist %A M. Gordon %A Robert Harrison %A Dinesh Kaushik %A M. Krishnakumar %A Piotr Luszczek %A Tony Mezzacappa %A Jeff Nichols %A Jarek Nieplocha %A Leonid Oliker %A T. Packwood %A M. Pindzola %A Thomas C. Schulthess %A Jeffrey Vetter %A James B. White %A T. Windus %A Patrick H.
Worley %A Thomas Zacharia %B Oak Ridge National Laboratory Report %V ORNL/TM-2004/13 %8 2004-01 %G eng %0 Conference Proceedings %B International Conference on Computational Science %D 2004 %T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations %A Piotr Luszczek %A Jack Dongarra %E Marian Bubak %E Geert Dick van Albada %E Peter M. Sloot %E Jack Dongarra %K lacsi %K lfc %B International Conference on Computational Science %I Springer Verlag %C Poland %8 2004-06 %G eng %R 10.1007/978-3-540-25944-2_35 %0 Conference Proceedings %B Proceedings of ISC2004 (to appear) %D 2004 %T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems %A Graham Fagg %A Edgar Gabriel %A George Bosilca %A Thara Angskun %A Zizhong Chen %A Jelena Pjesivac–Grbovic %A Kevin London %A Jack Dongarra %K ftmpi %K lacsi %B Proceedings of ISC2004 (to appear) %C Heidelberg, Germany %8 2004-06 %G eng %0 Generic %D 2004 %T Performance Optimization and Modeling of Blocked Sparse Kernels %A Alfredo Buttari %A Victor Eijkhout %A Julien Langou %A Salvatore Filippone %K sans %B ICL Technical Report %8 2004-00 %G eng %0 Journal Article %J International Journal for High Performance Applications and Supercomputing (to appear) %D 2004 %T Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing %A Graham Fagg %A Edgar Gabriel %A Zizhong Chen %A Thara Angskun %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %K lacsi %B International Journal for High Performance Applications and Supercomputing (to appear) %8 2004-04 %G eng %0 Generic %D 2004 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Julien Langou %B ICL Technical Report %8 2004-01 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2004 %T The Virtual Instrument: Support for Grid-enabled Scientific Simulations %A Henri Casanova %A Thomas Bartol %A Francine Berman %A Adam Birnbaum %A Jack Dongarra %A Mark Ellisman %A Marcio Faerman %A Erhan Gockay %A Michelle Miller %A Graziano Obertelli %A Stuart Pomerantz %A Terry Sejnowski %A Joel Stiles %A Rich Wolski %B International Journal of High Performance Computing Applications %V 18 %P 3-17 %8 2004-01 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T Computational Science — ICCS 2003 %A Peter M. Sloot %A David Abramson %A Alexander V. Bogdanov %A Jack Dongarra %A Albert Zomaya %A Yuriy Gorbachev %B Lecture Notes in Computer Science %I Springer-Verlag, Berlin %C ICCS 2003, International Conference. Melbourne, Australia %V 2657-2660 %8 2003-06 %G eng %0 Journal Article %J ICL Tech Report %D 2003 %T Distributed Storage in RIB %A Thomas B.
Boehmann %K rib %B ICL Tech Report %8 2003-03 %G eng %0 Conference Proceedings %B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented) %D 2003 %T Fault Tolerant Communication Library and Applications for High Performance Computing %A Graham Fagg %A Edgar Gabriel %A Zizhong Chen %A Thara Angskun %A George Bosilca %A Antonin Bukovsky %A Jack Dongarra %K ftmpi %K lacsi %B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented) %C Santa Fe, NM %8 2003-10 %G eng %0 Conference Proceedings %B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science %D 2003 %T A Fault-Tolerant Communication Library for Grid Environments %A Edgar Gabriel %A Graham Fagg %A Antonin Bukovsky %A Thara Angskun %A Jack Dongarra %K ftmpi %K lacsi %B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science %C San Francisco %8 2003-06 %G eng %0 Conference Proceedings %B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference %D 2003 %T GrADSolve - RPC for High Performance Computing on the Grid %A Sathish Vadhiyar %A Jack Dongarra %A Asim YarKhan %E Harald Kosch %E Laszlo Boszormenyi %E Hermann Hellwagner %K netsolve %B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference %I Springer-Verlag, Berlin %C Klagenfurt, Austria %V 2790 %P 394-403 %8 2003-01 %G eng %R 10.1007/978-3-540-45209-6_58 %0 Journal Article %J Making the Global Infrastructure a Reality %D 2003 %T NetSolve: Past, Present, and Future - A Look at a Grid Enabled Server %A Sudesh Agrawal %A Jack Dongarra %A Keith Seymour %A Sathish Vadhiyar %E Francine Berman %E Geoffrey Fox %E Anthony Hey %K netsolve %B Making the Global Infrastructure a Reality %I Wiley Publishing %8 2003-00 %G eng %0 Conference Proceedings %B Proceedings of the IPDPS 2003, NGS Workshop %D 2003 %T Optimizing Performance and Reliability in Distributed Computing Systems Through Wide Spectrum Storage %A James Plank %A Micah Beck %A Jack Dongarra %A Rich Wolski %A Henri Casanova %B Proceedings of the IPDPS 2003, NGS Workshop %C Nice, France %P 209 %8 2003-01 %G eng %0 Conference Proceedings %B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles %D 2003 %T Scalable, Trustworthy Network Computing Using Untrusted Intermediaries: A Position Paper %A Micah Beck %A Jack Dongarra %A Victor Eijkhout %A Mike Langston %A Terry Moore %A James Plank %K netsolve %B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles %C National Conference Center - Lansdowne, Virginia %8 2003-03 %G eng %0 Journal Article %J Resource Management in the Grid %D 2003 %T Scheduling in the Grid Application Development Software Project %A Holly Dail %A Otto Sievert %A Francine Berman %A Henri Casanova %A Asim YarKhan %A Sathish Vadhiyar %A Jack Dongarra %A Chuang Liu %A Lingyun Yang %A Dave Angulo %A Ian Foster %K grads %B Resource Management in the Grid %I Kluwer Publishers %8 2003-03 %G eng %0 Journal Article %J Statistical Data Mining and Knowledge Discovery %D 2003 %T The Semantic Conference Organizer %A Kevin Heinrich %A Michael Berry %A Jack Dongarra %A Sathish Vadhiyar %E Hamparsum Bozdogan %K netsolve %B Statistical Data Mining and Knowledge Discovery %I CRC Press %8 2003-00 %G eng %0 Journal Article %J Journal of Digital Information special issue on Interactivity in Digital Libraries %D 2002
%T Active Netlib: An Active Mathematical Software Collection for Inquiry-based Computational Science and Engineering Education %A Shirley Moore %A A.J. Baker %A Jack Dongarra %A Christian Halloy %A Chung Ng %K activenetlib %K rib %B Journal of Digital Information special issue on Interactivity in Digital Libraries %V 2 %8 2002-00 %G eng %0 Conference Proceedings %B Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002) %D 2002 %T The Internet Backplane Protocol: A Study in Resource Sharing %A Alessandro Bassi %A Micah Beck %A Graham Fagg %A Terry Moore %A James Plank %A Martin Swany %A Rich Wolski %K ftmpi %B Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002) %C Berlin, Germany %8 2002-10 %G eng %0 Journal Article %J Parallel Computing %D 2002 %T Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B Parallel Computing %V 28 %P 1773-1788 %8 2002-08 %G eng %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2002 %T Numerical Libraries and Tools for Scalable Parallel Cluster Computing %A Shirley Browne %A Jack Dongarra %A Anne Trefethen %B International Journal of High Performance Applications and Supercomputing %V 15 %P 175-180 %8 2002-10 %G eng %0 Conference Proceedings %B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops %D 2002 %T Toward a Framework for Preparing and Executing Adaptive Grid Programs %A Ken Kennedy %A John Mellor-Crummey %A Keith Cooper %A Linda Torczon %A Francine Berman %A Andrew Chien %A Dave Angulo %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A Carl Kesselman %A Jack Dongarra %A Sathish Vadhiyar %K grads %B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops %C Fort Lauderdale, FL %P 0171 %8 2002-04 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software %D 2002 %T An Updated Set of Basic Linear Algebra Subprograms (BLAS) %A Susan Blackford %A James Demmel %A Jack Dongarra %A Iain Duff %A Sven Hammarling %A Greg Henry %A Michael Heroux %A Linda Kaufman %A Andrew Lumsdaine %A Antoine Petitet %A Roldan Pozo %A Karin Remington %A Clint Whaley %B ACM Transactions on Mathematical Software %V 28 %P 135-151 %8 2002-12 %G eng %R 10.1145/567806.567807 %0 Generic %D 2002 %T Users' Guide to NetSolve v1.4.1 %A Sudesh Agrawal %A Dorian Arnold %A Susan Blackford %A Jack Dongarra %A Michelle Miller %A Kiran Sagi %A Zhiao Shi %A Keith Seymour %A Sathish Vadhiyar %K netsolve %B ICL Technical Report %8 2002-06 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing (submitted) %D 2002 %T The Virtual Instrument: Support for Grid-enabled Scientific Simulations %A Henri Casanova %A Thomas Bartol %A Francine Berman %A Adam Birnbaum %A Jack Dongarra %A Mark Ellisman %A Marcio Faerman %A Erhan Gockay %A Michelle Miller %A Graziano Obertelli %A Stuart Pomerantz %A Terry Sejnowski %A Joel Stiles %A Rich Wolski %B Journal of Parallel and Distributed Computing (submitted) %8 2002-10 %G eng %0 Journal Article %J (an update), submitted to ACM TOMS %D 2001 %T Basic Linear Algebra Subprograms (BLAS) %A Susan Blackford %A James Demmel %A Jack Dongarra %A Iain Duff %A Sven Hammarling %A Greg Henry %A Michael Heroux %A Linda Kaufman
%A Andrew Lumsdaine %A Antoine Petitet %A Roldan Pozo %A Karin Remington %A Clint Whaley %B (an update), submitted to ACM TOMS %8 2001-02 %G eng %0 Conference Proceedings %B Tenth International World Wide Web Conference Proceedings (to appear), %D 2001 %T Enabling Full Service Surrogates Using the Portable Channel Representation %A Micah Beck %A Terry Moore %A Leif Abrahamsson %A Christophe Achouiantz %A Patrik Johansson %B Tenth International World Wide Web Conference Proceedings (to appear), %C Hong Kong %8 2001-05 %G eng %0 Conference Proceedings %B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science %D 2001 %T Fault Tolerant MPI for the HARNESS Meta-Computing System %A Graham Fagg %A Antonin Bukovsky %A Jack Dongarra %E Benjoe A. Juliano %E R. Renner %E K. Tan %K ftmpi %K harness %B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science %I Springer Verlag %C Berlin %V 2073 %P 355-366 %8 2001-00 %G eng %R 10.1007/3-540-45545-0_44 %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2001 %T The GrADS Project: Software Support for High-Level Grid Application Development %A Francine Berman %A Andrew Chien %A Keith Cooper %A Jack Dongarra %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A Ken Kennedy %A Carl Kesselman %A John Mellor-Crummey %A Dan Reed %A Linda Torczon %A Rich Wolski %K grads %B International Journal of High Performance Applications and Supercomputing %V 15 %P 327-344 %8 2001-01 %G eng %0 Journal Article %J Parallel Computing %D 2001 %T HARNESS and Fault Tolerant MPI %A Graham Fagg %A Antonin Bukovsky %A Jack Dongarra %B Parallel Computing %V 27 %P 1479-1496 %8 2001-01 %G eng %0 Generic %D 2001 %T Internet Backplane Protocol: API 1.0 %A Alessandro Bassi %A Micah Beck %A James Plank %A Rich Wolski %B University of Tennessee Computer Science Technical Report %8 2001-01 %G eng %0 Generic %D 2001 %T Internet Backplane Protocol - Test Language v.
1.0 %A Alessandro Bassi %A Xiaoye Li %B University of Tennessee Computer Science Technical Report %8 2001-01 %G eng %0 Journal Article %J submitted to SC2001 %D 2001 %T Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B submitted to SC2001 %C Denver, Colorado %8 2001-11 %G eng %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2001 %T Numerical Libraries and The Grid %A Antoine Petitet %A Susan Blackford %A Jack Dongarra %A Brett Ellis %A Graham Fagg %A Kenneth Roche %A Sathish Vadhiyar %K grads %B International Journal of High Performance Applications and Supercomputing %V 15 %P 359-374 %8 2001-01 %G eng %0 Generic %D 2001 %T Numerical Libraries and The Grid: The GrADS Experiments with ScaLAPACK %A Antoine Petitet %A Susan Blackford %A Jack Dongarra %A Brett Ellis %A Graham Fagg %A Kenneth Roche %A Sathish Vadhiyar %K grads %K scalapack %B University of Tennessee Computer Science Technical Report %8 2001-01 %G eng %0 Generic %D 2001 %T Repository in a Box Toolkit for Software and Resource Sharing %A Shirley Browne %A Paul McMahan %A Scott Wells %K rib %B University of Tennessee Computer Science Department Technical Report %8 2001-00 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2001 %T Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries %A Ken Kennedy %A Bradley Broom %A Keith Cooper %A Jack Dongarra %A Rob Fowler %A Dennis Gannon %A Lennart Johnsson %A John Mellor-Crummey %A Linda Torczon %B Journal of Parallel and Distributed Computing %V 61 %P 1803-1826 %8 2001-12 %G eng %0 Generic %D 2000 %T The GrADS Project: Software Support for High-Level Grid Application Development %A Francine Berman %A Andrew Chien %A Keith Cooper %A Jack Dongarra %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A Ken Kennedy %A Carl Kesselman %A Dan Reed %A Linda Torczon %A Rich Wolski %K grads %B Technical Report %8 2000-02 %G eng %0 Journal Article %J In Active Middleware Services, Ed. Salim Hariri, Craig A. Lee, Cauligi S. Raghavendra (2000), Kluwer Academic %D 2000 %T Logistical Networking: Sharing More Than the Wires %A Micah Beck %A Terry Moore %A James Plank %A Martin Swany %B In Active Middleware Services, Ed. Salim Hariri, Craig A. Lee, Cauligi S.
Raghavendra (2000), Kluwer Academic %C Norwell, MA %8 2000-01 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2000 %T A Portable Programming Interface for Performance Evaluation on Modern Processors %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A George Ho %A Phil Mucci %K papi %B The International Journal of High Performance Computing Applications %V 14 %P 189-204 %8 2000-09 %G eng %R https://doi.org/10.1177/109434200001400303 %0 Generic %D 2000 %T A Portable Programming Interface for Performance Evaluation on Modern Processors %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A Kevin London %A Phil Mucci %B University of Tennessee Computer Science Technical Report, UT-CS-00-444 %8 2000-07 %G eng %0 Conference Proceedings %B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing %D 2000 %T Request Sequencing: Optimizing Communication for the Grid %A Dorian Arnold %A Dieter Bachmann %A Jack Dongarra %K netsolve %B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing %I Springer Verlag %C Germany %V 1900 %P 1213-1222 %8 2000-01 %G eng %0 Conference Proceedings %B Proceedings of SuperComputing 2000 (SC'00) %D 2000 %T A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A Kevin London %A Phil Mucci %K papi %B Proceedings of SuperComputing 2000 (SC'00) %C Dallas, TX %8 2000-11 %G eng %0 Conference Proceedings %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %D 2000 %T Seamless Access to Adaptive Solver Algorithms %A Dorian Arnold %A Susan Blackford %A Jack Dongarra %A Victor Eijkhout %A Tinghua Xu %K netsolve %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %C Lausanne, Switzerland %8 2000-08 %G eng %0 Generic %D 2000 %T Secure Remote Access to Numerical Software and Computation Hardware %A Dorian Arnold %A Shirley Browne %A Jack Dongarra %A Graham Fagg %A Keith Moore %B University of Tennessee Computer Science Technical Report, UT-CS-00-446 %8 2000-07 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000 %D 2000 %T Secure Remote Access to Numerical Software and Computational Hardware %A Dorian Arnold %A Shirley Browne %A Jack Dongarra %A Graham Fagg %A Keith Moore %K netsolve %B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000 %C Albuquerque, NM %8 2000-06 %G eng %0 Journal Article %J Parallel Processing Letters %D 1999 %T Algorithmic Issues on Heterogeneous Computing Platforms %A Pierre Boulet %A Jack Dongarra %A Fabrice Rastello %A Yves Robert %A Frederic Vivien %B Parallel Processing Letters %V 9 %P 197-213 %8 1999-01 %G eng %0 Journal Article %J SIAM News %D 1999 %T Atlanta Organizers Put Mathematics to Work For the Math Sciences Community %A Michael Berry %A Jack Dongarra %B SIAM News %V 32 %8 1999-01 %G eng %0 Journal Article %J Future Generation Computer Systems %D 1999 %T Deploying Fault-tolerance and Task Migration with NetSolve %A Henri Casanova %A James Plank %A Micah Beck %A Jack Dongarra %K netsolve %B Future Generation Computer Systems %I Elsevier %V 15 %P 745-755 %8 1999-10 %G eng %0 Journal Article %J International Journal on Future Generation Computer Systems %D 1999 %T HARNESS: A Next Generation Distributed
Virtual Machine %A Micah Beck %A Jack Dongarra %A Graham Fagg %A Al Geist %A Paul Gray %A James Kohl %A Mauro Migliardi %A Keith Moore %A Terry Moore %A Philip Papadopoulos %A Stephen L. Scott %A Vaidy Sunderam %K harness %B International Journal on Future Generation Computer Systems %V 15 %P 571-582 %8 1999-01 %G eng %0 Generic %D 1999 %T IBP - Internet Backplane Protocol: Infrastructure for Distributed Storage (v 0.2) %A Wael Elwasif %A Micah Beck %A James Plank %B University of Tennessee Computer Science Department Technical Report %8 1999-02 %G eng %0 Journal Article %J Philadelphia: Society for Industrial and Applied Mathematics %D 1999 %T LAPACK Users' Guide, 3rd ed. %A Ed Anderson %A Zhaojun Bai %A Christian Bischof %A Susan Blackford %A James Demmel %A Jack Dongarra %A Jeremy Du Croz %A Anne Greenbaum %A Sven Hammarling %A Alan McKenney %A Danny Sorensen %B Philadelphia: Society for Industrial and Applied Mathematics %8 1999-01 %G eng %0 Journal Article %J Computer Communications %D 1999 %T Logistical Quality of Service in NetSolve %A Micah Beck %A Henri Casanova %A Jack Dongarra %A Terry Moore %A James Plank %A Francine Berman %A Rich Wolski %K netsolve %B Computer Communications %V 22 %P 1034-1044 %8 1999-01 %G eng %0 Journal Article %J IEEE Cluster Computing BOF at SC99 %D 1999 %T Numerical Libraries and Tools for Scalable Parallel Cluster Computing %A Shirley Browne %A Jack Dongarra %A Anne Trefethen %B IEEE Cluster Computing BOF at SC99 %C Portland, Oregon %8 1999-01 %G eng %0 Conference Proceedings %B Proceedings of Department of Defense HPCMP Users Group Conference %D 1999 %T PAPI: A Portable Interface to Hardware Performance Counters %A Shirley Browne %A Christine Deane %A George Ho %A Phil Mucci %K papi %B Proceedings of Department of Defense HPCMP Users Group Conference %8 1999-06 %G eng %0 Conference Proceedings %B 4th Intl. Web Caching Workshop %D 1999 %T Portable Representation of Internet Content Channels in I2-DSI %A Micah Beck %A Rajeev Chawla %A Bert Dempsey %A Terry Moore %B 4th Intl. Web Caching Workshop %C San Diego, CA %8 1999-03 %G eng %0 Journal Article %J Parallel Computing %D 1999 %T Static Tiling for Heterogeneous Computing Platforms %A Pierre Boulet %A Jack Dongarra %A Yves Robert %A Frederic Vivien %B Parallel Computing %V 25 %P 547-568 %8 1999-01 %G eng %0 Journal Article %J D-Lib Magazine %D 1998 %T National HPCC Software Exchange (NHSE): Uniting the High Performance Computing and Communications Community %A Shirley Browne %A Jack Dongarra %A Jeff Horner %A Paul McMahan %A Scott Wells %K rib %B D-Lib Magazine %8 1998-01 %G eng