%0 Generic
%D 2024
%T CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT)
%A Maksim Melnichenko
%A Oleg Balabanov
%A Riley Murray
%A James Demmel
%A Michael W. Mahoney
%A Piotr Luszczek
%X This paper develops and analyzes a new algorithm for QR decomposition with column pivoting (QRCP) of rectangular matrices with large row counts. The algorithm combines methods from randomized numerical linear algebra in a particularly careful way in order to accelerate both pivot decisions for the input matrix and the process of decomposing the pivoted matrix into the QR form. The source of the latter acceleration is a use of randomized preconditioning and CholeskyQR. Comprehensive analysis is provided in both exact and finite-precision arithmetic to characterize the algorithm's rank-revealing properties and its numerical stability granted probabilistic assumptions of the sketching operator. An implementation of the proposed algorithm is described and made available inside the open-source RandLAPACK library, which itself relies on RandBLAS - also available in open-source format. Experiments with this implementation on an Intel Xeon Gold 6248R CPU demonstrate order-of-magnitude speedups relative to LAPACK's standard function for QRCP, and comparable performance to a specialized algorithm for unpivoted QR of tall matrices, which lacks the strong rank-revealing properties of the proposed method.
%I arXiv
%8 2024-02
%G eng
%U https://arxiv.org/abs/2311.08316

%0 Journal Article
%J Physical Chemistry Chemical Physics
%D 2024
%T Economical Quasi-Newton Unitary Optimization of Electronic Orbitals
%A Slattery, Samuel A
%A Surjuse, Kshitijkumar A
%A Peterson, Charles
%A Penchoff, Deborah A
%A Valeev, Edward
%X We present an efficient quasi-Newton orbital solver optimized to reduce the number of gradient evaluations and other computational steps of comparable cost. The solver optimizes orthogonal orbitals by sequences of unitary rotations generated by the (preconditioned) limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm equipped with trust-region step restriction. The low-rank structure of the L-BFGS inverse Hessian is exploited when solving the trust-region problem. The efficiency of the proposed "Quasi-Newton Unitary Optimization with Trust-Region" (QUOTR) solver is compared to that of the standard Roothaan-Hall approach accelerated by the Direct Inversion of Iterative Subspace (DIIS), and other exact and approximate Newton solvers for mean-field (Hartree-Fock and Kohn-Sham) problems.
%B Physical Chemistry Chemical Physics
%8 2023-12
%G eng
%U http://pubs.rsc.org/en/Content/ArticleLanding/2024/CP/D3CP05557
%! Phys. Chem. Chem. Phys.
%R 10.1039/D3CP05557D

%0 Generic
%D 2024
%T XaaS: Acceleration as a Service to Enable Productive High-Performance Cloud Computing
%A Torsten Hoefler
%A Marcin Copik
%A Pete Beckman
%A Andrew Jones
%A Ian Foster
%A Manish Parashar
%A Daniel Reed
%A Matthias Troyer
%A Thomas Schulthess
%A Dan Ernst
%A Jack Dongarra
%X HPC and Cloud have evolved independently, specializing their innovations into performance or productivity. Acceleration as a Service (XaaS) is a recipe to empower both fields with a shared execution platform that provides transparent access to computing resources, regardless of the underlying cloud or HPC service provider. Bridging HPC and cloud advancements, XaaS presents a unified architecture built on performance-portable containers. Our converged model concentrates on low-overhead, high-performance communication and computing, targeting resource-intensive workloads from climate simulations to machine learning. XaaS lifts the restricted allocation model of Function-as-a-Service (FaaS), allowing users to benefit from the flexibility and efficient resource utilization of serverless while supporting long-running and performance-sensitive workloads from HPC.
%I arXiv
%8 2024-01
%G eng
%U https://arxiv.org/abs/2401.04552

%0 Conference Paper
%B Lecture Notes in Computer Science
%D 2023
%T AI Benchmarking for Science: Efforts from the MLCommons Science Working Group
%A Thiyagalingam, Jeyan
%A von Laszewski, Gregor
%A Yin, Junqi
%A Emani, Murali
%A Papay, Juri
%A Barrett, Gregg
%A Luszczek, Piotr
%A Tsaris, Aristeidis
%A Kirkpatrick, Christine
%A Wang, Feiyi
%A Gibbs, Tom
%A Vishwanath, Venkatram
%A Shankar, Mallikarjun
%A Fox, Geoffrey
%A Hey, Tony
%E Anzt, Hartwig
%E Bienz, Amanda
%E Luszczek, Piotr
%E Baboulin, Marc
%X With machine learning (ML) becoming a transformative tool for science, the scientific community needs a clear catalogue of ML techniques, and their relative benefits on various scientific problems, if it is to make significant advances in science using AI. Although this comes under the purview of benchmarking, conventional benchmarking initiatives are focused on performance, and as such, science often becomes a secondary criterion. In this paper, we describe a community effort from a working group, namely the MLCommons Science Working Group, in developing science-specific AI benchmarking for the international scientific community. Since the inception of the working group in 2020, the group has worked very collaboratively with a number of national laboratories, academic institutions and industries, across the world, and has developed four science-specific AI benchmarks. We will describe the overall process, the resulting benchmarks along with some initial results. We foresee that this initiative is likely to be very transformative for the AI for Science, and for performance-focused communities.
%B Lecture Notes in Computer Science
%I Springer International Publishing
%V 13387
%P 47 - 64
%8 2023-01
%@ 978-3-031-23219-0
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-031-23220-6_4
%R 10.1007/978-3-031-23220-6_4

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2023
%T Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors
%A Deshmukh, Sameer
%A Yokota, Rio
%A Bosilca, George
%X Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on three CPUs using diverse ISAs – the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and AMD EPYC 7502 using AVX-2, and show that our new batching methodology is able to obtain more than twice the throughput of vendor optimized libraries for all CPU architectures and problem sizes.
%B ACM Transactions on Mathematical Software
%V 49
%P 1 - 29
%8 2023-09
%G eng
%U https://dl.acm.org/doi/10.1145/3595178
%N 3
%! ACM Trans. Math. Softw.
%R 10.1145/3595178

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2023
%T Combining multitask and transfer learning with deep Gaussian processes for autotuning-based performance engineering
%A Luszczek, Piotr
%A Sid-Lakhdar, Wissam M
%A Dongarra, Jack
%X We combine deep Gaussian processes (DGPs) with multitask and transfer learning for the performance modeling and optimization of HPC applications. Deep Gaussian processes merge the uncertainty quantification advantage of Gaussian processes (GPs) with the predictive power of deep learning. Multitask and transfer learning allow for improved learning efficiency when several similar tasks are to be learned simultaneously and when previously learned models are sought to help in the learning of new tasks, respectively. A comparison with state-of-the-art autotuners shows the advantage of our approach on two application problems. In this article, we combine DGPs with multitask and transfer learning to allow for both an improved tuning of an application's parameters on problems of interest and the prediction of parameters on any potential problem the application might encounter.
%B The International Journal of High Performance Computing Applications
%8 2023-03
%G eng
%U http://journals.sagepub.com/doi/10.1177/10943420231166365
%! The International Journal of High Performance Computing Applications
%R 10.1177/10943420231166365

%0 Journal Article
%J Journal of Chemical Theory and Computation
%D 2023
%T Direct Determination of Optimal Real-Space Orbitals for Correlated Electronic Structure of Molecules
%A Valeev, Edward F.
%A Harrison, Robert J.
%A Holmes, Adam A.
%A Peterson, Charles C.
%A Penchoff, Deborah A.
%X We demonstrate how to determine numerically nearly exact orthonormal orbitals that are optimal for the evaluation of the energy of arbitrary (correlated) states of atoms and molecules by minimization of the energy Lagrangian. Orbitals are expressed in real space using a multiresolution spectral element basis that is refined adaptively to achieve the user-specified target precision while avoiding the ill-conditioning issues that plague AO basis set expansions traditionally used for correlated models of molecular electronic structure. For light atoms, the orbital solver, in conjunction with a variational electronic structure model [selected Configuration Interaction (CI)] provides energies of comparable precision to a state-of-the-art atomic CI solver. The computed electronic energies of atoms and molecules are significantly more accurate than the counterparts obtained with the orbital sets of the same rank expanded in Gaussian AO bases, and can be determined even when linear dependence issues preclude the use of the AO bases. It is feasible to optimize more than 100 fully correlated numerical orbitals on a single computer node, and significant room exists for additional improvement. These findings suggest that real-space orbital representations might be the preferred alternative to AO representations for high-end models of correlated electronic states of molecules and materials.
%B Journal of Chemical Theory and Computation
%V 19
%P 7230 - 7241
%8 2023-10
%G eng
%U https://pubs.acs.org/doi/10.1021/acs.jctc.3c00732
%N 20
%! J. Chem. Theory Comput.
%R 10.1021/acs.jctc.3c00732

%0 Conference Paper
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%D 2023
%T O(N) distributed direct factorization of structured dense matrices using runtime systems
%A Sameer Deshmukh
%A Rio Yokota
%A George Bosilca
%A Qinxiang Ma
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%I ACM
%C Salt Lake City, Utah
%8 2023-08
%@ 9798400708435
%G eng
%U https://dl.acm.org/doi/proceedings/10.1145/3605573
%R 10.1145/3605573.3605606

%0 Generic
%D 2023
%T Earth Virtualization Engines - A Technical Perspective
%A Torsten Hoefler
%A Bjorn Stevens
%A Andreas F. Prein
%A Johanna Baehr
%A Thomas Schulthess
%A Thomas F. Stocker
%A John Taylor
%A Daniel Klocke
%A Pekka Manninen
%A Piers M. Forster
%A Tobias Kölling
%A Nicolas Gruber
%A Hartwig Anzt
%A Claudia Frauen
%A Florian Ziemen
%A Milan Klöwer
%A Karthik Kashinath
%A Christoph Schär
%A Oliver Fuhrer
%A Bryan N. Lawrence
%X Participants of the Berlin Summit on Earth Virtualization Engines (EVEs) discussed ideas and concepts to improve our ability to cope with climate change. EVEs aim to provide interactive and accessible climate simulations and data for a wide range of users. They combine high-resolution physics-based models with machine learning techniques to improve the fidelity, efficiency, and interpretability of climate projections. At their core, EVEs offer a federated data layer that enables simple and fast access to exabyte-sized climate data through simple interfaces. In this article, we summarize the technical challenges and opportunities for developing EVEs, and argue that they are essential for addressing the consequences of climate change.
%8 2023-09
%G eng
%U https://arxiv.org/abs/2309.09002

%0 Conference Paper
%B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%D 2023
%T Elastic deep learning through resilient collective operations
%A Li, Jiali
%A Bosilca, George
%A Bouteiller, Aurélien
%A Nicolae, Bogdan
%X We present a robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI resilient capabilities, a.k.a. User-Level Failure Mitigation (ULFM), this novel approach promotes efficient and lightweight failure management and encourages smooth scaling in volatile computational settings. The proposed ULFM MPI-centered mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the capability of MPI extension to deal with resiliency, and promote ULFM as an effective technique for fault management, minimizing downtime, and thereby enhancing the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications.
%B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%I ACM
%C Denver, CO
%8 2023-11
%@ 9798400707858
%G eng
%U https://dl.acm.org/doi/abs/10.1145/3624062.3626080
%R 10.1145/3624062.3626080

%0 Generic
%D 2023
%T Generalizing Random Butterfly Transforms to Arbitrary Matrix Sizes
%A Neil Lindquist
%A Piotr Luszczek
%A Jack Dongarra
%X Parker and Lê introduced random butterfly transforms (RBTs) as a preprocessing technique to replace pivoting in dense LU factorization. Unfortunately, their FFT-like recursive structure restricts the dimensions of the matrix. Furthermore, on multi-node systems, efficient management of the communication overheads restricts the matrix's distribution even more. To remove these limitations, we have generalized the RBT to arbitrary matrix sizes by truncating the dimensions of each layer in the transform. We expanded Parker's theoretical analysis to generalized RBT, specifically that in exact arithmetic, Gaussian elimination with no pivoting will succeed with probability 1 after transforming a matrix with full-depth RBTs. Furthermore, we experimentally show that these generalized transforms improve performance over Parker's formulation by up to 62% while retaining the ability to replace pivoting. This generalized RBT is available in the SLATE numerical software library.
%I arXiv
%8 2023-12
%G eng
%U https://arxiv.org/abs/2312.09376

%0 Conference Paper
%B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%D 2023
%T GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure
%A Abdelfattah, Ahmad
%A Tomov, Stanimire
%A Luszczek, Piotr
%A Anzt, Hartwig
%A Dongarra, Jack
%X This paper presents a portable and performance-efficient approach to solve a batch of linear systems of equations using Graphics Processing Units (GPUs). Each system is represented using a special type of matrix with a band structure above and/or below the diagonal. Each matrix is factorized using an LU factorization with partial pivoting for numerical stability. Subsequently, the factors are used to find the solution for as many right hand sides as needed. The width of the band is often small enough that performing a fully dense LU factorization results in poor performance. We follow the standard LAPACK specifications for addressing this type of problem and develop a dedicated solver that runs efficiently on GPUs. No similar solver is currently available in the vendor’s software stack, so performance results are shown on both NVIDIA and AMD GPUs relative to a parallel CPU solution utilizing OpenMP for thread-level parallelization.
%B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%I ACM
%C Denver, CO
%8 2023-11
%@ 9798400707858
%G eng
%U https://dl.acm.org/doi/abs/10.1145/3624062.3624247
%R 10.1145/3624062.3624247

%0 Journal Article
%J Communications of the ACM
%D 2023
%T HPC Forecast: Cloudy and Uncertain
%A Reed, Daniel
%A Gannon, Dennis
%A Dongarra, Jack
%X An examination of how the technology landscape has changed and possible future directions for HPC operations and innovation.
%B Communications of the ACM
%V 66
%P 82 - 90
%8 2023-01
%G eng
%U https://dl.acm.org/doi/pdf/10.1145/3552309
%N 2
%! Commun. ACM
%R 10.1145/3552309

%0 Conference Paper
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%D 2023
%T Improving the Scaling of an Asynchronous Many-Task Runtime with a Lightweight Communication Engine
%A Omri Mor
%A George Bosilca
%A Marc Snir
%K asynchronous many-task
%K dynamic runtime
%K lightweight communication
%K low-rank Cholesky
%K message-passing
%K MPI
%K strong scaling
%X There is a growing interest in Asynchronous Many-Task (AMT) runtimes as an efficient way to map irregular and dynamic parallel applications onto heterogeneous computing resources. In this work, we show that AMTs nonetheless struggle with communication bottlenecks when scaling computations strongly and that the design of commonly-used communication libraries such as MPI contributes to these bottlenecks. We replace MPI with LCI, a Lightweight Communication Interface that is designed for dynamic, asynchronous frameworks, as the communication layer for the PaRSEC runtime. The result is a significant reduction of end-to-end latency in communication microbenchmarks and a reduction of overall time-to-solution by up to 12% in HiCMA, a tile-based low-rank Cholesky factorization package.
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%I ACM
%C Salt Lake City, Utah
%8 2023-09
%G eng
%U http://snir.cs.illinois.edu/listed/icpp2023-69.pdf
%R 10.1145/3605573.3605642

%0 Generic
%D 2023
%T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements
%A Daniel Barry
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%I 28th HIPS Workshop
%C St. Petersburg, FL
%8 2023-05
%G eng

%0 Conference Proceedings
%B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2023
%T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements
%A Daniel Barry
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%K GPU power
%K High Performance Computing
%K network traffic
%K papi
%K performance analysis
%K Performance Counters
%X Some of the most important categories of performance events count the data traffic between the processing cores and the main memory. However, since these counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP); however, doing so adds an indirection layer that involves querying the PCP daemon. This paper performs a quantitative study of the accuracy of the measurements obtained through this component on the Summit supercomputer. We use two linear algebra kernels (a generalized matrix multiply and a modified matrix-vector multiply) as benchmarks and a distributed, GPU-accelerated 3D-FFT mini-app (using cuFFT) to compare the measurements obtained through the PAPI PCP component against the expected values across different problem sizes. We also compare our measurements against an in-house machine with a very similar architecture to Summit, where elevated privileges allow PAPI to access the hardware counters directly (without using PCP) to show that measurements taken via PCP are as accurate as those taken directly. Finally, using both QMCPACK and the 3D-FFT, we demonstrate the diverse hardware activities that can be monitored simultaneously via PAPI hardware components.
%B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C St. Petersburg, Florida
%8 2023-08
%G eng
%U https://ieeexplore.ieee.org/document/10196656
%R 10.1109/IPDPSW59300.2023.00070

%0 Conference Paper
%B Parallel Processing and Applied Mathematics (PPAM 2022)
%D 2023
%T Mixed Precision Algebraic Multigrid on GPUs
%A Tsai, Yu-Hsiang Mike
%A Natalie Beams
%A Anzt, Hartwig
%E Wyrzykowski, Roman
%E Dongarra, Jack
%E Deelman, Ewa
%E Karczewski, Konrad
%K Algebraic multigrid
%K GPUs
%K mixed precision
%K Portability
%X In this paper, we present the first GPU-native platform-portable algebraic multigrid (AMG) implementation that allows the user to use different precision formats for the distinct multigrid levels. The AMG we present uses an aggregation size 2 parallel graph match as the AMG coarsening strategy. The implementation provides a high level of flexibility in terms of configuring the bottom-level solver and the precision format for the distinct levels. We present convergence and performance results on the GPUs from AMD, Intel, and NVIDIA, and compare against corresponding functionality available in other libraries.
%B Parallel Processing and Applied Mathematics (PPAM 2022)
%I Springer International Publishing
%C Cham
%V 13826
%8 2023-04
%@ 978-3-031-30441-5
%G eng
%U https://link.springer.com/10.1007/978-3-031-30442-2
%R 10.1007/978-3-031-30442-2_9

%0 Conference Paper
%B Sustained Simulation Performance 2021
%D 2023
%T MPI Continuations And How To Invoke Them
%A Schuchart, Joseph
%A George Bosilca
%X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting the asynchrony of separate but dependent parts of an application coupled by the start and completion of a communication operation. Correctly and efficiently handling MPI communication in different APM models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we discuss different variations of the callback signature and how to best pass data from the code starting the communication operation to the code reacting to its completion. We establish three requirements (efficiency, usability, safety) and evaluate different variations against them. Finally, we find that the current choice is not the best design in terms of both efficiency and safety and propose a simpler, possibly more efficient and safe interface. We also show how the transfer of information into the continuation callback can be largely automated using C++ lambda captures.
%B Sustained Simulation Performance 2021
%I Springer International Publishing
%C Cham
%P 67 - 83
%8 2023-02
%@ 978-3-031-18045-3
%G eng
%U https://link.springer.com/10.1007/978-3-031-18046-0
%R 10.1007/978-3-031-18046-0_5

%0 Conference Paper
%B 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2023
%T PAQR: Pivoting Avoiding QR factorization
%A Sid-Lakhdar, Wissam
%A Cayrols, Sebastien
%A Bielich, Daniel
%A Abdelfattah, Ahmad
%A Luszczek, Piotr
%A Gates, Mark
%A Tomov, Stanimire
%A Johansen, Hans
%A Williams-Young, David
%A Davis, Timothy
%A Dongarra, Jack
%A Anzt, Hartwig
%B 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C St. Petersburg, FL, USA
%G eng
%U https://ieeexplore.ieee.org/document/10177407/
%R 10.1109/IPDPS54959.2023.00040

%0 Conference Paper
%B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%D 2023
%T Parallel Symbolic Cholesky Factorization
%A Ribizel, Tobias
%A Anzt, Hartwig
%X We present a hybrid sequential/parallel symbolic Cholesky factorization algorithm that computes the sparsity pattern of the symbolic factors in parallel. We evaluate the performance on a large subset of the SuiteSparse matrix collection and multicore CPUs as well as flagship GPUs by AMD and NVIDIA, achieving speedups of an order of magnitude compared to a state-of-the-art sequential symbolic Cholesky factorization.
%B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%I ACM
%C Denver, CO
%8 2023-11
%@ 9798400707858
%G eng
%U https://dl.acm.org/doi/proceedings/10.1145/3624062
%R 10.1145/3624062.3624253

%0 Conference Paper
%B 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops)
%D 2023
%T Performance Insights into Device-initiated RMA Using Kokkos Remote Spaces
%A Mishler, Daniel
%A Ciesko, Jan
%A Olivier, Stephen
%A Bosilca, George
%X Achieving scalable performance on supercomputers requires careful coordination of communication and computation. Often, MPI applications rely on buffering, packing, and sorting techniques to accommodate a two-sided API, minimize communication overhead, and achieve performance goals. As interconnects between accelerators become more performant and scalable, programming models such as SHMEM may have the opportunity to enable bandwidth maximization along with ease of programming. In this work, we take a closer look at device-initiated PGAS programming models using NVIDIA Corp’s NVSHMEM communication library and our interface through the Kokkos Remote Spaces project. We show that benchmarks can benefit from this programming model in terms of performance and programmability. We anticipate similar results for miniapplications.
%B 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops)
%I IEEE
%C Santa Fe, NM, USA
%8 2023-11
%G eng
%U https://ieeexplore.ieee.org/document/10321871/
%R 10.1109/CLUSTERWorkshops61457.2023.00028

%0 Conference Paper
%B Smoky Mountains Computational Sciences and Engineering Conference
%D 2023
%T Preconditioners for Batched Iterative Linear Solvers on GPUs
%A Aggarwal, Isha
%A Nayak, Pratik
%A Kashi, Aditya
%A Anzt, Hartwig
%E Kothe, Doug
%E Geist, Al
%E Pophale, Swaroop
%E Liu, Hong
%E Parete-Koon, Suzanne
%X Batched iterative solvers can be an attractive alternative to batched direct solvers if the linear systems allow for fast convergence. In non-batched settings, iterative solvers are often enhanced with sophisticated preconditioners to improve convergence. In this paper, we develop preconditioners for batched iterative solvers that improve the iterative solver convergence without incurring detrimental resource overhead while preserving much of the iterative solver flexibility. We detail the design and implementation considerations, present a user-friendly interface to the batched preconditioners, and demonstrate the convergence and runtime benefits over non-preconditioned batched iterative solvers on state-of-the-art GPUs for a variety of benchmark problems from finite difference stencil matrices, the SuiteSparse matrix collection and a computational chemistry application.
%B Smoky Mountains Computational Sciences and Engineering Conference
%I Springer Nature Switzerland
%V 169075
%P 38 - 53
%8 2023-01
%@ 978-3-031-23605-1
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-031-23606-8_3
%R 10.1007/978-3-031-23606-8_3

%0 Conference Paper
%B 2023 IEEE International Conference on Cluster Computing (CLUSTER)
%D 2023
%T Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion
%A Cao, Qinglei
%A Abdulah, Sameh
%A Ltaief, Hatem
%A Genton, Marc G.
%A Keyes, David
%A Bosilca, George
%X The burgeoning interest in large-scale geospatial modeling, particularly within the domains of climate and weather prediction, underscores the concomitant critical importance of accuracy, scalability, and computational speed. Harnessing these complex simulations’ potential, however, necessitates innovative computational strategies, especially considering the increasing volume of data involved. Recent advancements in Graphics Processing Units (GPUs) have opened up new avenues for accelerating these modeling processes. In particular, their efficient utilization necessitates new strategies, such as mixed-precision arithmetic, that can balance the trade-off between computational speed and model accuracy. This paper leverages the PaRSEC runtime system and delves into the opportunities provided by mixed-precision arithmetic to expedite large-scale geospatial modeling in heterogeneous environments. By using an automated conversion strategy, our mixed-precision approach significantly improves computational performance (up to 3X) on the Summit supercomputer and reduces the associated energy consumption on various NVIDIA GPU generations. Importantly, this implementation ensures the requisite accuracy in environmental applications, a critical factor in their operational viability. The findings of this study bear significant implications for future research and development in high-performance computing, underscoring the transformative potential of mixed-precision arithmetic on GPUs in addressing the computational demands of large-scale geospatial modeling and making a stride toward a more sustainable, efficient, and accurate future in large-scale environmental applications.
%B 2023 IEEE International Conference on Cluster Computing (CLUSTER)
%I IEEE
%C Santa Fe, NM, USA
%8 2023-11
%G eng
%U https://ieeexplore.ieee.org/document/10319946/
%R 10.1109/CLUSTER52292.2023.00035

%0 Report
%D 2023
%T Revisiting I/O bandwidth-sharing strategies for HPC applications
%A Anne Benoit
%A Thomas Herault
%A Lucas Perotin
%A Yves Robert
%A Frederic Vivien
%K bandwidth sharing
%K HPC applications
%K I/O
%K scheduling strategy
%X This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (First-Come First-Served) or fair-sharing the bandwidth across them (FairShare). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms, and some of them more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely steady-state windows, which enables us to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that our simple and low-complexity greedy strategies significantly outperform First-Come First-Served, FairShare and I/O-Sets, and we recommend that the I/O community implement them for further assessment.
%B INRIA Research Report
%I INRIA
%8 2023-03
%G eng
%U https://hal.inria.fr/hal-04038011

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2023
%T Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors
%A José I. Aliaga
%A Hartwig Anzt
%A Enrique S. Quintana-Orti
%A Andres E. Thomas
%K graphics processing units
%K Singular value decomposition
%K sparse matrix-multivector product
%K sparse matrix-vector product
%X Many practical algorithms for numerical rank computations implement an iterative procedure that involves repeated multiplications of a vector, or a collection of vectors, with both a sparse matrix A and its transpose. Unfortunately, the realization of these sparse products on current high performance libraries often delivers much lower arithmetic throughput when the matrix involved in the product is transposed. In this work, we propose a hybrid sparse matrix layout, named CSRC, that combines the flexibility of some well-known sparse formats to offer a number of appealing properties: (1) CSRC can be obtained at low cost from the popular CSR (compressed sparse row) format; (2) CSRC has similar storage requirements as CSR; and especially, (3) the implementation of the sparse product kernels delivers high performance for both the direct product and its transposed variant on modern graphics accelerators thanks to a significant reduction of atomic operations compared to a conventional implementation based on CSR. This solution thus renders considerably higher performance when integrated into an iterative algorithm for the truncated singular value decomposition (SVD), such as the randomized SVD or, as demonstrated in the experimental results, the block Golub–Kahan–Lanczos algorithm.
%B Concurrency and Computation: Practice and Experience
%8 2023-08
%G eng
%U https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.7871
%! Concurrency and Computation
%R 10.1002/cpe.7871

%0 Conference Proceedings
%B EUROMPI '23: 30th European MPI Users' Group Meeting
%D 2023
%T Synchronizing MPI Processes in Space and Time
%A Schuchart, Joseph
%A Hunold, Sascha
%A Bosilca, George
%X Performance benchmarks are an integral part of the development and evaluation of parallel algorithms, both in distributed applications as well as MPI implementations themselves. The initial step of the benchmark process is to obtain a common timestamp to mark the start of an operation across all involved processes, and the state-of-the-art in many applications and widely used MPI benchmark suites is the use of MPI barriers. In this paper, we show that the synchronization in space provided by an MPI_Barrier is insufficient for proper benchmark results of parallel distributed algorithms, using MPI collective operations as examples. The resulting lack of a global start timestamp for an operation leads to skewed results, with a significant impact of the used barrier algorithm. In order to mitigate these issues, we propose and discuss the implementation of MPIX_Harmonize, which extends the synchronization in space provided by MPI_Barrier with a time synchronization to guarantee a common starting timestamp across all involved processes. By replacing the use of MPI_Barrier with MPIX_Harmonize, benchmark implementors can eliminate skews resulting from barrier algorithms and achieve stable performance benchmark results. We will show that the proper time synchronization can have significant impact on the benchmark results for various implementations of MPI_Allreduce, MPI_Reduce, and MPI_Bcast.
%I ACM
%C Bristol, United Kingdom
%8 2023-09
%@ 9798400709135
%G eng
%U https://dl.acm.org/doi/proceedings/10.1145/3615318
%R 10.1145/3615318.3615325

%0 Conference Paper
%B SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
%D 2023
%T Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators
%A Sukkari, Dalal
%A Gates, Mark
%A Al Farhan, Mohammed
%A Anzt, Hartwig
%A Dongarra, Jack
%X We investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, whose building blocks mainly consist of compute-bound matrix operations, allowing for high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library, which uses efficient techniques in dense linear algebra, such as 2D block cyclic data distribution and communication-avoiding algorithms, as well as modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and uses OpenMP tasks to track data dependencies. We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold performance speedup of the GPU accelerated implementation compared to the existing state-of-the-art implementation for the polar decomposition.
%I ACM
%C Denver, CO
%8 2023-11
%@ 9798400707858
%G eng
%U https://dl.acm.org/doi/proceedings/10.1145/3624062
%R 10.1145/3624062.3624248

%0 Journal Article
%J Future Generation Computer Systems
%D 2023
%T Three-precision algebraic multigrid on GPUs
%A Tsai, Yu-Hsiang Mike
%A Beams, Natalie
%A Anzt, Hartwig
%K Algebraic multigrid
%K GPUs
%K mixed precision
%K Portability
%X Recent research has demonstrated that using low precision inside some levels of an algebraic multigrid (AMG) solver can improve performance without negatively impacting the AMG quality. In this paper, we build upon previous research and implement an AMG that can use double, single, and half precision for the distinct multigrid levels. The implementation is platform-portable across GPU architectures from AMD, Intel, and NVIDIA. In an experimental analysis, we demonstrate that the use of half precision can be a viable option in multigrid. We evaluate the performance of different AMG configurations and demonstrate that mixed precision AMG can provide runtime savings compared to a double precision AMG.
%B Future Generation Computer Systems
%8 2023-07
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167739X23002741
%R 10.1016/j.future.2023.07.024

%0 Conference Paper
%B 37th ACM International Conference on Supercomputing (ICS'23)
%D 2023
%T Using Additive Modifications in LU Factorization Instead of Pivoting
%A Neil Lindquist
%A Piotr Luszczek
%A Jack Dongarra
%I ACM
%C Orlando, FL
%8 2023-06
%G eng
%R 10.1145/3577193.3593731

%0 Journal Article
%J Software: Practice and Experience
%D 2023
%T Using Ginkgo's memory accessor for improving the accuracy of memory‐bound low precision BLAS
%A Grützmacher, Thomas
%A Anzt, Hartwig
%A Quintana‐Ortí, Enrique S.
%B Software: Practice and Experience
%V 53
%P 81 - 98
%8 2023-01
%G eng
%U https://doi.org/10.1002/spe.3041
%N 1
%! Softw Pract Exp
%R 10.1002/spe.3041

%0 Conference Paper
%B Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop
%D 2023
%T When to checkpoint at the end of a fixed-length reservation?
%A Quentin Barbut
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%X This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a stochastic random variable that obeys some well-known probability distribution law. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet (to the best of our knowledge). We provide the optimal solution for a variety of probability distribution laws modeling checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute the optimal number of tasks before the application checkpoints at the beginning of the execution. Then, we design a dynamic strategy that decides whether to checkpoint or to continue executing at the end of each task. We instantiate this second scenario with several examples of probability distribution laws for task durations.
%C Denver, United States
%8 2023-08
%G eng
%U https://inria.hal.science/hal-04215554

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2022
%T Accelerating Geostatistical Modeling and Prediction With Mixed-Precision Computations: A High-Productivity Approach With PaRSEC
%A Abdulah, Sameh
%A Qinglei Cao
%A Pei, Yu
%A George Bosilca
%A Jack Dongarra
%A Genton, Marc G.
%A Keyes, David E.
%A Ltaief, Hatem
%A Sun, Ying
%K Computational modeling
%K Covariance matrices
%K Data models
%K Maximum likelihood estimation
%K Predictive models
%K runtime
%K Task analysis
%X Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data are assumed to possess properties of stationarity or non-stationarity via a kernel fitted to a covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum log-likelihood estimation (MLE), whose central data structure is a dense, symmetric positive definite covariance matrix of the dimension of the number of correlated observations. Two essential operations in MLE are the application of the inverse and evaluation of the determinant of the covariance matrix. These can be rendered through the Cholesky decomposition and triangular solution. In this contribution, we reduce the precision of weakly correlated locations to single- or half-precision based on distance. We thus exploit mathematical structure to migrate MLE to a three-precision approximation that takes advantage of contemporary architectures offering BLAS3-like operations in a single instruction that are extremely fast for reduced precision. We illustrate application-expected accuracy worthy of double-precision from a majority half-precision computation, in a context where uniform single-precision is by itself insufficient. In tackling the complexity and imbalance caused by the mixing of three precisions, we deploy the PaRSEC runtime system. PaRSEC delivers on-demand casting of precisions while orchestrating tasks and data movement in a multi-GPU distributed-memory environment within a tile-based Cholesky factorization. Application-expected accuracy is maintained while achieving up to 1.59X by mixing FP64/FP32 operations on 1536 nodes of HAWK or 4096 nodes of Shaheen II, and up to 2.64X by mixing FP64/FP32/FP16 operations on 128 nodes of Summit, relative to FP64-only operations. This translates into up to 4.5, 4.7, ...
%B IEEE Transactions on Parallel and Distributed Systems
%V 33
%P 964 - 976
%8 2022-04
%G eng
%U https://ieeexplore.ieee.org/document/9442267/
%N 4
%! IEEE Trans. Parallel Distrib. Syst.
%R 10.1109/TPDS.2021.3084071

%0 Conference Proceedings
%B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22)
%D 2022
%T Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers
%A Ahmad Abdelfattah
%A Pieter Ghysels
%A Wajih Boukaram
%A Stanimire Tomov
%A Xiaoye Sherry Li
%A Jack Dongarra
%K GPU computing
%K irregular computational workloads
%K LU factorization
%K multifrontal solvers
%K sparse direct solvers
%X Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs. This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem.
%I IEEE Computer Society
%C Dallas, TX
%P 354-367
%8 2022-11
%G eng
%U https://dl.acm.org/doi/abs/10.5555/3571885.3571919

%0 Generic
%D 2022
%T Analysis of the Communication and Computation Cost of FFT Libraries towards Exascale
%A Alan Ayala
%A Stanimire Tomov
%A Piotr Luszczek
%A Sebastien Cayrols
%A Gerald Ragghianti
%A Jack Dongarra
%B ICL Technical Report
%I Innovative Computing Laboratory
%8 2022-07
%G eng

%0 Book Section
%B Approximate Computing Techniques
%D 2022
%T Approximate Computing for Scientific Applications
%A Anzt, Hartwig
%A Casas, Marc
%A Malossi, Cristiano I.
%A Quintana-Ortí, Enrique S
%A Scheidegger, Florian
%A Zhuang, Sicong
%E Bosio, Alberto
%E Ménard, Daniel
%E Sentieys, Olivier
%X This chapter reviews the performance benefits that result from applying (software) approximate computing to scientific applications. For this purpose, we target two particular areas, linear algebra and deep learning, with the first one selected for being ubiquitous in scientific problems and the second one for its considerable and growing number of important applications both in industry and science. The review of linear algebra in scientific computing is focused on the iterative solution of sparse linear systems, exposing the prevalent costs of memory accesses in these methods, and demonstrating how approximate computing can help to reduce these overheads, for example, in the case of stationary solvers themselves or the application of preconditioners for the solution of sparse linear systems via Krylov subspace methods. The discussion of deep learning is focused on the use of approximate data transfer for cutting costs of host-to-device operations, as well as the use of adaptive precision for accelerating training of classical CNN architectures. Additionally we discuss model optimization and architecture search in presence of constraints for edge devices applications.
%7 322
%I Springer International Publishing
%P 415 - 465
%8 2022-01
%@ 978-3-030-94704-0
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-030-94705-7_14
%R 10.1007/978-3-030-94705-7_14

%0 Book Section
%B Lecture Notes in Computer Science
%D 2022
%T Batch QR Factorization on GPUs: Design, Optimization, and Tuning
%A Abdelfattah, Ahmad
%A Stanimire Tomov
%A Dongarra, Jack
%E Groen, Derek
%E de Mulatier, Célia
%E Paszyński, Maciej
%E Krzhizhanovskaya, Valeria V.
%E Dongarra, Jack J.
%E Sloot, Peter M. A.
%K Batch linear algebra
%K GPU computing
%K QR factorization
%X QR factorization of dense matrices is a ubiquitous tool in high performance computing (HPC). From solving linear systems and least squares problems to eigenvalue problems, and singular value decompositions, the impact of a high performance QR factorization is fundamental to computer simulations and many applications. More importantly, the QR factorization on a batch of relatively small matrices has acquired a lot of attention in sparse direct solvers and low-rank approximations for Hierarchical matrices. To address this interest and demand, we developed and present a high performance batch QR factorization for Graphics Processing Units (GPUs). We present a multi-level blocking strategy that adjusts various algorithmic designs to the size of the input matrices. We also show that following the LAPACK QR design convention, while still useful, is significantly outperformed by unconventional code structures that increase data reuse. The performance results show multi-fold speedups against the state of the art libraries on the latest GPU architectures from both NVIDIA and AMD.
%I Springer International Publishing
%C Cham
%V 13350
%8 2022-06
%@ 978-3-031-08750-9
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-031-08751-6_5
%R 10.1007/978-3-031-08751-6_5

%0 Conference Paper
%B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2022
%T Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations
%A Kashi, Aditya
%A Nayak, Pratik
%A Kulkarni, Dhruva
%A Scheinberg, Aaron
%A Lin, Paul
%A Anzt, Hartwig
%X Batched linear solvers, which solve many small related but independent problems, are important in several applications. This is increasingly the case for highly parallel processors such as graphics processing units (GPUs), which need a substantial amount of work to keep them operating efficiently and solving smaller problems one-by-one is not an option. Because of the small size of each problem, the task of coming up with a parallel partitioning scheme and mapping the problem to hardware is not trivial. In recent history, significant attention has been given to batched dense linear algebra. However, there is also an interest in utilizing sparse iterative solvers in a batched form, and this presents further challenges. An example use case is found in a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created for facilitating optimizations and porting to GPUs. The current collision kernel linear solver does not run on the GPU, a major bottleneck. As these matrices are well-conditioned, batched iterative sparse solvers are an attractive option. A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how the software architecture can be used to develop an efficient solution for the XGC collision proxy app. Comparisons for the solve times on NVIDIA V100 and A100 GPUs and AMD MI100 GPUs with one dual-socket Intel Xeon Skylake CPU node with 40 OpenMP threads are presented for matrices representative of those required in the collision kernel of XGC. The results suggest that Ginkgo's batched sparse iterative solvers are well suited for efficient utilization of the GPU for this problem, and the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution for exascale oriented heterogeneous architectures at the various leadership supercomputing facilities.
%I IEEE
%C Lyon, France
%8 2022-07
%G eng
%U https://ieeexplore.ieee.org/document/9820663
%R 10.1109/IPDPS53621.2022.00024

%0 Conference Proceedings
%B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing
%D 2022
%T Checkpointing à la Young/Daly: An Overview
%A Anne Benoit
%A Yishu Du
%A Thomas Herault
%A Loris Marchal
%A Guillaume Pallez
%A Lucas Perotin
%A Yves Robert
%A Hongyang Sun
%A Frederic Vivien
%X The Young/Daly formula provides an approximation of the optimal checkpoint period for a parallel application executing on a supercomputing platform. The Young/Daly formula was originally designed for preemptible tightly-coupled applications. We provide some background and survey various application scenarios to assess the usefulness and limitations of the formula.
%I ACM Press
%C Noida, India
%P 701-710
%8 2022-08
%@ 9781450396752
%G eng
%U https://dl.acm.org/doi/fullHtml/10.1145/3549206.3549328
%R 10.1145/3549206.3549328

%0 Generic
%D 2022
%T Communication Avoiding LU with Tournament Pivoting in SLATE
%A Rabab Alomairy
%A Mark Gates
%A Sebastien Cayrols
%A Dalal Sukkari
%A Kadir Akbudak
%A Asim YarKhan
%A Paul Bagwell
%A Jack Dongarra
%B SLATE Working Notes
%8 2022-01
%G eng

%0 Journal Article
%J International Journal of Networking and Computing
%D 2022
%T Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms
%A George Bosilca
%A Bouteiller, Aurélien
%A Herault, Thomas
%A Le Fèvre, Valentin
%A Robert, Yves
%A Dongarra, Jack
%X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementation of these algorithms over a task-based runtime system, PaRSEC and show the advantages and limitations of each approach in a real implementation.
%B International Journal of Networking and Computing
%V 12
%P 26 - 46
%8 2022-01
%G eng
%U https://www.jstage.jst.go.jp/article/ijnc/12/1/12_26/_article
%N 1
%! IJNC
%R 10.15803/ijnc.12.1_26

%0 Conference Paper
%B 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)
%D 2022
%T Composition of Algorithmic Building Blocks in Template Task Graphs
%A Herault, Thomas
%A Schuchart, Joseph
%A Valeev, Edward F.
%A George Bosilca
%I IEEE
%C Dallas, TX, USA
%8 2023-01
%G eng
%U https://ieeexplore.ieee.org/document/10024647/
%R 10.1109/PAW-ATM56565.2022.00008

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2022
%T Compressed basis GMRES on high-performance graphics processing units
%A Aliaga, José I
%A Anzt, Hartwig
%A Grützmacher, Thomas
%A Quintana-Ortí, Enrique S
%A Andres E. Thomas
%X Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in current computer architectures, motivating the investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This article leverages Ginkgo’s memory accessor in order to integrate a communication-reduction strategy into the (Krylov) GMRES solver that decouples the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory accesses, the cost of the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a decrease in the volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating-point as well as fixed-point formats with mild impact on the convergence of the iterative process. We develop a high-performance implementation of the “compressed basis GMRES” solver in the Ginkgo sparse linear algebra library using a large set of test problems from the SuiteSparse Matrix Collection. We demonstrate robustness and performance advantages on a modern NVIDIA V100 graphics processing unit (GPU) of up to 50% over the standard GMRES solver that stores all data in IEEE double-precision.
%B The International Journal of High Performance Computing Applications
%8 2022-05
%G eng
%U http://journals.sagepub.com/doi/10.1177/10943420221115140
%! The International Journal of High Performance Computing Applications
%R 10.1177/10943420221115140

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2022
%T Compression and load balancing for efficient sparse matrix‐vector product on multicore processors and graphics processing units
%A Aliaga, José I.
%A Anzt, Hartwig
%A Grützmacher, Thomas
%A Quintana-Orti, Enrique S.
%A Andres E. Thomas
%B Concurrency and Computation: Practice and Experience
%V 34
%8 2022-06
%G eng
%U https://doi.org/10.1002/cpe.6515
%N 14
%! Concurrency and Computation
%R 10.1002/cpe.6515

%0 Journal Article
%J Journal of Computational Science
%D 2022
%T Computational science for a better future
%A Kovalchuk, Sergey V.
%A Krzhizhanovskaya, Valeria V.
%A Paszyński, Maciej
%A Kranzlmüller, Dieter
%A Dongarra, Jack
%A Sloot, Peter M.A.
%B Journal of Computational Science
%V 62
%P 101745
%8 2022-07
%G eng
%U https://www.sciencedirect.com/science/article/pii/S1877750322001351
%! Journal of Computational Science
%R 10.1016/j.jocs.2022.101745

%0 Conference Proceedings
%B 2022 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2022
%T Deep Gaussian process with multitask and transfer learning for performance optimization
%A Sid-Lakhdar, Wissam M.
%A Aznaveh, Mohsen
%A Luszczek, Piotr
%A Dongarra, Jack
%X We combine Deep Gaussian Processes with multitask and transfer learning for the performance modeling and optimization of HPC applications. Deep Gaussian processes merge the uncertainty quantification advantage of Gaussian Processes with the predictive power of deep learning. Multitask and transfer learning allow for improved learning efficiency when several similar tasks are to be learned simultaneously and when previous learned models are sought to help in the learning of new tasks, respectively. A comparison with state-of-the-art autotuners shows the advantage of our approach on two application problems.
%P 1-7
%8 2022-09
%G eng
%R 10.1109/HPEC55821.2022.9926396

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2022
%T Evaluating Data Redistribution in PaRSEC
%A Qinglei Cao
%A George Bosilca
%A Losada, Nuria
%A Wu, Wei
%A Zhong, Dong
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%V 33
%P 1856-1872
%8 2022-08
%G eng
%R 10.1109/TPDS.2021.3131657

%0 Journal Article
%J Journal of Radioanalytical and Nuclear Chemistry
%D 2022
%T Evaluations of molecular modeling and machine learning for predictive capabilities in binding of lanthanum and actinium with carboxylic acids
%A Penchoff, Deborah A.
%A Peterson, Charles C.
%A Wrancher, Eleigha M.
%A George Bosilca
%A Harrison, Robert J.
%A Valeev, Edward F.
%A Benny, Paul D.
%B Journal of Radioanalytical and Nuclear Chemistry
%8 2022-12
%G eng
%U https://rdcu.be/c2lGj
%! J Radioanal Nucl Chem
%R 10.1007/s10967-022-08620-7

%0 Journal Article
%J Communications of the ACM
%D 2022
%T The evolution of mathematical software
%A Dongarra, Jack
%B Communications of the ACM
%V 65
%P 66 - 72
%8 2022-12
%G eng
%U https://dl.acm.org/doi/10.1145/3554977
%N 12
%! Commun. ACM
%R 10.1145/3554977

%0 Generic
%D 2022
%T Extending MAGMA Portability with OneAPI
%A Anna Fortenberry
%A Stanimire Tomov
%A Kwai Wong
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), ACM Student Research Competition
%C Dallas, TX
%8 2022-11
%G eng
%U https://sc22.supercomputing.org/proceedings/src_poster/poster_files/spostu105s3-file1.pdf

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022)
%D 2022
%T Extending MAGMA Portability with OneAPI
%A Anna Fortenberry
%A Stanimire Tomov
%C Dallas, TX
%8 2022-11
%G eng

%0 Generic
%D 2022
%T FFT Benchmark Performance Experiments on Systems Targeting Exascale
%A Alan Ayala
%A Stanimire Tomov
%A Piotr Luszczek
%A Sebastien Cayrols
%A Gerald Ragghianti
%A Jack Dongarra
%B ICL Technical Report
%8 2022-03
%G eng

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2022
%T A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization
%A Qinglei Cao
%A Rabab Alomairy
%A Yu Pei
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%8 2022-07
%G eng
%U https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9820680&isnumber=9820610
%R 10.1109/IPDPS53621.2022.00047

%0 Conference Paper
%B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2022
%T Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment
%A Schuchart, Joseph
%A Nookala, Poornima
%A Javanmard, Mohammad Mahdi
%A Herault, Thomas
%A Valeev, Edward F.
%A George Bosilca
%A Harrison, Robert J.
%X We present and evaluate TTG, a novel programming model and its C++ implementation that by marrying the ideas of control and data flowgraph programming supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared memory parallel environments; a few support distributed memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed memory execution over 2 different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to the state-of-the-art implementations.
%B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Lyon, France
%8 2022-07
%G eng
%U https://ieeexplore.ieee.org/abstract/document/9820613
%R 10.1109/IPDPS53621.2022.00086

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2022
%T Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing
%A Anzt, Hartwig
%A Cojean, Terry
%A Flegar, Goran
%A Göbel, Fritz
%A Grützmacher, Thomas
%A Nayak, Pratik
%A Ribizel, Tobias
%A Tsai, Yuhsiang Mike
%A Quintana-Ortí, Enrique S
%X In this article, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo’s design principle abstracts all functionality as “linear operators,” motivating the notation of a “linear operator algebra library.” Ginkgo’s current focus is oriented toward providing sparse linear algebra functionality for high performance graphics processing unit (GPU) architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific backends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo’s usability by providing examples on how to use its functionality inside the MFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo’s high performance on state-of-the-art GPU architectures.
%B ACM Transactions on Mathematical Software
%V 48
%P 1 - 33
%8 2022-03
%G eng
%U https://dl.acm.org/doi/10.1145/3480935
%N 12
%! ACM Trans. Math. Softw.
%R 10.1145/3480935

%0 Journal Article
%J Parallel Computing
%D 2022
%T Ginkgo—A math library designed for platform portability
%A Terry Cojean
%A Yu-Hsiang Mike Tsai
%A Hartwig Anzt
%K AMD
%K Intel
%K nVidia
%K performance portability
%K Platform Portability
%K Porting to GPU accelerators
%X In an era of increasing computer system diversity, the portability of software from one system to another plays a central role. Software portability is important for the software developers as many software projects have a lifetime longer than a specific system, e.g., a supercomputer, and it is important for the domain scientists that realize their scientific application in a software framework and want to be able to run on one or another system. On a high level, there exist two approaches for realizing platform portability: (1) implementing software using a portability layer leveraging any technique which always generates specific kernels from another language or through an interface for running on different architectures; and (2) providing backends for different hardware architectures, with the backends typically differing in how and in which programming language functionality is realized due to using the language of choice for each hardware (e.g., CUDA kernels for NVIDIA GPUs, SYCL (DPC++) kernels targeting Intel GPUs and other supported hardware, …). In practice, these two approaches can be combined in applications to leverage their respective strengths. In this paper, we present how we realize portability across different hardware architectures for the Ginkgo library by following the second strategy and the goal to not only port to new hardware architectures but also achieve good performance. We present the Ginkgo library design, separating algorithms from hardware-specific kernels forming the distinct hardware executors, and report our experience when adding execution backends for NVIDIA, AMD, and Intel GPUs. We also present the performance we achieve with this approach for distinct hardware backends.
%B Parallel Computing
%V 111
%P 102902
%8 2022-02
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819122000096
%R 10.1016/j.parco.2022.102902

%0 Conference Paper
%B 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
%D 2022
%T Implicit Actions and Non-blocking Failure Recovery with MPI
%A Bouteiller, Aurélien
%A George Bosilca
%X Scientific applications have long embraced the MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).
%B 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
%I IEEE
%C Dallas, TX, USA
%8 2023-01
%G eng
%U https://ieeexplore.ieee.org/document/10024038/
%R 10.1109/FTXS56515.2022.00009

%0 Conference Proceedings
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022)
%D 2022
%T Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach
%A Whitlock, Matthew
%A Morales, Nicolas
%A George Bosilca
%A Bouteiller, Aurélien
%A Nicolae, Bogdan
%A Teranishi, Keita
%A Giem, Elisabeth
%A Sarkar, Vivek
%K checkpointing
%K Fault tolerance
%K Fenix
%K HPC
%K Kokkos
%K MPI-ULFM
%K resilience
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022)
%C Heidelberg, Germany
%8 2022-09
%G eng
%U https://hal.archives-ouvertes.fr/hal-03772536

%0 Conference Proceedings
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER)
%D 2022
%T Lossy all-to-all exchange for accelerating parallel 3-D FFTs on hybrid architectures with GPUs
%A Cayrols, Sebastien
%A Li, Jiali
%A George Bosilca
%A Stanimire Tomov
%A Ayala, Alan
%A Dongarra, Jack
%X In the context of parallel applications, communication is a critical part of the infrastructure and a potential bottleneck. The traditional approach to tackle communication challenges consists of redesigning algorithms so that the complexity or the communication volume is reduced. However, there are algorithms like the Fast Fourier Transform (FFT) where reducing the volume of communication is very challenging yet can reap large benefits in terms of time-to-completion. In this paper, we revisit the implementation of the MPI all-to-all routine at the core of 3D FFTs by using advanced MPI features, such as One-Sided Communication, and integrate data compression during communication to reduce the volume of data exchanged. Since some compression techniques are ‘lossy’ in the sense that they involve a loss of accuracy, we study the impact of lossy compression in heFFTe, the state-of-the-art FFT library for large scale 3D FFTs on hybrid architectures with GPUs. Consequently, we design an approximate FFT algorithm that trades off user-controlled accuracy for speed. We show that we speed up the 3D FFTs proportionally to the compression rate. In terms of accuracy, comparing our approach with a reduced precision execution, where both the data and the computation are in reduced precision, we show that when the volume of communication is compressed to the size of the reduced precision data, the approximate FFT algorithm is as fast as the one in reduced precision while the accuracy is one order of magnitude better.
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER)
%P 152-160
%8 2022-09
%G eng
%R 10.1109/CLUSTER51413.2022.00029

%0 Generic
%D 2022
%T Mixed precision and approximate 3D FFTs: Speed for accuracy trade-off with GPU-aware MPI and run-time data compression
%A Sebastien Cayrols
%A Jiali Li
%A George Bosilca
%A Stanimire Tomov
%A Alan Ayala
%A Jack Dongarra
%K All-to-all
%K Approximate FFTs
%K ECP
%K heFFTe
%K Lossy compression
%K mixed-precision algorithms
%K MPI
%B ICL Technical Report
%8 2022-05
%G eng

%0 Journal Article
%J Parallel Computing
%D 2022
%T OpenMP application experiences: Porting to accelerated nodes
%A Bak, Seonmyeong
%A Bertoni, Colleen
%A Boehm, Swen
%A Budiardja, Reuben
%A Chapman, Barbara M.
%A Doerfert, Johannes
%A Eisenbach, Markus
%A Finkel, Hal
%A Hernandez, Oscar
%A Huber, Joseph
%A Iwasaki, Shintaro
%A Kale, Vivek
%A Kent, Paul R.C.
%A Kwack, JaeHyuk
%A Lin, Meifeng
%A Luszczek, Piotr
%A Luo, Ye
%A Pham, Buu
%A Pophale, Swaroop
%A Ravikumar, Kiran
%A Sarkar, Vivek
%A Scogland, Thomas
%A Tian, Shilei
%A Yeung, P.K.
%X As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand the OpenMP implementation’s behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model to use for applications to be run on emerging and future platforms with accelerated nodes.
%B Parallel Computing
%V 109
%8 2022-03
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819121001009
%! Parallel Computing
%R 10.1016/j.parco.2021.102856

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2022
%T Optimal Checkpointing Strategies for Iterative Applications
%A Yishu Du
%A Guillaume Pallez
%A Loris Marchal
%A Yves Robert
%B IEEE Transactions on Parallel and Distributed Systems
%V 33
%P 507-522
%8 2022-03
%G eng
%U https://ieeexplore.ieee.org/document/9495174
%N 3
%R 10.1109/TPDS.2021.3099440

%0 Generic
%D 2022
%T PAQR: Pivoting Avoiding QR factorization
%A Wissam M. Sid-Lakhdar
%A Sebastien Cayrols
%A Daniel Bielich
%A Ahmad Abdelfattah
%A Piotr Luszczek
%A Mark Gates
%A Stanimire Tomov
%A Hans Johansen
%A David Williams-Young
%A Timothy A. Davis
%A Jack Dongarra
%X The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is deficient. On the other hand, QR with column pivoting (QRCP) is able to produce smaller forward errors on deficient matrices, but its cost is prohibitive compared to QR. The aim of this paper is to propose PAQR, an alternative solution method with the same cost (or smaller) as QR and as accurate as QRCP in practical cases, for the solution of rank-deficient linear least-squares problems. After presenting the algorithm and its implementations on different architectures, we compare its accuracy and performance results on a variety of application problems.
%B ICL Technical Report
%8 2022-06
%G eng

%0 Conference Paper
%B 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2022
%T Performance Analysis of Parallel FFT on Large Multi-GPU Systems
%A Ayala, Alan
%A Stanimire Tomov
%A Stoyanov, Miroslav
%A Haidar, Azzam
%A Dongarra, Jack
%B 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Lyon, France
%8 2022-08
%G eng
%U https://ieeexplore.ieee.org/document/9835388/
%R 10.1109/IPDPSW55747.2022.00072

%0 Book Section
%B Accelerated Computing with HIP
%D 2022
%T Performance Application Programming Interface
%A Anthony Danalis
%A Heike Jagode
%B Accelerated Computing with HIP
%I Sun, Baruah and Kaeli
%8 2022-12
%@ B0BR8KSS7K
%G eng
%U https://a.co/d/0DoG5as

%0 Conference Proceedings
%B Euro-Par 2021: Parallel Processing Workshops
%D 2022
%T Porting Sparse Linear Algebra to Intel GPUs
%A Tsai, Yuhsiang M.
%A Cojean, Terry
%A Anzt, Hartwig
%E Chaves, Ricardo
%E Heras, Dora B.
%E Ilic, Aleksandar
%E Unat, Didem
%E Badia, Rosa M.
%E Bracciali, Andrea
%E Diehl, Patrick
%E Dubey, Anshu
%E Oh, Sangyoon
%E Scott, Stephen L.
%E Ricci, Laura
%K Ginkgo
%K Intel GPUs
%K math library
%K oneAPI
%K SpMV
%X With discrete Intel GPUs entering the high performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this paper, we report how we prepare the Ginkgo math library for Intel GPUs by developing a kernel backend based on the DPC++ programming environment. We discuss conceptual differences to the CUDA and HIP programming models and describe workflows for simplified code conversion. We benchmark advanced sparse linear algebra routines utilizing the converted kernels to assess the efficiency of the DPC++ backend in the hardware-specific performance bounds. We compare the performance of basic building blocks against routines providing the same functionality that ship with Intel’s oneMKL vendor library.
%B Euro-Par 2021: Parallel Processing Workshops
%I Springer International Publishing
%C Lisbon, Portugal
%V 13098
%P 57 - 68
%8 2022-06
%@ 978-3-031-06155-4
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-031-06156-1_5
%R 10.1007/978-3-031-06156-1_5

%0 Conference Proceedings
%B 2022 SIAM Conference on Parallel Processing for Scientific Computing (PP)
%D 2022
%T Prediction of Optimal Solvers for Sparse Linear Systems Using Deep Learning
%A Funk, Yannick
%A Götz, Markus
%A Anzt, Hartwig
%E Li, Xiaoye S.
%E Teranishi, Keita
%X Solving sparse linear systems is a key task in a number of computational problems, such as data analysis and simulations, and majorly determines overall execution time. Choosing a suitable iterative solver algorithm, however, can significantly improve time-to-completion. We present a deep learning approach designed to predict the optimal iterative solver for a given sparse linear problem. For this, we detail useful linear system features to drive the prediction process, the metrics we use to quantify the iterative solvers' time-to-approximation performance and a comprehensive experimental evaluation of the prediction quality of the neural network. Using a hyperparameter optimization and an ablation study on the SuiteSparse matrix collection we have inferred the importance of distinct features, achieving a top-1 classification accuracy of 60%.
%B 2022 SIAM Conference on Parallel Processing for Scientific Computing (PP)
%I Society for Industrial and Applied Mathematics
%C Philadelphia, PA
%P 14 - 24
%8 2022
%G eng
%U https://epubs.siam.org/doi/10.1137/1.9781611977141.2
%R 10.1137/1.9781611977141.2

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2022
%T Providing performance portable numerics for Intel GPUs
%A Tsai, Yu-Hsiang M.
%A Cojean, Terry
%A Anzt, Hartwig
%K Ginkgo
%K Intel GPUs
%K math library
%K oneAPI
%K SpMV
%X With discrete Intel GPUs entering the high-performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this article, we report how we enable the Ginkgo math library to execute on Intel GPUs by developing a kernel backend based on the DPC++ programming environment. We discuss conceptual differences between the CUDA and DPC++ programming models and describe workflows for simplified code conversion. We evaluate the performance of basic and advanced sparse linear algebra routines available in Ginkgo's DPC++ backend in the hardware-specific performance bounds and compare against routines providing the same functionality that ship with Intel's oneMKL vendor library.
%B Concurrency and Computation: Practice and Experience
%V 17
%8 2022-10
%G eng
%U https://onlinelibrary.wiley.com/doi/full/10.1002/cpe.7400
%! Concurrency and Computation
%R 10.1002/cpe.7400

%0 Conference Paper
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER)
%D 2022
%T Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG
%A Schuchart, Joseph
%A Nookala, Poornima
%A Herault, Thomas
%A Valeev, Edward F.
%A George Bosilca
%K Dataflow graph
%K Hardware
%K Instruction sets
%K Memory management
%K PaRSEC
%K parallel programming
%K runtime
%K scalability
%K Task analysis
%K task-based programming
%K Template Task Graph
%K TTG
%X Shared memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well-suited to cope with the ubiquitous multi- and many-core systems since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. It is general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss in efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we will discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aiming at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and which is able to compete with OpenMP in shared memory.
%B 2022 IEEE International Conference on Cluster Computing (CLUSTER)
%I IEEE
%C Heidelberg, Germany
%8 2022-09
%G eng
%U https://ieeexplore.ieee.org/document/9912704/
%R 10.1109/CLUSTER51413.2022.00026

%0 Conference Paper
%B 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS)
%D 2022
%T A Python Library for Matrix Algebra on GPU and Multicore Architectures
%A Nance, Delario
%A Stanimire Tomov
%A Wong, Kwai
%B 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS)
%I IEEE
%C Denver, CO
%8 2022-12
%G eng
%U https://ieeexplore.ieee.org/document/9973474/
%R 10.1109/MASS56207.2022.00121

%0 Generic
%D 2022
%T Randomized Numerical Linear Algebra: A Perspective on the Field with an Eye to Software
%A Riley Murray
%A James Demmel
%A Michael W. Mahoney
%A N. Benjamin Erichson
%A Maksim Melnichenko
%A Osman Asif Malik
%A Laura Grigori
%A Piotr Luszczek
%A Michał Dereziński
%A Miles E. Lopes
%A Tianyu Liang
%A Hengrui Luo
%A Jack Dongarra
%K Randomized algorithms
%X Randomized numerical linear algebra – RandNLA, for short – concerns the use of randomization as a resource to develop improved algorithms for large-scale linear algebra computations. The origins of contemporary RandNLA lay in theoretical computer science, where it blossomed from a simple idea: randomization provides an avenue for computing approximate solutions to linear algebra problems more efficiently than deterministic algorithms. This idea proved fruitful in and was largely driven by the development of scalable algorithms for machine learning and statistical data analysis applications. However, the true potential of RandNLA only came into focus once it began to integrate with the fields of numerical analysis and “classical” numerical linear algebra. Through the efforts of many individuals, randomized algorithms have been developed that provide full control over the accuracy of their solutions and that are every bit as reliable as algorithms that might be found in libraries such as LAPACK. The spectrum of possibilities offered by RandNLA has created a virtuous cycle of contributions by numerical analysts, statisticians, theoretical computer scientists, and the machine learning community. Recent years have even seen the incorporation of certain RandNLA methods into MATLAB, the NAG Library, and NVIDIA’s cuSOLVER. In view of these developments, we believe the time is ripe to accelerate the adoption of RandNLA in the scientific community. In particular, we believe the community stands to benefit significantly from a suitably defined “RandBLAS” and “RandLAPACK,” to serve as standard libraries for RandNLA, in much the same way that BLAS and LAPACK serve as standards for deterministic linear algebra. This monograph surveys the field of RandNLA as a step toward building meaningful RandBLAS and RandLAPACK libraries. Section 1 begins by setting scope and design principles for RandLAPACK and summarizing subsequent sections of the monograph. Section 2 focuses on RandBLAS, which is to be responsible for sketching. Details of functionality suitable for RandLAPACK are covered in the five sections that follow. Specifically, Sections 3 to 5 cover least squares and optimization, low-rank approximation, and other select problems that are well-understood in how they benefit from randomized algorithms. The remaining sections – on statistical leverage scores (Section 6) and tensor computations (Section 7) – read more like traditional surveys. The different flavor of these latter sections reflects how, in our assessment, the literature on these topics is still maturing. We provide a substantial amount of pseudo-code and supplementary material over the course of five appendices. Much of the pseudo-code has been tested via publicly available Matlab and Python implementations.
%B University of California, Berkeley EECS Technical Report
%I University of California, Berkeley
%8 2022-11
%G eng
%U https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-258.html
%R 10.48550/arXiv.2302.1147

%0 Report
%D 2022
%T Reinventing High Performance Computing: Challenges and Opportunities
%A Daniel Reed
%A Dennis Gannon
%A Jack Dongarra
%X The world of computing is in rapid transition, now dominated by a world of smartphones and cloud services, with profound implications for the future of advanced scientific computing. Simply put, high-performance computing (HPC) is at an important inflection point. For the last 60 years, the world's fastest supercomputers were almost exclusively produced in the United States on behalf of scientific research in the national laboratories. Change is now in the wind. While costs now stretch the limits of U.S. government funding for advanced computing, Japan and China are now leaders in the bespoke HPC systems funded by government mandates. Meanwhile, the global semiconductor shortage and political battles surrounding fabrication facilities affect everyone. However, another, perhaps even deeper, fundamental change has occurred. The major cloud vendors have invested in global networks of massive scale systems that dwarf today's HPC systems. Driven by the computing demands of AI, these cloud systems are increasingly built using custom semiconductors, reducing the financial leverage of traditional computing vendors. These cloud systems are now breaking barriers in game playing and computer vision, reshaping how we think about the nature of scientific computation. Building the next generation of leading edge HPC systems will require rethinking many fundamentals and historical approaches by embracing end-to-end co-design; custom hardware configurations and packaging; large-scale prototyping, as was common thirty years ago; and collaborative partnerships with the dominant computing ecosystem companies, smartphone and cloud computing vendors.
%B ICL Technical Report
%8 2022-03
%G eng

%0 Generic
%D 2022
%T Report on the Oak Ridge National Laboratory's Frontier System
%A Jack Dongarra
%A Al Geist
%B ICL Technical Report
%8 2022-05
%G eng

%0 Conference Proceedings
%B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22)
%D 2022
%T Reshaping Geostatistical Modeling and Prediction for Extreme-Scale Environmental Applications
%A Cao, Qinglei
%A Abdulah, Sameh
%A Rabab Alomairy
%A Pei, Yu
%A Pratik Nag
%A George Bosilca
%A Dongarra, Jack
%A Genton, Marc G.
%A Keyes, David
%A Ltaief, Hatem
%A Sun, Ying
%K climate/weather prediction
%K dynamic runtime systems
%K high performance computing
%K low-rank matrix approximations
%K mixed-precision computations
%K space-time geospatial statistics
%K Task-based programming models
%X We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation.
%B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22)
%I IEEE Press
%C Dallas, TX
%8 2022-11
%@ 9784665454445
%G eng
%U https://dl.acm.org/doi/abs/10.5555/3571885.3571888

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2022
%T Resiliency in numerical algorithm design for extreme scale simulations
%A Agullo, Emmanuel
%A Altenbernd, Mirco
%A Anzt, Hartwig
%A Bautista-Gomez, Leonardo
%A Benacchio, Tommaso
%A Bonaventura, Luca
%A Bungartz, Hans-Joachim
%A Chatterjee, Sanjay
%A Ciorba, Florina M
%A DeBardeleben, Nathan
%A Drzisga, Daniel
%A Eibl, Sebastian
%A Engelmann, Christian
%A Gansterer, Wilfried N
%A Giraud, Luc
%A Göddeke, Dominik
%A Heisig, Marco
%A Jézéquel, Fabienne
%A Kohl, Nils
%A Li, Xiaoye Sherry
%A Lion, Romain
%A Mehl, Miriam
%A Mycek, Paul
%A Obersteiner, Michael
%A Quintana-Ortí, Enrique S
%A Rizzi, Francesco
%A Rüde, Ulrich
%A Schulz, Martin
%A Fung, Fred
%A Speck, Robert
%A Stals, Linda
%A Teranishi, Keita
%A Thibault, Samuel
%A Thönnes, Dominik
%A Wagner, Andreas
%A Wohlmuth, Barbara
%K Fault tolerance
%K Numerical algorithms
%K parallel computer architecture
%K resilience
%B The International Journal of High Performance Computing Applications
%V 36
%P 251 - 285
%8 2022-03
%G eng
%U http://journals.sagepub.com/doi/10.1177/10943420211055188
%N 2
%! The International Journal of High Performance Computing Applications
%R 10.1177/10943420211055188

%0 Conference Paper
%B 2022 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2022
%T Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance
%A Piotr Luszczek
%A Cade Brown
%K Analytical models
%K Benchmark testing
%K Cloud computing
%K Computational modeling
%K Data models
%K Measurement
%K Satellites
%X We present a benchmarking platform for surrogate ML/AI models that enables the essential properties for open science and allows them to be findable, accessible, interoperable, and reusable. We also present a use case of cloud cover modeling, analysis, and experimental testing based on a large dataset of multi-spectral satellite sensor data. We use this particular evaluation to highlight the plethora of choices that need resolution for the life cycle of supporting the scientific workflows with data-driven models that need to be first trained to satisfactory accuracy and later monitored during field usage for proper feedback into both computational results and future data model improvements. Unlike traditional testing, performance, or analysis efforts, we focus exclusively on science-oriented metrics as the relevant figures of merit.
%B 2022 IEEE High Performance Extreme Computing Conference (HPEC)
%I IEEE
%8 2022-09
%G eng
%U https://ieeexplore.ieee.org/document/9926401/
%R 10.1109/HPEC55821.2022.9926401

%0 Conference Paper
%B ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems
%D 2022
%T Threshold Pivoting for Dense LU Factorization
%A Neil Lindquist
%A Mark Gates
%A Piotr Luszczek
%A Jack Dongarra
%X LU factorization is a key approach for solving large, dense systems of linear equations. Partial row pivoting is commonly used to ensure numerical stability; however, the data movement needed for the row interchanges can reduce performance. To improve this, we propose using threshold pivoting to find pivots almost as good as those selected by partial pivoting but that result in less data movement. Our theoretical analysis bounds the element growth similarly to partial pivoting; however, it also shows that the growth of threshold pivoting for a given matrix cannot be bounded by that of partial pivoting and vice versa. Additionally, we experimentally tested the approach on the Summit supercomputer. Threshold pivoting improved performance by up to 32% without a significant effect on accuracy. For a more aggressive configuration with up to one digit of accuracy lost, the improvement was as high as 44%.
%B ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems
%I IEEE
%C Dallas, Texas
%8 2022-11
%G eng
%R 10.1109/ScalAH56622.2022.00010

%0 Journal Article
%J Parallel Computing
%D 2022
%T Using long vector extensions for MPI reductions
%A Zhong, Dong
%A Cao, Qinglei
%A George Bosilca
%A Dongarra, Jack
%X The modern CPU’s design, including the deep memory hierarchies and SIMD/vectorization capability, has a more significant impact on algorithms’ efficiency than the modest frequency increase observed recently. The current introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become a critical software component to increase efficiency and close the gap to peak performance. In this paper, we investigate the impact of the vectorization of MPI reduction operations. We propose an implementation of predefined MPI reduction operations using vector intrinsics (AVX and SVE) to improve the time-to-solution of the predefined MPI reduction operations. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures. Experiments conducted on varied architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX) show that the proposed vector extension optimized reduction operations significantly reduce completion time for collective communication reductions. With these optimizations, we achieve higher memory bandwidth and an increased efficiency for local computations, which directly benefit the overall cost of collective reductions and applications based on them.
%B Parallel Computing
%V 109
%P 102871
%8 2022-03
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819121001137
%! Parallel Computing
%R 10.1016/j.parco.2021.102871

%0 Journal Article
%J Journal of Computational Science
%D 2021
%T 20 years of computational science: Selected papers from 2020 International Conference on Computational Science
%A Kovalchuk, Sergey V
%A Krzhizhanovskaya, Valeria V
%A Sloot, PMA
%A Závodszky, Gábor
%A Lees, Michael H
%A Paszyński, M
%A Jack Dongarra
%X We thank the authors of the selected papers for their valuable contributions, the reviewers of this special section for their in-depth reviews and constructive comments, the ICCS program committee members, and workshop organizers for their diligent work ensuring the high standard of accepted ICCS papers. As always, we also thank Springer for publishing the conference proceedings and Elsevier for their continuous support and inspiration during the preparation and publishing of this virtual special issue.
%B Journal of Computational Science
%V 53
%P 101395–101395
%G eng
%R 10.1016/j.jocs.2021.101395

%0 Generic
%D 2021
%T Accelerating FFT towards Exascale Computing
%A Alan Ayala
%A Stanimire Tomov
%A Haidar, Azzam
%A Stoyanov, M.
%A Cayrols, Sebastien
%A Li, Jiali
%A George Bosilca
%A Jack Dongarra
%I NVIDIA GPU Technology Conference (GTC2021)
%G eng

%0 Conference Paper
%B 2021 Workshop on Exascale MPI (ExaMPI)
%D 2021
%T Accelerating Multi-Process Communication for Parallel 3-D FFT
%A Ayala, Alan
%A Tomov, Stan
%A Stoyanov, Miroslav
%A Haidar, Azzam
%A Dongarra, Jack
%X Today's largest and most powerful supercomputers are built on heterogeneous platforms, and using the combined power of multi-core CPUs and GPUs has had a great impact on accelerating large-scale applications. However, on these architectures, inter-processor communication becomes a bottleneck for parallel algorithms such as the Fast Fourier Transform (FFT) and limits their scalability. In this paper, we present techniques for reducing multi-process communication cost during the computation of FFTs, considering hybrid network connections such as those expected on upcoming exascale machines. Among our techniques, we present algorithmic tuning, making use of phase diagrams; parametric tuning, using different FFT settings; and MPI distribution tuning based on FFT size and computational resources available. We present several experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 40,960 IBM Power9 cores and 6,144 NVIDIA V100 GPUs.
%B 2021 Workshop on Exascale MPI (ExaMPI)
%I IEEE
%C St. Louis, MO, USA
%8 2021-12
%G eng
%U https://ieeexplore.ieee.org/document/9652837/
%R 10.1109/ExaMPI54564.2021.00011

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2021
%T Accelerating Restarted GMRES with Mixed Precision Arithmetic
%A Neil Lindquist
%A Piotr Luszczek
%A Jack Dongarra
%K Convergence
%K Error correction
%K iterative methods
%K Kernel
%K linear systems
%K Stability analysis
%X The generalized minimum residual method (GMRES) is a commonly used iterative Krylov solver for sparse, non-symmetric systems of linear equations. Like other iterative solvers, data movement dominates its run time. To improve its performance, we propose running GMRES in reduced precision with key operations remaining in full precision. Additionally, we provide theoretical results linking the convergence of finite-precision GMRES with classical Gram-Schmidt with reorthogonalization (CGSR) and its infinite-precision counterpart, which helps justify the convergence of this method to double-precision accuracy. We tested the mixed-precision approach with a variety of matrices and preconditioners on a GPU-accelerated node. Excluding the incomplete LU factorization without fill-in (ILU(0)) preconditioner, we achieved average speedups ranging from 8 to 61 percent relative to comparable double-precision implementations, with the simpler preconditioners achieving the higher speedups.
%B IEEE Transactions on Parallel and Distributed Systems
%8 2021-06
%G eng
%R 10.1109/TPDS.2021.3090757

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2021
%T Budget-aware scheduling algorithms for scientific workflows with stochastic task weights on IaaS Cloud platforms
%A Eddy Caron
%A Yves Caniou
%A Aurélie Kong Win Chang
%A Yves Robert
%B Concurrency and Computation: Practice and Experience
%V 33
%P e6065
%G eng
%R 10.1002/cpe.6065

%0 Journal Article
%J Parallel Computing
%D 2021
%T Callback-based completion notification using MPI Continuations
%A Schuchart, Joseph
%A Samfass, Philipp
%A Niethammer, Christoph
%A Gracia, José
%A George Bosilca
%K MPI
%K MPI Continuations
%K OmpSs
%K OpenMP
%K parsec
%K TAMPI
%K Task-based programming models
%X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and nonblocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space.
%B Parallel Computing
%V 21238566
%P 102793
%8 Jan-05-2021
%G eng
%U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000466?via%3Dihub
%N 0225
%! Parallel Computing
%R 10.1016/j.parco.2021.102793

%0 Conference Paper
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%D 2021
%T Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure
%A Thomas Herault
%A Yves Robert
%A George Bosilca
%A Robert Harrison
%A Cannada Lewis
%A Edward Valeev
%A Jack Dongarra
%K block-sparse matrix multiplication
%K distributed-memory
%K Electronic structure
%K multi-GPU node
%K parsec
%K tensor contraction
%X Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%I IEEE
%C Portland, OR
%8 2021-05
%G eng
%U https://hal.inria.fr/hal-02970659/document

%0 Generic
%D 2021
%T DTE: PaRSEC Enabled Libraries and Applications
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%I 2021 Exascale Computing Project Annual Meeting
%8 2021-04
%G eng

%0 Journal Article
%J Int. J. of Networking and Computing
%D 2021
%T Dynamic DAG scheduling under memory constraints for shared-memory platforms
%A Gabriel Bathie
%A Loris Marchal
%A Yves Robert
%A Samuel Thibault
%B Int. J. of Networking and Computing
%V 11
%P 27-49
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2021
%T Efficient exascale discretizations: High-order finite element methods
%A Kolev, Tzanio
%A Fischer, Paul
%A Min, Misun
%A Jack Dongarra
%A Brown, Jed
%A Dobrev, Veselin
%A Warburton, Tim
%A Stanimire Tomov
%A Shephard, Mark S
%A Abdelfattah, Ahmad
%A others
%K co-design
%K high-order discretizations
%K High-performance computing
%K PDEs
%K unstructured grids
%X Efficient exploitation of exascale architectures requires rethinking of the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra fine-grain parallelism and maximize the ratio of floating point operations to energy intensive data movement. One of the few viable approaches to achieve high efficiency in the area of PDE discretizations on unstructured grids is to use matrix-free/partially assembled high-order finite element methods, since these methods can increase the accuracy and/or lower the computational time due to reduced data motion. In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps and discretization libraries and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, high-order data visualization, and list examples of collaborations with several ECP and external applications.
%B The International Journal of High Performance Computing Applications
%P 10943420211020803
%G eng
%R 10.1177/10943420211020803

%0 Book Section
%B Tools for High Performance Computing 2018/2019
%D 2021
%T Effortless Monitoring of Arithmetic Intensity with PAPI’s Counter Analysis Toolkit
%A Daniel Barry
%A Danalis, Anthony
%A Heike Jagode
%X With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that heavily utilize numerical kernels. Performance metrics in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it becomes more and more unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity seems particularly true for events related to hardware that resides beyond the compute core, such as events related to memory traffic. Still, traffic to memory is a necessary characteristic for determining arithmetic intensity. To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for reading and writing bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels.
%B Tools for High Performance Computing 2018/2019
%I Springer
%P 195–218
%@ 978-3-030-66057-4
%G eng
%R 10.1007/978-3-030-66057-4_11

%0 Conference Proceedings
%B 42nd Real Time Systems Symposium (RTSS)
%D 2021
%T Evaluating Task Dropping Strategies for Overloaded Real-Time Systems (Work-In-Progress)
%A Yiqin Gao
%A Guillaume Pallez
%A Yves Robert
%A Frederic Vivien
%B 42nd Real Time Systems Symposium (RTSS)
%I IEEE Computer Society Press
%G eng

%0 Journal Article
%J IEEE Access
%D 2021
%T Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems
%A Iqbal, Zafar
%A Nooshabadi, Saeid
%A Yamazaki, Ichitaro
%A Stanimire Tomov
%A Jack Dongarra
%B IEEE Access
%G eng
%R 10.1109/ACCESS.2021.3106054

%0 Generic
%D 2021
%T Ginkgo: A Sparse Linear Algebra Library for HPC
%A Hartwig Anzt
%A Natalie Beams
%A Terry Cojean
%A Fritz Göbel
%A Thomas Grützmacher
%A Aditya Kashi
%A Pratik Nayak
%A Tobias Ribizel
%A Yuhsiang M. Tsai
%I 2021 ECP Annual Meeting
%8 2021-04
%G eng

%0 Journal Article
%J Parallel Computing
%D 2021
%T GPU algorithms for Efficient Exascale Discretizations
%A Abdelfattah, Ahmad
%A Valeria Barra
%A Natalie Beams
%A Bleile, Ryan
%A Brown, Jed
%A Camier, Jean-Sylvain
%A Carson, Robert
%A Chalmers, Noel
%A Dobrev, Veselin
%A Dudouit, Yohann
%A others
%K Exascale applications
%K Finite element methods
%K GPU acceleration
%K high-order discretizations
%K High-performance computing
%X In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.
%B Parallel Computing
%V 108
%P 102841
%G eng
%R 10.1016/j.parco.2021.102841

%0 Generic
%D 2021
%T Interim Report on Benchmarking FFT Libraries on High Performance Systems
%A Alan Ayala
%A Stanimire Tomov
%A Piotr Luszczek
%A Cayrols, Sebastien
%A Ragghianti, Gerald
%A Jack Dongarra
%X The Fast Fourier Transform (FFT) is used in many applications such as molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and many wireless multimedia applications. FFTs are also heavily used in ECP applications, such as EXAALT, Copa, ExaSky-HACC, ExaWind, WarpX, and many others. As these applications’ accuracy and speed depend on the performance of the FFTs, we designed an FFT benchmark to measure performance and scalability of currently available FFT packages and present the results from a pre-Exascale platform. Our benchmarking also stresses the overall capacity of system interconnect; thus, it may be considered as an indicator of the bisection bandwidth, communication contention noise, and the software overheads in MPI collectives that are of interest to many other ECP applications and libraries. This FFT benchmarking project aims to show the strengths and weaknesses of multiple FFT libraries and to indicate what can be done to improve their performance. In particular, we believe that the benchmarking results could help design and implement a fast and robust FFT library for 2D and 3D inputs, while targeting large-scale heterogeneous systems with multicore processors and hardware accelerators that are co-designed in tandem with ECP applications. Our work involves studying and analyzing state-of-the-art FFT software both from vendors and available as open-source codes to better understand their performance.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2021-07
%G eng
%9 ICL Tech Report

%0 Journal Article
%J Parallel Computing
%D 2021
%T An international survey on MPI users
%A Atsushi Hori
%A Emmanuel Jeannot
%A George Bosilca
%A Takahiro Ogura
%A Balazs Gerofi
%A Jie Yin
%A Yutaka Ishikawa
%K message passing interface
%K MPI
%K survey
%X The Message Passing Interface (MPI) plays a crucial part in the parallel computing ecosystem, a driving force behind many of the high-performance computing (HPC) successes. To maintain its relevance to the user community, and in particular to the growing HPC community at large, the MPI standard needs to identify and understand the MPI users’ concerns and expectations, and adapt accordingly to continue to efficiently bridge the gap between users and hardware. This questionnaire survey was conducted using two online questionnaire frameworks and has gathered more than 850 answers from 42 countries since February 2019. Some of the preceding surveys of MPI uses are questionnaire surveys like ours, while others are conducted either by analyzing MPI programs to reveal static behavior or by using profiling tools to analyze the dynamic runtime behavior of MPI jobs. Our survey is different from other questionnaire surveys in terms of its larger number of participants and wide geographic spread. As a result, it is possible to illustrate the current status of MPI users more accurately and with a wider geographical distribution. In this report, we will show some interesting findings, compare the results with preceding studies when possible, and provide some recommendations for the MPI Forum based on the findings.
%B Parallel Computing
%V 108
%8 2021-12
%G eng
%U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000983
%R 10.1016/j.parco.2021.102853

%0 Book Section
%B Rare Earth Elements and Actinides: Progress in Computational Science Applications
%D 2021
%T An Introduction to High Performance Computing and Its Intersection with Advances in Modeling Rare Earth Elements and Actinides
%A Deborah A. Penchoff
%A Edward Valeev
%A Heike Jagode
%A Piotr Luszczek
%A Anthony Danalis
%A George Bosilca
%A Robert J. Harrison
%A Jack Dongarra
%A Theresa L. Windus
%K actinide
%K Computational modeling
%K HPC
%K REE
%X Computationally driven solutions in nuclear and radiochemistry heavily depend on efficient modeling of Rare Earth Elements (REEs) and actinides. Accurate modeling of REEs and actinides faces challenges stemming from limitations from an imbalanced hardware-software ecosystem and its implications on inefficient use of High Performance Computing (HPC). This chapter provides a historical perspective on the evolution of HPC hardware, its intersectionality with domain sciences, the importance of benchmarks for performance, and an overview of challenges and advances in modeling REEs and actinides. This chapter intends to provide an introduction for researchers at the intersection of scientific computing, software development for HPC, and applied computational modeling of REEs and actinides. The chapter is structured in five sections. First, the Introduction includes subsections focusing on the Importance of REEs and Actinides (1.1), Hardware, Software, and the HPC Ecosystem (1.2), and Electronic Structure Modeling of REEs and Actinides (1.3). Second, a section in High Performance Computing focuses on the TOP500 (2.1), HPC Performance (2.2), HPC Benchmarks: Processing, Bandwidth, and Latency (2.3), and HPC Benchmarks and their Relationship to Chemical Modeling (2.4). Third, the Software Challenges and Advances focus on NWChem/NWChemEx (3.1), MADNESS (3.2), and MPQC (3.3). The fourth section provides a short overview of Artificial Intelligence in HPC applications relevant to nuclear and radiochemistry. The fifth section illustrates A Protocol to Evaluate Complexation Preferences in Separations of REEs and Actinides through Computational Modeling.
%B Rare Earth Elements and Actinides: Progress in Computational Science Applications
%I American Chemical Society
%C Washington, DC
%V 1388
%P 3-53
%8 2021-10
%@ ISBN13: 9780841298255 eISBN: 9780841298248
%G eng
%U https://pubs.acs.org/doi/10.1021/bk-2021-1388.ch001
%& 1
%R 10.1021/bk-2021-1388.ch001

%0 Book
%D 2021
%T Lecture Notes in Computer Science: High Performance Computing
%A Heike Jagode
%A Anzt, Hartwig
%A Ltaief, Hatem
%A Piotr Luszczek
%X This book constitutes the refereed post-conference proceedings of 9 workshops held at the 35th International ISC High Performance 2021 Conference, in Frankfurt, Germany, in June-July 2021: Second International Workshop on the Application of Machine Learning Techniques to Computational Fluid Dynamics and Solid Mechanics Simulations and Analysis; HPC-IODC: HPC I/O in the Data Center Workshop; Compiler-assisted Correctness Checking and Performance Optimization for HPC; Machine Learning on HPC Systems; 4th International Workshop on Interoperability of Supercomputing and Cloud Technologies; 2nd International Workshop on Monitoring and Operational Data Analytics; 16th Workshop on Virtualization in High-Performance Cloud Computing; Deep Learning on Supercomputers; 5th International Workshop on In Situ Visualization. The 35 papers included in this volume were carefully reviewed and selected. They cover all aspects of research, development, and application of large-scale, high performance experimental and commercial systems. Topics include high-performance computing (HPC), computer architecture and hardware, programming models, system software, performance analysis and modeling, compiler analysis and optimization techniques, software sustainability, scientific applications, deep learning.
%I Springer International Publishing
%V 12761
%@ 978-3-030-90538-5
%G eng
%R 10.1007/978-3-030-90539-2

%0 Conference Paper
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%D 2021
%T Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems
%A Qinglei Cao
%A Yu Pei
%A Kadir Akbudak
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%K asynchronous executions and load balancing
%K dynamic runtime system
%K environmental applications
%K High-performance computing
%K low-rank matrix computations
%K task-based programming model
%K user productivity
%X The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of Matern matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide a dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, and supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorization by up to 7-fold on a large-scale distributed-memory system, while minimizing the memory footprint up to a 44-fold factor. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%I IEEE
%C Portland, OR
%8 2021-05
%G eng

%0 Journal Article
%J Journal of Open Source Software
%D 2021
%T libCEED: Fast algebra for high-order element-based discretizations
%A Jed Brown
%A Ahmad Abdelfattah
%A Valeria Barra
%A Natalie Beams
%A Jean-Sylvain Camier
%A Veselin Dobrev
%A Yohann Dudouit
%A Leila Ghaffari
%A Tzanio Kolev
%A David Medina
%A Will Pazner
%A Thilina Ratnayaka
%A Jeremy Thompson
%A Stanimire Tomov
%K finite elements
%K high-order methods
%K High-performance computing
%K matrix-free
%K spectral elements
%X Finite element methods are widely used to solve partial differential equations (PDE) in science and engineering, but their standard implementation (Arndt et al., 2020; Kirk et al., 2006; Logg et al., 2012) relies on assembling sparse matrices. Sparse matrix multiplication and triangular operations perform a scalar multiply and add for each nonzero entry, just 2 floating point operations (flops) per scalar that must be loaded from memory (Williams et al., 2009). Modern hardware is capable of nearly 100 flops per scalar streamed from memory (Rupp, 2020) so sparse matrix operations cannot achieve more than about 2% utilization of arithmetic units. Matrix assembly becomes even more problematic when the polynomial degree p of the basis functions is increased, resulting in O(p^d) storage and O(p^(2d)) compute per degree of freedom (DoF) in d dimensions. Methods pioneered by the spectral element community (Deville et al., 2002; Orszag, 1980) exploit problem structure to reduce costs to O(1) storage and O(p) compute per DoF, with very high utilization of modern CPUs and GPUs. Unfortunately, high-quality implementations have been relegated to applications and intrusive frameworks that are often difficult to extend to new problems or incorporate into legacy applications, especially when strong preconditioners are required. libCEED, the Code for Efficient Extensible Discretization (Abdelfattah et al., 2021), is a lightweight library that provides a purely algebraic interface for linear and nonlinear operators and preconditioners with element-based discretizations. libCEED provides portable performance via run-time selection of implementations optimized for CPUs and GPUs, including support for just-in-time (JIT) compilation. It is designed for convenient use in new and legacy software, and offers interfaces in C99 (International Standards Organisation, 1999), Fortran77 (ANSI, 1978), Python (Python, 2021), Julia (Bezanson et al., 2017), and Rust (Rust, 2021). Users and library developers can integrate libCEED at a low level into existing applications in place of existing matrix-vector products without significant refactoring of their own discretization infrastructure. Alternatively, users can utilize integrated libCEED support in MFEM (Anderson et al., 2020; MFEM, 2021). In addition to supporting applications and discretization libraries, libCEED provides a platform for performance engineering and co-design, as well as an algebraic interface for solvers research like adaptive p-multigrid, much like how sparse matrix libraries enable development and deployment of algebraic multigrid solvers.
%B Journal of Open Source Software
%V 6
%P 2945
%G eng
%U https://doi.org/10.21105/joss.02945
%R 10.21105/joss.02945

%0 Generic
%D 2021
%T Linear Algebra Preparation for Emergent Neural Network Architectures: MAGMA, BLAS, and Batched GPU Computing
%A Stanimire Tomov
%A Kwai Wong
%A Rocco Febbo
%A Julian Halloy
%I LAPENNA Workshop
%C Virtual
%8 2021-11
%G eng

%0 Generic
%D 2021
%T MAGMA: Evolution and Revolution
%A Stan Tomov
%I ICL Lunch Talk Seminar
%C Knoxville, TN
%8 2021-07
%G eng

%0 Journal Article
%J Computer Physics Communications
%D 2021
%T Materials fingerprinting classification
%A Spannaus, Adam
%A Law, Kody J.H.
%A Piotr Luszczek
%A Nasrin, Farzana
%A Micucci, Cassie Putman
%A Liaw, Peter K.
%A Santodonato, Louis J.
%A Keffer, David J.
%A Maroulas, Vasileios
%K Atom probe tomography
%K High entropy alloy
%K Machine Learning
%K Materials discovery
%K Topological data analysis
%X Significant progress in many classes of materials could be made with the availability of experimentally-derived large datasets composed of atomic identities and three-dimensional coordinates. Methods for visualizing the local atomic structure, such as atom probe tomography (APT), which routinely generate datasets comprised of millions of atoms, are an important step in realizing this goal. However, state-of-the-art APT instruments generate noisy and sparse datasets that provide information about elemental type, but obscure atomic structures, thus limiting their subsequent value for materials discovery. The application of a materials fingerprinting process, a machine learning algorithm coupled with topological data analysis, provides an avenue by which heretofore unprecedented structural information can be extracted from an APT dataset. As a proof of concept, the material fingerprint is applied to high-entropy alloy APT datasets containing body-centered cubic (BCC) and face-centered cubic (FCC) crystal structures. A local atomic configuration centered on an arbitrary atom is assigned a topological descriptor, with which it can be characterized as a BCC or FCC lattice with near perfect accuracy, despite the inherent noise in the dataset. This successful identification of a fingerprint is a crucial first step in the development of algorithms which can extract more nuanced information, such as chemical ordering, from existing datasets of complex materials.
%B Computer Physics Communications
%P 108019
%8 Jan-05-2021
%G eng
%U https://linkinghub.elsevier.com/retrieve/pii/S0010465521001314
%! Computer Physics Communications
%R 10.1016/j.cpc.2021.108019

%0 Conference Proceedings
%B IPDPS'2021, the 35th IEEE International Parallel and Distributed Processing Symposium
%D 2021
%T Max-Stretch Minimization on an Edge-Cloud Platform
%A Anne Benoit
%A Redouane Elghazi
%A Yves Robert
%B IPDPS'2021, the 35th IEEE International Parallel and Distributed Processing Symposium
%I IEEE Computer Society Press
%G eng

%0 Generic
%D 2021
%T Mixed-Precision Algorithm for Finding Selected Eigenvalues and Eigenvectors of Symmetric and Hermitian Matrices
%A Yaohung M. Tsai
%A Piotr Luszczek
%A Jack Dongarra
%K eigenvalue solver
%K hardware accelerators
%K mixed-precision algorithms
%X As new hardware is being equipped with powerful low-precision capabilities driven primarily by the needs of the burgeoning field of Artificial Intelligence (AI), mixed-precision algorithms are now showing far greater potential and attracting renewed interest in the scientific computing community. The multi-precision methods commonly follow an approximate-iterate scheme by first obtaining the approximate solution from a low-precision factorization and solve. Then, they iteratively refine the solution to the desired accuracy that is often as high as what is possible with traditional approaches. While targeting symmetric and Hermitian eigenvalue problems of the form Ax=λx, we revisit the SICE algorithm proposed by Dongarra et al. By applying the Sherman-Morrison formula on the diagonally-shifted tridiagonal systems, we propose an updated SICE-SM algorithm. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA software libraries for numerical linear algebra, we achieved up to 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement when compared with full double complex precision solvers for the cases with a portion of eigenvalues and eigenvectors requested.
%B ICL Technical Report
%8 2021-08
%G eng

%0 Generic
%D 2021
%T A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms
%A Daniel Sharp
%A Miroslav Stoyanov
%A Stanimire Tomov
%A Jack Dongarra
%B ICL Technical Report
%I University of Tennessee
%8 2021-08
%G eng
%9 ICL Tech Report

%0 Generic
%D 2021
%T P1673R3: A Free Function Linear Algebra Interface Based on the BLAS
%A Mark Hoemmen
%A Daisy Hollman
%A Christian Trott
%A Daniel Sunderland
%A Nevin Liber
%A Li-Ta Lo
%A Damien Lebrun-Grandie
%A Graham Lopez
%A Peter Caday
%A Sarah Knepper
%A Piotr Luszczek
%A Timothy Costa
%K C++
%K linear algebra
%X We believe this proposal is complementary to P1385, a proposal for a C++ Standard linear algebra library that introduces matrix and vector classes and overloaded arithmetic operators. In fact, we think that our proposal would make a natural foundation for a library like what P1385 proposes. However, a free function interface -- which clearly separates algorithms from data structures -- more naturally allows for a richer set of operations such as what the BLAS provides. A natural extension of the present proposal would include accepting P1385's matrix and vector objects as input for the algorithms proposed here. A straightforward way to do that would be for P1385's matrix and vector objects to make views of their data available as basic_mdspan.
%B ISO JTC1 SC22 WG21
%I ISO
%8 2021-04
%G eng
%U http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1673r3.pdf
%9 standard

%0 Conference Paper
%B EuroMPI'21
%D 2021
%T Quo Vadis MPI RMA? Towards a More Efficient Use of MPI One-Sided Communication
%A Schuchart, Joseph
%A Niethammer, Christoph
%A Gracia, José
%A George Bosilca
%K Memory Handles
%K MPI
%K MPI-RMA
%K RDMA
%X The MPI standard has long included one-sided communication abstractions through the MPI Remote Memory Access (RMA) interface. Unfortunately, the MPI RMA chapter in the 4.0 version of the MPI standard still contains both well-known and lesser-known shortcomings for both implementations and users, which lead to potentially non-optimal usage patterns. In this paper, we identify a set of issues and propose ways for applications to better express anticipated usage of RMA routines, allowing the MPI implementation to better adapt to the application's needs. In order to increase the flexibility of the RMA interface, we add the capability to duplicate windows, allowing access to the same resources encapsulated by a window using different configurations. In the same vein, we introduce the concept of MPI memory handles, meant to provide life-time guarantees on memory attached to dynamic windows, removing the overhead currently present in using dynamically exposed memory. We will show that our extensions provide improved accumulate latencies, reduced overheads for multi-threaded flushes, and allow for zero overhead dynamic memory window usage.
%B EuroMPI'21
%C Garching, Munich, Germany
%G eng
%U https://arxiv.org/abs/2111.08142

%0 Book
%B ACS Symposium Series
%D 2021
%T Rare Earth Elements and Actinides: Progress in Computational Science Applications
%A Deborah A. Penchoff
%A Theresa L. Windus
%A Charles C. Peterson
%K actinides
%K HPC
%K REEs
%X Rare earth elements (REEs) and actinides are critical to electronics, communication, military applications, and green energy systems. They also play a large role in nuclear waste challenges with critical national importance. Actinides are still among some of the least studied elements in the periodic table, due to their short half-lives and radioactivity, which demand expert facilities for research. Computational modeling greatly aids in understanding REEs and actinides; however, electronic structure modeling of these elements presents limitations. High Performance Computing (HPC) has had a direct impact not only on technical advances and access to information on a global scale but also on investigations of REEs and actinides. This work discusses recent advances in molecular and data driven modeling that are essential to the study of REEs and actinides, effects of computational science in nuclear and radiochemical applications, and advances and challenges in the exascale era of supercomputing.
%B ACS Symposium Series
%I American Chemical Society
%C Washington, DC
%V 1388
%8 2021-10
%@ ISBN13: 9780841298255 eISBN: 9780841298248
%G eng
%U https://pubs.acs.org/doi/book/10.1021/bk-2021-1388
%R 10.1021/bk-2021-1388

%0 Book Section
%B Rare Earth Elements and Actinides: Progress in Computational Science Applications
%D 2021
%T Rare Earth Elements and Critical Materials: Uses and Availability
%A Deborah A. Penchoff
%A Charles B. Sims
%A Theresa L. Windus
%K critical materials
%K REE
%X Rare Earth Elements (REEs) are essential elements in critical materials for the economy and national security. This chapter explores REEs’ classifications, importance, abundance, resources and challenges, and the need for solutions and alternatives. REEs’ production, reserves, world markets, and net import reliance are discussed. Applications of REEs to clean energy, electronics, and defense are highlighted.
%B Rare Earth Elements and Actinides: Progress in Computational Science Applications
%I American Chemical Society
%C Washington, DC
%V 1388
%P 63-74
%8 2021-10
%@ ISBN13: 9780841298255 eISBN: 9780841298248
%G eng
%U https://pubs.acs.org/doi/10.1021/bk-2021-1388.ch003
%& 3
%R 10.1021/bk-2021-1388.ch003

%0 Journal Article
%J Int. J. of Networking and Computing
%D 2021
%T Resilient scheduling heuristics for rigid parallel jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B Int. J. of Networking and Computing
%V 11
%P 2-26
%G eng

%0 Conference Proceedings
%B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2021
%T Revisiting Credit Distribution Algorithms for Distributed Termination Detection
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Le Fèvre, Valentin
%A Robert, Yves
%A Jack Dongarra
%K control messages
%K credit distribution algorithms
%K task-based HPC application
%K Termination detection
%X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.
%B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%P 611–620
%G eng
%R 10.1109/IPDPSW52791.2021.00095

%0 Conference Proceedings
%B International Conference on Parallel Computing Technologies
%D 2021
%T Scalability Issues in FFT Computation
%A Alan Ayala
%A Stanimire Tomov
%A Stoyanov, Miroslav
%A Jack Dongarra
%K Hybrid systems
%K Parallel FFT
%K scalability
%X The fast Fourier transform (FFT) is one of the most important tools in mathematics, and it is widely required by several applications of science and engineering. State-of-the-art parallel implementations of the FFT algorithm, based on Cooley-Tukey developments, are known to be communication-bound, which causes critical issues when scaling the computational and architectural capabilities. In this paper, we study the main performance bottleneck of FFT computations on hybrid CPU and GPU systems at large-scale. We provide numerical simulations and potential acceleration techniques that can be easily integrated into FFT distributed libraries. We present different experiments on performance scalability and runtime analysis on the world’s most powerful supercomputers today: Summit, using up to 6,144 NVIDIA V100 GPUs, and Fugaku, using more than one million Fujitsu A64FX cores.
%B International Conference on Parallel Computing Technologies
%I Springer
%P 279–287
%@ 978-3-030-86359-3
%G eng
%R 10.1007/978-3-030-86359-3_21

%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2021
%T A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines
%A Abdelfattah, Ahmad
%A Costa, Timothy
%A Jack Dongarra
%A Mark Gates
%A Haidar, Azzam
%A Hammarling, Sven
%A Higham, Nicholas J
%A Kurzak, Jakub
%A Piotr Luszczek
%A Stanimire Tomov
%A others
%K Computations on matrices
%K Mathematical analysis
%K Mathematics of computing
%K Numerical analysis
%X This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software (TOMS)
%V 47
%P 1–23
%G eng
%R 10.1145/3431921

%0 Generic
%D 2021
%T SLATE Performance Improvements: QR and Eigenvalues
%A Kadir Akbudak
%A Paul Bagwell
%A Sebastien Cayrols
%A Mark Gates
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%8 2021-04
%G eng

%0 Generic
%D 2021
%T SLATE Port to AMD and Intel Platforms
%A Ahmad Abdelfattah
%A Mohammed Al Farhan
%A Cade Brown
%A Mark Gates
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%8 2021-04
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2021
%T A survey of numerical linear algebra methods utilizing mixed-precision arithmetic
%A Abdelfattah, Ahmad
%A Anzt, Hartwig
%A Boman, Erik G
%A Carson, Erin
%A Cojean, Terry
%A Jack Dongarra
%A Fox, Alyson
%A Mark Gates
%A Higham, Nicholas J
%A Li, Xiaoye S
%A others
%K GPUs
%K High-performance computing
%K linear algebra
%K Mixed-precision arithmetic
%K numerical mathematics
%X The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to scientific computing applications. Especially with the hardware integration of low-precision special-function units designed for machine learning applications, the traditional numerical algorithms community urgently needs to reconsider the floating point formats used in the distinct operations to efficiently leverage the available compute power. In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra problems.
%B The International Journal of High Performance Computing Applications
%V 35
%P 344–369
%G eng
%R 10.1177/10943420211003313

%0 Conference Proceedings
%B Proceedings of the ACM International Conference on Supercomputing
%D 2021
%T Task-graph scheduling extensions for efficient synchronization and communication
%A Bak, Seonmyeong
%A Hernandez, Oscar
%A Mark Gates
%A Piotr Luszczek
%A Sarkar, Vivek
%K Compilers
%K Computing methodologies
%K Parallel computing methodologies
%K Parallel programming languages
%K Runtime environments
%K Software and its engineering
%K Software notations and tools
%X Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
%B Proceedings of the ACM International Conference on Supercomputing
%P 88–101
%G eng
%R 10.1145/3447818.3461616

%0 Journal Article
%J Journal of Computational Science
%D 2021
%T Translational process: Mathematical software perspective
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%K communication avoiding algorithms
%K dataflow scheduling runtimes
%K hardware accelerators
%X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
%B Journal of Computational Science
%V 52
%P 101216
%G eng
%R 10.1016/j.jocs.2020.101216

%0 Generic
%D 2020
%T ASCR@40: Four Decades of Department of Energy Leadership in Advanced Scientific Computing Research
%A Bruce Hendrickson
%A Paul Messina
%A Buddy Bland
%A Jackie Chen
%A Phil Colella
%A Eli Dart
%A Jack Dongarra
%A Thom Dunning
%A Ian Foster
%A Richard Gerber
%A Rachel Harken
%A Wendy Huntoon
%A Bill Johnston
%A John Sarrao
%A Jeff Vetter
%I Advanced Scientific Computing Advisory Committee (ASCAC), US Department of Energy
%8 2020-08
%G eng
%U https://computing.llnl.gov/misc/ASCR@40-Highlights.pdf

%0 Generic
%D 2020
%T ASCR@40: Highlights and Impacts of ASCR’s Programs
%A Bruce Hendrickson
%A Paul Messina
%A Buddy Bland
%A Jackie Chen
%A Phil Colella
%A Eli Dart
%A Jack Dongarra
%A Thom Dunning
%A Ian Foster
%A Richard Gerber
%A Rachel Harken
%A Wendy Huntoon
%A Bill Johnston
%A John Sarrao
%A Jeff Vetter
%X The Office of Advanced Scientific Computing Research (ASCR) sits within the Office of Science in the Department of Energy (DOE). Per their web pages, “the mission of the ASCR program is to discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to the DOE.” This succinct statement encompasses a wide range of responsibilities for computing and networking facilities; for procuring, deploying, and operating high performance computing, networking, and storage resources; for basic research in mathematics and computer science; for developing and sustaining a large body of software; and for partnering with organizations across the Office of Science and beyond. While its mission statement may seem very contemporary, the roots of ASCR are quite deep—long predating the creation of DOE. Applied mathematics and advanced computing were both elements of the Theoretical Division of the Manhattan Project. In the early 1950s, the Manhattan Project scientist and mathematician John von Neumann, then a commissioner for the AEC (Atomic Energy Commission), advocated for the creation of a Mathematics program to support the continued development and applications of digital computing. Los Alamos National Laboratory (LANL) scientist John Pasta created such a program to fund researchers at universities and AEC laboratories. Under several organizational name changes, this program has persisted ever since, and would eventually grow to become ASCR.
%I US Department of Energy’s Office of Advanced Scientific Computing Research
%8 2020-06
%G eng
%U https://www.osti.gov/servlets/purl/1631812
%R https://doi.org/10.2172/1631812

%0 Generic
%D 2020
%T Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures
%A Florent Lopez
%A Edmond Chow
%A Stanimire Tomov
%A Jack Dongarra
%K Asynchronous iterative methods
%K Deep learning
%K GPU
%K multicore CPU
%K Stochastic Gradient Descent
%X We present a parallel asynchronous Stochastic Gradient Descent algorithm for shared memory architectures. Different from previous asynchronous algorithms, we consider the case where the gradient updates are not particularly sparse. In the context of the MagmaDNN framework, we compare the parallel efficiency of the asynchronous implementation with that of the traditional synchronous implementation. Tests are performed for training deep neural networks on multicore CPUs and GPU devices.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2020-03
%G eng

%0 Conference Paper
%B Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020)
%D 2020
%T Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures
%A Florent Lopez
%A Edmond Chow
%A Stanimire Tomov
%A Jack Dongarra
%B Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020)
%8 2020-05
%G eng

%0 Generic
%D 2020
%T CEED ECP Milestone Report: Improve Performance and Capabilities of CEED-Enabled ECP Applications on Summit/Sierra
%A Kolev, Tzanio
%A Fischer, Paul
%A Abdelfattah, Ahmad
%A Ananthan, Shreyas
%A Valeria Barra
%A Natalie Beams
%A Bleile, Ryan
%A Brown, Jed
%A Carson, Robert
%A Camier, Jean-Sylvain
%A Churchfield, Matthew
%A Dobrev, Veselin
%A Jack Dongarra
%A Dudouit, Yohann
%A Karakus, Ali
%A Kerkemeier, Stefan
%A Lan, YuHsiang
%A Medina, David
%A Merzari, Elia
%A Min, Misun
%A Parker, Scott
%A Ratnayaka, Thilina
%A Smith, Cameron
%A Sprague, Michael
%A Stitt, Thomas
%A Thompson, Jeremy
%A Tomboulides, Ananias
%A Stanimire Tomov
%A Tomov, Vladimir
%A Vargas, Arturo
%A Warburton, Tim
%A Weiss, Kenneth
%B ECP Milestone Reports
%I Zenodo
%8 2020-05
%G eng
%U https://doi.org/10.5281/zenodo.3860804
%R https://doi.org/10.5281/zenodo.3860804

%0 Generic
%D 2020
%T Clover: Computational Libraries Optimized via Exascale Research
%A Mark Gates
%A Stanimire Tomov
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Conference Paper
%B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2020
%T Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime
%A Yu Pei
%A Qinglei Cao
%A George Bosilca
%A Piotr Luszczek
%A Victor Eijkhout
%A Jack Dongarra
%X Stencil computation or general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2 indicate that we can achieve 2× speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over base-PaRSEC implementation on NaCL and Stampede2 respectively.
%B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C New Orleans, LA
%8 2020-05
%G eng
%R https://doi.org/10.1109/IPDPSW50202.2020.00127

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VI
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 667
%8 2020-06
%@ 978-3-030-50433-5
%G eng
%R https://doi.org/10.1007/978-3-030-50433-5

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part III
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 648
%8 2020-06
%@ 978-3-030-50420-5
%G eng
%R https://doi.org/10.1007/978-3-030-50420-5

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VII
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 775
%8 2020-06
%@ 978-3-030-50436-6
%G eng
%R https://doi.org/10.1007/978-3-030-50436-6

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part IV
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 668
%8 2020-06
%@ 978-3-030-50423-6
%G eng
%R https://doi.org/10.1007/978-3-030-50423-6

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 707
%8 2020-06
%@ 978-3-030-50371-0
%G eng
%R https://doi.org/10.1007/978-3-030-50371-0

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part V
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 618
%8 2020-06
%@ 978-3-030-50426-7
%G eng
%R https://doi.org/10.1007/978-3-030-50426-7

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part II
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 697
%8 2020-06
%@ 978-3-030-50417-5
%G eng
%R https://doi.org/10.1007/978-3-030-50417-5

%0 Conference Paper
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
%D 2020
%T DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
%A Bogdan Nicolae
%A Jiali Li
%A Justin M. Wozniak
%A George Bosilca
%A Matthieu Dorier
%A Franck Cappello
%X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
%I IEEE
%C Melbourne, VIC, Australia
%8 2020-05
%G eng
%R https://doi.org/10.1109/CCGrid49817.2020.00-76

%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Conference Paper
%B 2020 IEEE High Performance Extreme Computing Virtual Conference
%D 2020
%T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.
%B 2020 IEEE High Performance Extreme Computing Virtual Conference
%I IEEE
%8 2020-09
%G eng

%0 Generic
%D 2020
%T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%K AMD GPUs
%K GPU computing
%K HIP Runtime
%K HPC
%K numerical linear algebra
%K Portability
%X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-08
%G eng

%0 Conference Paper
%B Computer Modeling and Intelligent Systems CMIS-2020
%D 2020
%T Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK
%A Dmitry Zaitsev
%A Piotr Luszczek
%K docker containers
%K software containers
%X Platform as a Service (PaaS) cloud computing model becomes widespread implemented within Docker Containers. Docker uses operating system level virtualization to deliver software in packages called containers. Containers are isolated from one another and comprise all the required software, including operating system API, libraries and configuration files. With such advantageous integrity one can doubt on Docker performance. The present paper applies packet LAPACK, which is widely used for performance benchmarks of supercomputers, to collect and compare benchmarks of Docker on Linux Ubuntu and MS Windows platforms. After a brief overview of Docker and LAPACK, a series of Docker images containing LAPACK is created and run, abundant benchmarks obtained and represented in tabular and graphical form. From the final discussion, we conclude that Docker runs with nearly the same performance on both Linux and Windows platforms, the slowdown does not exceed some ten percent. Though Docker performance in Windows is essentially limited by the amount of RAM allocated to Docker Engine.
%B Computer Modeling and Intelligent Systems CMIS-2020
%C Zaporizhzhia
%8 2020-03
%G eng

%0 Generic
%D 2020
%T DTE: PaRSEC Enabled Libraries and Applications (Poster)
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T DTE: PaRSEC Systems and Interfaces (Poster)
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Conference Paper
%B 13th International Workshop on Parallel Tools for High Performance Computing
%D 2020
%T Effortless Monitoring of Arithmetic Intensity with PAPI's Counter Analysis Toolkit
%A Daniel Barry
%A Anthony Danalis
%A Heike Jagode
%X With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that heavily utilize numerical kernels. Performance metrics in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it becomes more and more unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity seems particularly true for events related to hardware that resides beyond the compute core, such as events related to memory traffic. Still, traffic to memory is a necessary characteristic for determining arithmetic intensity. To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for reading and writing bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels.
%B 13th International Workshop on Parallel Tools for High Performance Computing
%I Springer International Publishing
%C Dresden, Germany
%8 2020-09
%G eng

%0 Conference Paper
%B 49th International Conference on Parallel Processing (ICPP 2020)
%D 2020
%T Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms
%A Li Han
%A Yiqin Gao
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 49th International Conference on Parallel Processing (ICPP 2020)
%I ACM Press
%C Edmonton, AB, Canada
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2020
%T Evaluating Asynchronous Schwarz Solvers on GPUs
%A Pratik Nayak
%A Terry Cojean
%A Hartwig Anzt
%K abstract Schwarz methods
%K Asynchronous solvers
%K exascale
%K GPUs
%K multicore processors
%K parallel numerical linear algebra
%X With the commencement of the exascale computing era, we realize that the majority of the leadership supercomputers are heterogeneous and massively parallel. Even a single node can contain multiple co-processors such as GPUs and multiple CPU cores. For example, ORNL’s Summit accumulates six NVIDIA Tesla V100 GPUs and 42 IBM Power9 cores on each node. Synchronizing across compute resources of multiple nodes can be prohibitively expensive. Hence, it is necessary to develop and study asynchronous algorithms that circumvent this issue of bulk-synchronous computing. In this study, we examine the asynchronous version of the abstract Restricted Additive Schwarz method as a solver. We do not explicitly synchronize, but allow the communication between the sub-domains to be completely asynchronous, thereby removing the bulk synchronous nature of the algorithm. We accomplish this by using the one-sided Remote Memory Access (RMA) functions of the MPI standard. We study the benefits of using such an asynchronous solver over its synchronous counterpart. We also study the communication patterns governed by the partitioning and the overlap between the sub-domains on the global solver. Finally, we show that this concept can render attractive performance benefits over the synchronous counterparts even for a well-balanced problem.
%B International Journal of High Performance Computing Applications
%8 2020-08
%G eng
%R https://doi.org/10.1177/1094342020946814

%0 Conference Paper
%B 2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
%D 2020
%T Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations
%A Hartwig Anzt
%A Yuhsiang M. Tsai
%A Ahmad Abdelfattah
%A Terry Cojean
%A Jack Dongarra
%K Batched linear algebra
%K NVIDIA A100 GPU
%K sparse linear algebra
%K Sparse Matrix Vector Product
%X GPU accelerators have become an important backbone for scientific high-performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA’s newest server-line GPU, the A100 architecture, part of the Ampere generation. Specifically, we assess its performance for sparse and batch computations, as these routines are relied upon in many scientific applications, and compare to the p
%B 2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
%I IEEE
%8 2020-11
%G eng

%0 Generic
%D 2020
%T Exa-PAPI: The Exascale Performance API with Modern C++
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Conference Paper
%B Platform for Advanced Scientific Computing Conference (PASC20)
%D 2020
%T Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications
%A Qinglei Cao
%A Yu Pei
%A Kadir Akbudak
%A Aleksandr Mikhalev
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%X Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems via a Cholesky factorization of a symmetric positive-definite covariance matrix, a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications.
%B Platform for Advanced Scientific Computing Conference (PASC20)
%I ACM
%C Geneva, Switzerland
%8 2020-06
%G eng
%R https://doi.org/10.1145/3394277.3401846

%0 Journal Article
%J Future Generation Computer Systems
%D 2020
%T Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution
%A Nuria Losada
%A Patricia González
%A María J. Martín
%A George Bosilca
%A Aurelien Bouteiller
%A Keita Teranishi
%K Application-level checkpointing
%K MPI
%K resilience
%K ULFM
%X The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.
%B Future Generation Computer Systems
%V 106
%P 467-481
%8 2020-05
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167739X1930860X
%R https://doi.org/10.1016/j.future.2020.01.026

%0 Conference Paper
%B 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20)
%D 2020
%T FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks
%A Linnan Wang
%A Wei Wu
%A Junyu Zhang
%A Hang Liu
%A George Bosilca
%A Maurice Herlihy
%A Rodrigo Fonseca
%K FFT
%K Gradient Compression
%K Lossy Gradients
%K Machine Learning
%K Neural Networks
%X The performance and efficiency of distributed training of Deep Neural Networks (DNN) highly depend on the performance of gradient averaging among participating processes, a step bound by communication costs. There are two major approaches to reduce communication overhead: overlap communications with computations (lossless), or reduce communications (lossy). The lossless solution works well for linear neural architectures, e.g. VGG, AlexNet, but more recent networks such as ResNet and Inception limit the opportunity for such overlapping. Therefore, approaches that reduce the amount of data (lossy) become more suitable. In this paper, we present a novel, explainable lossy method that sparsifies gradients in the frequency domain, in addition to a new range-based floating-point representation to quantize and further compress gradients. These dynamic techniques strike a balance between compression ratio, accuracy, and computational overhead, and are optimized to maximize performance in heterogeneous environments. Unlike existing works that strive for a higher compression ratio, we stress the robustness of our methods, and provide guidance to recover accuracy from failures. To achieve this, we prove how the FFT sparsification affects the convergence and accuracy, and show that our method is guaranteed to converge using a diminishing θ in training. Reducing θ can also be used to recover accuracy from the failure. Compared to state-of-the-art lossy methods, e.g., QSGD, TernGrad, and Top-k sparsification, our approach incurs less approximation error and is therefore better in both wall-time and accuracy. On an 8-GPU, InfiniBand-interconnected cluster, our techniques accelerate AlexNet training by up to 2.26x relative to the no-compression baseline, 1.31x relative to QSGD, 1.25x relative to TernGrad, and 1.47x relative to Top-k sparsification.
%B 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20)
%I ACM
%C Stockholm, Sweden
%8 2020-06
%G eng
%R https://doi.org/10.1145/3369583.3392681

%0 Generic
%D 2020
%T FFT-ECP API and High-Performance Library Prototype for 2-D and 3-D FFTs on Large-Scale Heterogeneous Systems with GPUs
%A Stanimire Tomov
%A Alan Ayala
%A Azzam Haidar
%A Jack Dongarra
%B ECP Milestone Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-01
%G eng
%9 ECP WBS 2.3.3.13 Milestone Report

%0 Conference Paper
%B IEEE International Conference on Cluster Computing (Cluster 2020)
%D 2020
%T Flexible Data Redistribution in a Task-Based Runtime System
%A Qinglei Cao
%A George Bosilca
%A Wei Wu
%A Dong Zhong
%A Aurelien Bouteiller
%A Jack Dongarra
%X Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal to increase the efficiency and therefore decrease the time-to-solution for the algorithm. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Recently, task-based runtime systems have gained popularity as a potential candidate to address the programming complexity on the way to exascale. In addition to an increase in portability against complex hardware and software systems, task-based runtime systems have the potential to be able to more easily cope with less-regular data distribution, providing a more balanced computational load during the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems, which could support all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm, capable of dealing with redistribution problems without constraints of data distribution and data size and implement it in a task-based runtime system, PaRSEC. Performance results show great capability compared to ScaLAPACK, and applications highlight an increased efficiency with little overhead in terms of data distribution and data size.
%B IEEE International Conference on Cluster Computing (Cluster 2020)
%I IEEE
%C Kobe, Japan
%8 2020-09
%G eng
%R https://doi.org/10.1109/CLUSTER49012.2020.00032

%0 Generic
%D 2020
%T Formulation of Requirements for New PAPI++ Software Package: Part I: Survey Results
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%B PAPI++ Working Notes
%I Innovative Computing Laboratory, University of Tennessee Knoxville
%8 2020-01
%G eng

%0 Journal Article
%J Journal of Open Source Software
%D 2020
%T Ginkgo: A High Performance Numerical Linear Algebra Library
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%X Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures with a high level of performance portability and focuses on software sustainability. The library focuses on solving sparse linear systems and accommodates a large variety of matrix formats, state-of-the-art iterative (Krylov) solvers and preconditioners, which make the library suitable for a variety of scientific applications. Ginkgo supports many architectures such as multi-threaded CPU, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead. Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence there has been a significant amount of effort in this direction with software libraries such as UMFPACK (Davis, 2004) and CHOLMOD (Chen, Davis, Hager, & Rajamanickam, 2008) for solving linear systems with direct methods and PETSc (Balay et al., 2020), Trilinos (“The Trilinos Project Website,” 2020), Eigen (Guennebaud, Jacob, & others, 2010) and many more to solve linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide very efficient low level kernels optimized for different architectures and separate these kernels from the algorithms thereby ensuring extensibility and ease of use. Ginkgo is also a part of the xSDK effort (Bartlett et al., 2017) and available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for and interoperability between a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using High Performance Computing. Within this effort, we provide interoperability with application libraries such as deal.II (Arndt et al., 2019) and MFEM (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of the features of Ginkgo.
%B Journal of Open Source Software
%V 5
%8 2020-08
%G eng
%N 52
%R https://doi.org/10.21105/joss.02260

%0 Generic
%D 2020
%T Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster)
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Conference Paper
%B IEEE Cluster Conference
%D 2020
%T HAN: A Hierarchical AutotuNed Collective Communication Framework
%A Xi Luo
%A Wei Wu
%A George Bosilca
%A Yu Pei
%A Qinglei Cao
%A Thananon Patinyasakdikul
%A Dong Zhong
%A Jack Dongarra
%X High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy the increasing computational need, and this brings new challenges to the design of MPI libraries, especially with regard to collective operations. To address these challenges, we present "HAN," a new hierarchical autotuned collective communication framework in Open MPI, which selects suitable homogeneous collective communication modules as submodules for each hardware level, uses collective operations from the submodules as tasks, and organizes these tasks to perform efficient hierarchical collective operations. With a task-based design, HAN can easily swap out submodules, while keeping tasks intact, to adapt to new hardware. This makes HAN suitable for the current platform and provides a strong and flexible support for future HPC systems. To provide a fast and accurate autotuning mechanism, we present a novel cost model based on benchmarking the tasks instead of a whole collective operation. This method drastically reduces tuning time, as the cost of tasks can be reused across different message sizes, and is more accurate than existing cost models. Our cost analysis suggests the autotuning component can find the optimal configuration in most cases. The evaluation of the HAN framework suggests our design significantly improves the default Open MPI and achieves decent speedups against state-of-the-art MPI implementations on tested applications.
%B IEEE Cluster Conference
%I Best Paper Award, IEEE Computer Society Press
%C Kobe, Japan
%8 2020-09
%G eng

%0 Book Section
%B Fog Computing: Theory and Practice
%D 2020
%T Harnessing the Computing Continuum for Programming Our World
%A Pete Beckman
%A Jack Dongarra
%A Nicola Ferrier
%A Geoffrey Fox
%A Terry Moore
%A Dan Reed
%A Micah Beck
%X This chapter outlines a vision for how best to harness the computing continuum of interconnected sensors, actuators, instruments, and computing systems, from small numbers of very large devices to large numbers of very small devices. The hypothesis is that only via a continuum perspective one can intentionally specify desired continuum actions and effectively manage outcomes and systemic properties—adaptability and homeostasis, temporal constraints and deadlines—and elevate the discourse from device programming to intellectual goals and outcomes. Development of a framework for harnessing the computing continuum would catalyze new consumer services, business processes, social services, and scientific discovery. Realizing and implementing a continuum programming model requires balancing conflicting constraints and translating the high‐level specification into a form suitable for execution on a unifying abstract machine model. In turn, the abstract machine must implement the mapping of specification demands to end‐to‐end resources.
%B Fog Computing: Theory and Practice
%I John Wiley & Sons, Inc.
%@ 9781119551713
%G eng
%& 7
%R https://doi.org/10.1002/9781119551713.ch7

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2020)
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%K exascale
%K FFT
%K gpu
%K scalable algorithm
%X Exascale computing aspires to meet the increasing demands from large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; hence, it is not only important to develop well-designed software, but also to make it aware of the hardware architecture and efficiently exploit its power. Currently, many diverse applications, such as those that are part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of the heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from CPU state-of-the-art libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V100 GPUs.
%B International Conference on Computational Science (ICCS 2020)
%C Amsterdam, Netherlands
%8 2020-06
%G eng
%R https://doi.org/10.1007/978-3-030-50371-0_19

%0 Generic
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale (Poster)
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%X Considered one of the top 10 algorithms of the 20th century, the Fast Fourier Transform (FFT) is widely used by applications in science and engineering. Large scale parallel applications targeting exascale, such as those that are part of the DOE Exascale Computing Project (ECP), are designed for heterogeneous architectures and, currently, more than a dozen ECP applications use FFTs in their codes. To address the applications' needs, we developed the highly efficient FFTs for exascale (heFFTe) library. The heFFTe library release features very good weak and strong scalability and performance that is close to 90% of the roofline peak performance. We present these performance results on the Summit supercomputer. heFFTe is also integrated in a number of applications and we present how the overall performance gets improved by using heFFTe. Performance model, limitations, and challenges are discussed for current and upcoming computer architectures.
%I SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20)
%C Seattle, WA
%8 2020-02
%G eng

%0 Generic
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale (Poster)
%A Alan Ayala
%A Stanimire Tomov
%A Jack Dongarra
%A Azzam Haidar
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale (Poster)
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%I NVIDIA GPU Technology Conference (GTC2020)
%8 2020-10
%G eng

%0 Conference Paper
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%D 2020
%T High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs
%A Natalie Beams
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%A Tzanio Kolev
%A Yohann Dudouit
%K Batched linear algebra
%K finite elements
%K gpu
%K high-order methods
%K matrix-free FEM
%K Tensor contractions
%X We present new GPU implementations of the tensor contractions arising from basis-related computations for high-order finite element methods. We consider both tensor and non-tensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% for some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMMs implementation is twice as fast as what was previously available for basis function order greater than five and greater than approximately 10^5 degrees of freedom in the mesh; up to ten times speedup is seen for eighth-order basis functions.
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%I IEEE
%8 2020-11
%G eng

%0 Generic
%D 2020
%T hipMAGMA v1.0
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I Zenodo
%8 2020-03
%G eng
%U https://doi.org/10.5281/zenodo.3908549
%R 10.5281/zenodo.3908549

%0 Generic
%D 2020
%T hipMAGMA v2.0
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I Zenodo
%8 2020-07
%G eng
%U https://doi.org/10.5281/zenodo.3928667
%R 10.5281/zenodo.3928667

%0 Generic
%D 2020
%T How to Build Your Own Deep Neural Network
%A Kwai Wong
%A Stanimire Tomov
%A Daniel Nichols
%A Rocco Febbo
%A Florent Lopez
%A Julian Halloy
%A Xianfeng Ma
%K AI
%K Deep Neural Networks
%K dense linear algebra
%K HPC
%K ML
%I PEARC20
%8 2020-07
%G eng

%0 Conference Paper
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%D 2020
%T Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints
%A Li Han
%A Louis-Claude Canon
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%I IEEE Press
%C York, UK
%8 2020-02
%G eng

%0 Conference Paper
%B Smoky Mountains Computational Sciences & Engineering Conference (SMC2020)
%D 2020
%T Improving the Performance of the GMRES Method using Mixed-Precision Techniques
%A Neil Lindquist
%A Piotr Luszczek
%A Jack Dongarra
%K Kokkos
%K Krylov subspace methods
%K linear algebra
%K mixed precision
%X The GMRES method is used to solve sparse, non-symmetric systems of linear equations arising from many scientific applications. The solver performance within a single node is memory bound, due to the low arithmetic intensity of its computational kernels. To reduce the amount of data movement, and thus, to improve performance, we investigated the effect of using a mix of single and double precision while retaining double-precision accuracy. Previous efforts have explored reduced precision in the preconditioner, but the use of reduced precision in the solver itself has received limited attention. We found that GMRES only needs double precision in computing the residual and updating the approximate solution to achieve double-precision accuracy, although it must restart after each improvement of single-precision accuracy. This finding holds for the tested orthogonalization schemes: Modified Gram-Schmidt (MGS) and Classical Gram-Schmidt with Re-orthogonalization (CGSR). Furthermore, our mixed-precision GMRES, when restarted at least once, performed 19% and 24% faster on average than double-precision GMRES for MGS and CGSR, respectively. Our implementation uses generic programming techniques to ease the burden of coding implementations for different data types. Our use of the Kokkos library allowed us to exploit parallelism and optimize data management. Additionally, KokkosKernels was used when producing performance results. In conclusion, using a mix of single and double precision in GMRES can improve performance while retaining double-precision accuracy.
%B Smoky Mountains Computational Sciences & Engineering Conference (SMC2020)
%8 2020-08
%G eng

%0 Generic
%D 2020
%T Integrating Deep Learning in Domain Science at Exascale (MagmaDNN)
%A Stanimire Tomov
%A Kwai Wong
%A Jack Dongarra
%A Rick Archibald
%A Edmond Chow
%A Eduardo D'Azevedo
%A Markus Eisenbach
%A Rocco Febbo
%A Florent Lopez
%A Daniel Nichols
%A Junqi Yin
%X We will present some of the current challenges in the design and integration of deep learning AI with traditional HPC simulations. We evaluate existing packages for readiness to efficiently run deep learning models and applications on large-scale HPC systems, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated in MagmaDNN, an open source HPC deep learning framework. Many deep learning frameworks are targeted towards data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how these can be provided, e.g., as in MagmaDNN, through a deep integration with existing HPC libraries such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced and mixed-precision and asynchronous optimization methods. Finally, we present illustrations and potential solutions on enhancing traditional compute and data intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated on materials science, imaging, and climate applications.
%I DOD HPCMP seminar
%C virtual
%8 2020-12
%G eng

%0 Generic
%D 2020
%T Integrating Deep Learning in Domain Sciences at Exascale
%A Rick Archibald
%A Edmond Chow
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Markus Eisenbach
%A Rocco Febbo
%A Florent Lopez
%A Daniel Nichols
%A Stanimire Tomov
%A Kwai Wong
%A Junqi Yin
%X This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-08
%G eng

%0 Conference Paper
%B 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020)
%D 2020
%T Integrating Deep Learning in Domain Sciences at Exascale
%A Rick Archibald
%A Edmond Chow
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Markus Eisenbach
%A Rocco Febbo
%A Florent Lopez
%A Daniel Nichols
%A Stanimire Tomov
%A Kwai Wong
%A Junqi Yin
%X This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
%B 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020)
%8 2020-08
%G eng

%0 Book Section
%B Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC)
%D 2020
%T Interoperable Convergence of Storage, Networking, and Computation
%A Micah Beck
%A Terry Moore
%A Piotr Luszczek
%A Anthony Danalis
%E Kohei Arai
%E Rahul Bhatia
%K active networks
%K distributed cloud
%K distributed processing
%K distributed storage
%K edge computing
%K network convergence
%K network layering
%K scalability
%X In every form of digital store-and-forward communication, intermediate forwarding nodes are computers, with attendant memory and processing resources. This has inevitably stimulated efforts to create a wide-area infrastructure that goes beyond simple store-and-forward to create a platform that makes more general and varied use of the potential of this collection of increasingly powerful nodes. Historically, these efforts predate the advent of globally routed packet networking. The desire for a converged infrastructure of this kind has only intensified over the last 30 years, as memory, storage, and processing resources have increased in both density and speed while simultaneously decreasing in cost. Although there is a general consensus that it should be possible to define and deploy such a dramatically more capable wide-area platform, a great deal of investment in research prototypes has yet to produce a credible candidate architecture. Drawing on technical analysis, historical examples, and case studies, we present an argument for the hypothesis that in order to realize a distributed system with the kind of convergent generality and deployment scalability that might qualify as "future-defining," we must build it from a small set of simple, generic, and limited abstractions of the low level resources (processing, storage and network) of its intermediate nodes.
%B Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC)
%I Springer International Publishing
%P 667-690
%@ 978-3-030-12385-7
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2020)
%D 2020
%T Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices using GPUs
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X Half-precision computation refers to performing floating-point operations in a 16-bit format. While half-precision has been driven largely by machine learning applications, recent algorithmic advances in numerical linear algebra have discovered beneficial use cases for half precision in accelerating the solution of linear systems of equations at higher precisions. In this paper, we present a high-performance, mixed-precision linear solver (Ax = b) for symmetric positive definite systems in double-precision using graphics processing units (GPUs). The solver is based on a mixed-precision Cholesky factorization that utilizes the high-performance tensor core units in CUDA-enabled GPUs. Since the Cholesky factors are affected by the low precision, an iterative refinement (IR) solver is required to recover the solution back to double-precision accuracy. Two different types of IR solvers are discussed on a wide range of test matrices. A preprocessing step is also developed, which scales and shifts the matrix, if necessary, in order to preserve its positive-definiteness in lower precisions. Our experiments on the V100 GPU show that performance speedups are up to 4.7× against a direct double-precision solver. However, matrix properties such as the condition number and the eigenvalue distribution can affect the convergence rate, which would consequently affect the overall performance.
%B International Conference on Computational Science (ICCS 2020)
%I Springer, Cham
%C Amsterdam, Netherlands
%8 2020-06
%G eng
%R https://doi.org/10.1007/978-3-030-50417-5_18

%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2020
%T Load-Balancing Sparse Matrix Vector Product Kernels on GPUs
%A Hartwig Anzt
%A Terry Cojean
%A Chen Yen-Chen
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Stanimire Tomov
%A Yuhsiang M. Tsai
%A Weichung Wang
%X Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly ELLPACK (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the SuiteSparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
%B ACM Transactions on Parallel Computing
%V 7
%8 2020-03
%G eng
%N 1
%R https://doi.org/10.1145/3380930

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2020
%T MAGMA Templates for Scalable Linear Algebra on Emerging Architectures
%A Mohammed Al Farhan
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Mark Gates
%A Dalal Sukkari
%A Azzam Haidar
%A Robert Rosenberg
%A Jack Dongarra
%X With the acquisition and widespread use of more resources that rely on accelerator/wide vector–based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
%B The International Journal of High Performance Computing Applications
%V 34
%P 645-658
%8 2020-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342020938421

%0 Generic
%D 2020
%T MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
%A Stanimire Tomov
%I 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting
%C Seattle, WA
%8 2020-02
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2020
%T Matrix Multiplication on Batches of Small Matrices in Half and Half-Complex Precisions
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations, in particular general matrix–matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also very important for dense linear algebra algorithms, and half-precision GEMM operations can be used in mixed-precision linear solvers. Therefore, high-performance batched GEMM operations in reduced precision are significantly important, not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers. This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic. The paper addresses both real and complex half-precision computations on the GPU. The proposed design takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, performance speedups are observed against cuBLAS for sizes up to 128, and range between  and . For the complex FP16 GEMM kernel, the speedups are between  and  thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor’s solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
%B Journal of Parallel and Distributed Computing
%V 145
%P 188-201
%8 2020-11
%G eng
%R https://doi.org/10.1016/j.jpdc.2020.07.001

%0 Generic
%D 2020
%T Mixed Precision LU Factorization on GPU Tensor Cores: Reducing Data Movement and Memory Footprint
%A Florent Lopez
%A Theo Mary
%K High Performance Computing
%K lu factorization
%K mixed precision algorithms
%K numerical linear algebra
%K NVIDIA GPU
%K rounding error analysis
%K tensor cores
%X Modern GPUs equipped with mixed precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, previous works have focused solely on improving speed, neglecting memory consumption. Indeed, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision. This is explained by the fact that simply switching the storage precision from single to half leads to significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that compared with the state of the art, the proposed approach is of similar accuracy, up to twice as fast, and with only half the data movement and memory footprint.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-09
%G eng

%0 Journal Article
%J Proceedings of the Royal Society A
%D 2020
%T Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems
%A Azzam Haidar
%A Harun Bayraktar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%K GMRES
%K LU factorization
%K GPU computing
%K half precision arithmetic
%K iterative refinement
%K mixed precision solvers
%X Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
%B Proceedings of the Royal Society A
%V 476
%8 2020-11
%G eng
%N 2243
%R https://doi.org/10.1098/rspa.2020.0110

%0 Generic
%D 2020
%T Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing
%A Azzam Haidar
%A Harun Bayraktar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%X Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced- and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. We achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-05
%G eng

%0 Conference Paper
%B European Conference on Parallel Processing (Euro-Par 2020)
%D 2020
%T Multiprecision Block-Jacobi for Iterative Triangular Solves
%A Fritz Goebel
%A Hartwig Anzt
%A Terry Cojean
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K Block-Jacobi
%K graphics processing units (GPUs)
%K incomplete factorization preconditioning
%K multiprecision
%K sparse linear algebra
%X Recent research efforts have shown that Jacobi and block-Jacobi relaxation methods can be used as an effective and highly parallel approach for the solution of sparse triangular linear systems arising in the application of ILU-type preconditioners. Simultaneously, a few independent works have focused on designing efficient high performance adaptive-precision block-Jacobi preconditioning (block-diagonal scaling), in the context of the iterative solution of sparse linear systems, on manycore architectures. In this paper, we bridge the gap between relaxation methods based on regular splittings and preconditioners by demonstrating that iterative refinement can be leveraged to construct a relaxation method from the preconditioner. In addition, we exploit this insight to construct a highly-efficient sparse triangular system solver for graphics processors that combines iterative refinement with the block-Jacobi preconditioner available in the Ginkgo library.
%B European Conference on Parallel Processing (Euro-Par 2020)
%I Springer
%8 2020-08
%G eng
%R https://doi.org/10.1007/978-3-030-57675-2_34

%0 Journal Article
%J Philosophical Transactions of the Royal Society A
%D 2020
%T Numerical Algorithms for High-Performance Computational Science
%A Jack Dongarra
%A Laura Grigori
%A Nicholas J. Higham
%X A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multiple precisions of floating-point arithmetic, including half-precision. Moreover, as well as maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers.
%B Philosophical Transactions of the Royal Society A
%V 378
%G eng
%N 2166
%R https://doi.org/10.1098/rsta.2019.0066

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2020
%T Overhead of Using Spare Nodes
%A Atsushi Hori
%A Kazumi Yoshinaga
%A Thomas Herault
%A Aurelien Bouteiller
%A George Bosilca
%A Yutaka Ishikawa
%K communication performance
%K fault mitigation
%K Fault tolerance
%K sliding method
%K spare node
%X With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed around 10 times depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an Infiniband network connected with a FatTree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.
%B The International Journal of High Performance Computing Applications
%8 2020-02
%G eng
%U https://journals.sagepub.com/doi/10.1177/1094342020901885
%! The International Journal of High Performance Computing Applications
%R https://doi.org/10.1177/1094342020901885

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part I
%A Roman Wyrzykowski
%A Ewa Deelman
%A Jack Dongarra
%A Konrad Karczewski
%B Lecture Notes in Computer Science
%7 1
%I Springer International Publishing
%P 581
%8 2020-03
%@ 978-3-030-43229-4
%G eng
%R https://doi.org/10.1007/978-3-030-43229-4

%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part II
%A Roman Wyrzykowski
%A Ewa Deelman
%A Jack Dongarra
%A Konrad Karczewski
%B Lecture Notes in Computer Science
%I Springer International Publishing
%P 503
%8 2020-03
%@ 978-3-030-43222-5
%G eng
%R https://doi.org/10.1007/978-3-030-43222-5

%0 Generic
%D 2020
%T Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX) (Poster)
%A Jack Dongarra
%A Heike Jagode
%A Anthony Danalis
%A Daniel Barry
%A Vince Weaver
%I 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting
%C Seattle, WA
%8 2020-02
%G eng

%0 Generic
%D 2020
%T Performance Tuning SLATE
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-01
%G eng

%0 Generic
%D 2020
%T The PLASMA Library on CORAL Systems and Beyond (Poster)
%A Piotr Luszczek
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Conference Paper
%B 2020 IEEE International Conference on Cluster Computing (CLUSTER)
%D 2020
%T Predicting MPI Collective Communication Performance Using Machine Learning
%A Sascha Hunold
%A Abhinav Bhatele
%A George Bosilca
%A Peter Knees
%K Auto-tuning
%K GAM
%K KNN
%K Machine Learning
%K message passing interface
%K Performance Prediction
%K XGBoost
%X The Message Passing Interface (MPI) defines the semantics of data communication operations, while the implementing libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may be dependent on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given communication problem when being called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to select the best possible algorithm for a specific case automatically. In this paper, we address the algorithm selection problem for MPI collective communication operations. To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem. We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin.
%B 2020 IEEE International Conference on Cluster Computing (CLUSTER)
%I IEEE
%C Kobe, Japan
%8 2020-09
%G eng
%R https://doi.org/10.1109/CLUSTER49012.2020.00036

%0 Journal Article
%J The Journal of Computational Science Education
%D 2020
%T Project-Based Research and Training in High Performance Data Sciences, Data Analytics, and Machine Learning
%A Kwai Wong
%A Stanimire Tomov
%A Jack Dongarra
%B The Journal of Computational Science Education
%V 11
%P 36-44
%8 2020-01
%G eng
%U http://www.jocse.org/articles/11/1/7/
%N 1
%! JOCSE
%R https://doi.org/10.22369/issn.2153-4136/11/1/7

%0 Generic
%D 2020
%T Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC)
%A James Demmel
%A Jack Dongarra
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Michael Mahoney
%X The convergence of several unprecedented changes, including formidable new system design constraints and revolutionary levels of heterogeneity, has made it clear that much of the essential software infrastructure of computational science and engineering is, or will soon be, obsolete. Math libraries have historically been in the vanguard of software that must be adapted first to such changes, both because these low-level workhorses are so critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Under the Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC) project, the principal designers of the Linear Algebra PACKage (LAPACK) and the Scalable Linear Algebra PACKage (ScaLAPACK), the combination of which is abbreviated Sca/LAPACK, aim to enhance and update these libraries for the ongoing revolution in processor architecture, system design, and application requirements by incorporating them into a layered package of software components—the BALLISTIC ecosystem—that provides users seamless access to state-of-the-art solver implementations through familiar and improved Sca/LAPACK interfaces.
%B LAPACK Working Notes
%I University of Tennessee
%8 2020-07
%G eng

%0 Generic
%D 2020
%T PULSE: PAPI Unifying Layer for Software-Defined Events (Poster)
%A Heike Jagode
%A Anthony Danalis
%I 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting
%C Seattle, WA
%8 2020-02
%G eng

%0 Generic
%D 2020
%T Redesigning PAPI's High-Level API
%A Frank Winkler
%X PAPI (Performance Application Programming Interface) provides a portable and efficient API to access the hardware performance counters found on modern microprocessors. With the introduction of Component PAPI, or PAPI-C, in early 2010, PAPI has extended its reach beyond the CPU and can now monitor system information across a range of components from CPUs to network cards, graphics accelerator cards, parallel file systems, and more. To collect performance events, PAPI provides two APIs: the low-level and the high-level API. The legacy high-level API was designed for simplicity, but could only handle preset CPU events. To access events from all installed components, the programmer had to use the low-level API. This paper introduces a new high-level API that enables the measurement of both preset and native events. It is intended for programmers who want to perform simple event measurements with minimal code instrumentation.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-02
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2020
%T Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD
%A Yuechao Lu
%A Ichitaro Yamazaki
%A Fumihiko Ino
%A Yasuyuki Matsushita
%A Stanimire Tomov
%A Jack Dongarra
%K Divide and conquer
%K gpu
%K out-of-core computation
%K Singular value decomposition
%B Concurrency and Computation: Practice and Experience
%8 2020-04
%G eng
%R https://doi.org/10.1002/cpe.5754

%0 Conference Paper
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%D 2020
%T Replacing Pivoting in Distributed Gaussian Elimination with Randomized Techniques
%A Neil Lindquist
%A Piotr Luszczek
%A Jack Dongarra
%K linear systems
%K Randomized algorithms
%X Gaussian elimination is a key technique for solving dense, non-symmetric systems of linear equations. Pivoting is used to ensure numerical stability but can introduce significant overheads. We propose replacing pivoting with recursive butterfly transforms (RBTs) and iterative refinement. RBTs use an FFT-like structure and randomized elements to provide an efficient, two-sided preconditioner for factoring. This approach was implemented and tested using Software for Linear Algebra Targeting Exascale (SLATE). In numerical experiments, our implementation was more robust than Gaussian elimination with no pivoting (GENP) but failed to solve all the problems solvable with Gaussian elimination with partial pivoting (GEPP). Furthermore, the proposed solver was able to outperform GEPP when distributed on GPU-accelerated nodes.
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%I IEEE
%C Atlanta, GA
%8 2020-11
%G eng

%0 Generic
%D 2020
%T A Report of the MPI International Survey (Poster)
%A Atsushi Hori
%A Takahiro Ogura
%A Balazs Gerofi
%A Jie Yin
%A Yutaka Ishikawa
%A Emmanuel Jeannot
%A George Bosilca
%I EuroMPI/USA '20: 27th European MPI Users' Group Meeting
%C Austin, TX
%8 2020-09
%G eng

%0 Generic
%D 2020
%T Report on the Fujitsu Fugaku System
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-06
%G eng

%0 Conference Paper
%B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020)
%D 2020
%T Reservation and Checkpointing Strategies for Stochastic Jobs
%A Ana Gainaru
%A Brice Goglin
%A Valentin Honoré
%A Guillaume Pallez
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Revisiting Dynamic DAG Scheduling under Memory Constraints for Shared-Memory Platforms
%A Gabriel Bathie
%A Loris Marchal
%A Yves Robert
%A Samuel Thibault
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Generic
%D 2020
%T Roadmap for Refactoring Classic PAPI to PAPI++: Part II: Formulation of Roadmap Based on Survey Results
%A Heike Jagode
%A Anthony Danalis
%A Damien Genet
%B PAPI++ Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-07
%G eng

%0 Conference Paper
%B 49th International Conference on Parallel Processing (ICPP 2020)
%D 2020
%T Robustness of the Young/Daly Formula for Stochastic Iterative Applications
%A Yishu Du
%A Loris Marchal
%A Guillaume Pallez
%A Yves Robert
%B 49th International Conference on Parallel Processing (ICPP 2020)
%I ACM Press
%C Edmonton, AB, Canada
%8 2020-08
%G eng

%0 Conference Paper
%B 2020 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2020
%T Scalable Data Generation for Evaluating Mixed-Precision Solvers
%A Piotr Luszczek
%A Yaohung Tsai
%A Neil Lindquist
%A Hartwig Anzt
%A Jack Dongarra
%X We present techniques for generating data for mixed precision solvers that allow testing those solvers in a scalable manner. Our techniques focus on mixed precision hardware and software where both the solver and the hardware can take advantage of mixing multiple floating precision formats. This allows taking advantage of the recently released generation of hardware platforms that focus on ML and DNN workloads but can also be utilized for HPC applications if a new breed of algorithms is combined with the custom floating-point formats to deliver performance levels beyond the standard IEEE data types while delivering a comparable accuracy of the results.
%B 2020 IEEE High Performance Extreme Computing Conference (HPEC)
%I IEEE
%C Waltham, MA, USA
%8 2020-09
%G eng
%R https://doi.org/10.1109/HPEC43674.2020.9286145

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2020
%T A Set of Batched Basic Linear Algebra Subprograms
%A Ahmad Abdelfattah
%A Timothy Costa
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Mawussi Zounon
%X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software
%8 2020-10
%G eng

%0 Generic
%D 2020
%T SLATE Performance Report: Updates to Cholesky and LU Factorizations
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Mark Gates
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-10
%G eng

%0 Generic
%D 2020
%T SLATE: Software for Linear Algebra Targeting Exascale (POSTER)
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T SLATE Tutorial
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Ali Charara
%A Jamie Finney
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Jack Dongarra
%I 2020 ECP Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T SLATE Users' Guide
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes

%0 Conference Paper
%B ISC High Performance
%D 2020
%T Sparse Linear Algebra on AMD and NVIDIA GPUs—The Race is On
%A Yuhsiang M. Tsai
%A Terry Cojean
%A Hartwig Anzt
%K AMD
%K GPUs
%K nVidia
%K sparse matrix vector product (SpMV)
%X Efficiently processing sparse matrices is a central and performance-critical part of many scientific simulation codes. Recognizing the adoption of manycore accelerators in HPC, we evaluate in this paper the performance of the currently best sparse matrix-vector product (SpMV) implementations on high-end GPUs from AMD and NVIDIA. Specifically, we optimize SpMV kernels for the CSR, COO, ELL, and HYB format taking the hardware characteristics of the latest GPU technologies into account. We compare for 2,800 test matrices the performance of our kernels against AMD’s hipSPARSE library and NVIDIA’s cuSPARSE library, and ultimately assess how the GPU technologies from AMD and NVIDIA compare in terms of SpMV performance.
%B ISC High Performance
%I Springer
%8 2020-06
%G eng
%R https://doi.org/10.1007/978-3-030-50743-5_16

%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Ulrike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes

%0 Conference Paper
%B International Conference for High Performance Computing Networking, Storage, and Analysis (SC20)
%D 2020
%T Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance
%A Elliott Slaughter
%A Wei Wu
%A Yuankun Fu
%A Legend Brandenburg
%A Nicolai Garcia
%A Wilhem Kautz
%A Emily Marx
%A Kaleb S. Morris
%A Qinglei Cao
%A George Bosilca
%A Seema Mirchandaney
%A Wonchan Lee
%A Sean Treichler
%A Patrick McCormick
%A Alex Aiken
%X We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100μs-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system's scalability, ability to hide communication and mitigate load imbalance.
%B International Conference for High Performance Computing Networking, Storage, and Analysis (SC20)
%I ACM
%8 2020-11
%G eng
%U https://dl.acm.org/doi/10.5555/3433701.3433783

%0 Conference Paper
%B 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)
%D 2020
%T The Template Task Graph (TTG) - An Emerging Practical Dataflow Programming Paradigm for Scientific Simulation at Extreme Scale
%A George Bosilca
%A Robert Harrison
%A Thomas Herault
%A Mohammad Mahdi Javanmard
%A Poornima Nookala
%A Edward Valeev
%K dag
%K dataflow
%K exascale
%K graph
%K High-performance computing
%K workflow
%X We describe TESSE, an emerging general-purpose, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. TESSE builds upon and extends the PaRSEC DAG/dataflow runtime with new Domain Specific Languages (DSL) and new integration capabilities. Motivating this work is our belief that such a dataflow model, perhaps with applications composed in domain specific languages, can overcome many of the challenges faced by a wide variety of irregular applications that are poorly served by current programming and execution models. Two such applications from many-body physics and applied mathematics are briefly explored. This paper focuses upon the Template Task Graph (TTG), which is TESSE's main C++ API that provides a powerful work/data-flow programming model. Algorithms on spatial trees, block-sparse tensors, and wave fronts are used to illustrate the API and associated concepts, as well as to compare with related approaches.
%B 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)
%I IEEE
%8 2020-11
%G eng
%R https://doi.org/10.1109/ESPM251964.2020.00011

%0 Generic
%D 2020
%T Translational Process: Mathematical Software Perspective
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%K communication avoiding algorithms
%K data flow scheduling runtimes
%K hardware accelerators
%X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
%B Innovative Computing Laboratory Technical Report
%8 2020-08
%G eng

%0 Journal Article
%J Journal of Computational Science
%D 2020
%T Translational Process: Mathematical Software Perspective
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%K communication avoiding algorithms
%K data flow scheduling runtimes
%K hardware accelerators
%X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
%B Journal of Computational Science
%8 2020-09
%G eng
%R https://doi.org/10.1016/j.jocs.2020.101216

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2020)
%D 2020
%T Twenty Years of Computational Science
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%B International Conference on Computational Science (ICCS 2020)
%C Amsterdam, Netherlands
%8 2020-06
%G eng

%0 Conference Paper
%B EuroMPI/USA '20: 27th European MPI Users' Group Meeting
%D 2020
%T Using Advanced Vector Extensions AVX-512 for MPI Reduction
%A Dong Zhong
%A Qinglei Cao
%A George Bosilca
%A Jack Dongarra
%K Instruction level parallelism
%K Intel AVX2/AVX-512
%K Long vector extension
%K MPI reduction operation
%K Single instruction multiple data
%K Vector operation
%X As the scale of high-performance computing (HPC) systems continues to grow, researchers have devoted themselves to exploring increasing levels of parallelism to achieve optimal performance. The modern CPU’s design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms’ efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become of critical importance to increase efficiency and close the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operation and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefit the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments are conducted on an Intel Xeon Gold cluster, which shows our AVX-512 optimized reduction operations achieve 10X performance benefits over the Open MPI default for MPI local reduction.
%B EuroMPI/USA '20: 27th European MPI Users' Group Meeting
%C Austin, TX
%8 2020-09
%G eng
%R https://doi.org/10.1145/3416315.3416316

%0 Generic
%D 2020
%T Using Advanced Vector Extensions AVX-512 for MPI Reduction (Poster)
%A Dong Zhong
%A George Bosilca
%A Qinglei Cao
%A Jack Dongarra
%I EuroMPI/USA '20: 27th European MPI Users' Group Meeting
%C Austin, TX
%8 2020-09
%G eng

%0 Conference Paper
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020)
%D 2020
%T Using Arm Scalable Vector Extension to Optimize Open MPI
%A Dong Zhong
%A Pavel Shamis
%A Qinglei Cao
%A George Bosilca
%A Jack Dongarra
%K ARMIE
%K datatype pack and unpack
%K local reduction
%K non-contiguous accesses
%K SVE
%K Vector Length Agnostic
%X As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be employed to achieve optimal performance. As recent processors support wide vector extensions, vectorization becomes much more important to exploit the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce Scalable Vector Extension (SVE) - an optional separate architectural extension with a new set of A64 instruction encodings, which enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in Arm SVE vector Instruction Set Architecture (ISA); and utilize those instructions to improve the memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism for a single node, but also achieve a more efficient communication scheme for message exchange. The resulting efforts have been implemented in the context of Open MPI, providing efficient and scalable capabilities of SVE usage and extending the possible implementations of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios with both simulator and Fujitsu’s A64FX processor demonstrates that the solution is at the same time generic and efficient.
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020)
%I IEEE/ACM
%C Melbourne, Australia
%8 2020-05
%G eng
%R https://doi.org/10.1109/CCGrid49817.2020.00-71

%0 Generic
%D 2020
%T Using Quantized Integer in LU Factorization with Partial Pivoting (Poster)
%A Yaohung Tsai
%A Piotr Luszczek
%A Jack Dongarra
%X Quantization is a common technique to speed up deep learning inference. It uses integers with a shared scalar to represent a set of equally spaced numbers. The quantized integer method has shown great success in compressing deep learning models, reducing the computation cost without losing too much accuracy. New application-specific hardware and specialized CPU extension instructions like Intel AVX-512 VNNI provide capabilities to do integer MADD (multiply and add) efficiently. In this poster, we show our preliminary results of using quantized integers for LU factorization with partial pivoting. Using Int32, the backward error can outperform single precision. However, quantized integers have a similar issue of limited range as FP16: the method would not work directly for large matrices because big numbers would occur in the factored U. We will show some possible solutions and how we would like to apply this quantized integer technique to other numerical linear algebra applications.
%I SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20)
%C Seattle, WA
%8 2020-02
%G eng

%0 Generic
%D 2020
%T xSDK4ECP: Extreme-scale Scientific Software Development Kit for ECP (Poster)
%A Roscoe Bartlett
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T Adaptive Precision in Block-Jacobi Preconditioning for Iterative Sparse Linear System Solvers
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Nicholas J. Higham
%A Enrique S. Quintana-Orti
%K adaptive precision
%K block-Jacobi preconditioning
%K communication reduction
%K energy efficiency
%K Krylov subspace methods
%K sparse linear systems
%X We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.
%B Concurrency and Computation: Practice and Experience
%V 31
%P e4460
%8 2019-03
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4460
%R https://doi.org/10.1002/cpe.4460

%0 Journal Article
%J Parallel Computing
%D 2019
%T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Marc Baboulin
%A Joël Falcou
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K HPC
%K Matrix-matrix product
%K optimization
%K Small matrices
%X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
%B Parallel Computing
%V 81
%P 1–21
%8 2019-01
%G eng
%R https://doi.org/10.1016/j.parco.2018.10.003

%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%D 2019
%T Approximate and Exact Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%X We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for - and exploiting the characteristics of - "pleasant" data distributions. At the same time, as the SampleSelect does not work on the actual values but the ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. Additionally to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPSW.2019.00088

%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2019
%T Are we Doing the Right Thing? – A Critical Analysis of the Academic HPC Community
%A Hartwig Anzt
%A Goran Flegar
%X Like in any other research field, academically surviving in the High Performance Computing (HPC) community generally requires publishing papers, in the best case many of them and in high-ranked journals or at top-tier conferences. As a result, the number of scientific papers published each year in this relatively small community easily outnumbers what a single researcher can read. At the same time, many of the proposed and analyzed strategies, algorithms, and hardware-optimized implementations never make it beyond the prototype stage, as they are abandoned once they served the single purpose of yielding (another) publication. In a time and field where high-quality manpower is a scarce resource, this is extremely inefficient. In this position paper we promote a radical paradigm shift towards accepting high-quality software patches to community software packages as legitimate conference contributions. In consequence, the reputation and appointability of researchers is no longer based on the classical scientific metrics, but on the quality and documentation of open source software contributions - effectively improving and accelerating the collaborative development of community software.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPSW.2019.00122

%0 Conference Paper
%B Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19)
%D 2019
%T Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications
%A Nuria Losada
%A Aurelien Bouteiller
%A George Bosilca
%K checkpoint/restart
%K Fault tolerance
%K Message logging
%K MPI
%K ULFM
%K User Level Fault Mitigation
%X With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solved both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
%B Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19)
%8 2019-11
%G eng
%U https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_ftxs103s2-file1.pdf

%0 Generic
%D 2019
%T BDEC2 Platform White Paper
%A Todd Gamblin
%A Pete Beckman
%A Kate Keahey
%A Kento Sato
%A Masaaki Kondo
%A Balazs Gerofi
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-09
%G eng

%0 Generic
%D 2019
%T CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Valeria Barra
%A Natalie Beams
%A Jed Brown
%A Jean-Sylvain Camier
%A Veselin Dobrev
%A Jack Dongarra
%A Yohann Dudouit
%A Paul Fischer
%A Ali Karakus
%A Stefan Kerkemeier
%A Tzanio Kolev
%A YuHsiang Lan
%A Elia Merzari
%A Misun Min
%A Aleks Obabko
%A Scott Parker
%A Thilina Ratnayaka
%A Jeremy Thompson
%A Ananias Tomboulides
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-10
%G eng
%R https://doi.org/10.5281/zenodo.3477618

%0 Generic
%D 2019
%T CEED ECP Milestone Report: Public release of CEED 2.0
%A Jed Brown
%A Ahmad Abdelfattah
%A Valeria Barra
%A Veselin Dobrev
%A Yohann Dudouit
%A Paul Fischer
%A Tzanio Kolev
%A David Medina
%A Misun Min
%A Thilina Ratnayaka
%A Cameron Smith
%A Jeremy Thompson
%A Stanimire Tomov
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-04
%G eng
%U https://doi.org/10.5281/zenodo.2641316
%R https://doi.org/10.5281/zenodo.2641316

%0 Conference Paper
%B 2019 International Conference on Parallel Computing (ParCo2019)
%D 2019
%T Characterization of Power Usage and Performance in Data-Intensive Applications using MapReduce over MPI
%A Joshua Davis
%A Tao Gao
%A Sunita Chandrasekaran
%A Heike Jagode
%A Anthony Danalis
%A Pavan Balaji
%A Jack Dongarra
%A Michela Taufer
%B 2019 International Conference on Parallel Computing (ParCo2019)
%C Prague, Czech Republic
%8 2019-09
%G eng

%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Checkpointing Strategies for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X Input/output (I/O) from various sources often contend for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
%B International Journal of Networking and Computing
%V 9
%P 28–52
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/195

%0 Generic
%D 2019
%T A Collection of Presentations from the BDEC2 Workshop in Kobe, Japan
%A Rosa M. Badia
%A Micah Beck
%A François Bodin
%A Taisuke Boku
%A Franck Cappello
%A Alok Choudhary
%A Carlos Costa
%A Ewa Deelman
%A Nicola Ferrier
%A Katsuki Fujisawa
%A Kohei Fujita
%A Maria Girone
%A Geoffrey Fox
%A Shantenu Jha
%A Yoshinari Kameda
%A Christian Kniep
%A William Kramer
%A James Lin
%A Kengo Nakajima
%A Yiwei Qiu
%A Kishore Ramachandran
%A Glenn Ricart
%A Kim Serradell
%A Dan Stanzione
%A Lin Gan
%A Martin Swany
%A Christine Sweeney
%A Alex Szalay
%A Christine Kirkpatrick
%A Kenton McHenry
%A Alainna White
%A Steve Tuecke
%A Ian Foster
%A Joe Mambretti
%A William M. Tang
%A Michela Taufer
%A Miguel Vázquez
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2019-02
%G eng

%0 Generic
%D 2019
%T A Collection of White Papers from the BDEC2 Workshop in Poznan, Poland
%A Gabriel Antoniu
%A Alexandru Costan
%A Ovidiu Marcu
%A Maria S. Pérez
%A Nenad Stojanovic
%A Rosa M. Badia
%A Miguel Vázquez
%A Sergi Girona
%A Micah Beck
%A Terry Moore
%A Piotr Luszczek
%A Ezra Kissel
%A Martin Swany
%A Geoffrey Fox
%A Vibhatha Abeykoon
%A Selahattin Akkas
%A Kannan Govindarajan
%A Gurhan Gunduz
%A Supun Kamburugamuve
%A Niranda Perera
%A Ahmet Uyar
%A Pulasthi Wickramasinghe
%A Chathura Widanage
%A Maria Girone
%A Toshihiro Hanawa
%A Richard Moreno
%A Ariel Oleksiak
%A Martin Swany
%A Ryousei Takano
%A M.P. van Haarlem
%A J. van Leeuwen
%A J.B.R. Oonk
%A T. Shimwell
%A L.V.E. Koopmans
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2019-05
%G eng

%0 Generic
%D 2019
%T A Collection of White Papers from the BDEC2 Workshop in San Diego, CA
%A Ilkay Altintas
%A Kyle Marcus
%A Volkan Vural
%A Shweta Purawat
%A Daniel Crawl
%A Gabriel Antoniu
%A Alexandru Costan
%A Ovidiu Marcu
%A Prasanna Balaprakash
%A Rongqiang Cao
%A Yangang Wang
%A Franck Cappello
%A Robert Underwood
%A Sheng Di
%A Justin M. Wozniak
%A Jon C. Calhoun
%A Cong Xu
%A Antonio Lain
%A Paolo Faraboschi
%A Nic Dube
%A Dejan Milojicic
%A Balazs Gerofi
%A Maria Girone
%A Viktor Khristenko
%A Tony Hey
%A Ezra Kissel
%A Yu Liu
%A Richard Loft
%A Pekka Manninen
%A Sebastian von Alfthan
%A Takemasa Miyoshi
%A Bruno Raffin
%A Olivier Richard
%A Denis Trystram
%A Maryam Rahnemoonfar
%A Robin Murphy
%A Joel Saltz
%A Kentaro Sano
%A Rupak Roy
%A Kento Sato
%A Jian Guo
%A Jens Domke
%A Weikuan Yu
%A Takaki Hatsui
%A Yasumasa Joti
%A Alex Szalay
%A William M. Tang
%A Michael R. Wyatt II
%A Michela Taufer
%A Todd Gamblin
%A Stephen Herbein
%A Adam Moody
%A Dong H. Ahn
%A Rich Wolski
%A Chandra Krintz
%A Fatih Bakir
%A Wei-tsung Lin
%A Gareth George
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-10
%G eng

%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Florina M. Ciorba
%A Valentin Le Fèvre
%A Yves Robert
%K checkpoint
%K fail-stop error
%K silent error
%K HPC
%K linear workflow
%K Replication
%X Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques, lead to improved performance.
%B International Journal of Networking and Computing
%V 9
%P 2-27
%8 2019
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/194

%0 Journal Article
%J Parallel Computing
%D 2019
%T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms
%A Valentin Le Fèvre
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Atsushi Hori
%A George Bosilca
%A Jack Dongarra
%B Parallel Computing
%V 85
%P 1–12
%8 2019-07
%G eng
%R https://doi.org/10.1016/j.parco.2019.02.002

%0 Journal Article
%J Algorithmica
%D 2019
%T Computing Dense Tensor Decompositions with Optimal Dimension Trees
%A Oguz Kaya
%A Yves Robert
%K CP decomposition
%K Dimension tree
%K Tensor computations
%K Tucker decomposition
%B Algorithmica
%V 81
%P 2092–2121
%8 2019-05
%G eng
%N 5
%R https://doi.org/10.1007/s00453-018-0525-3

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%K cache partitioning
%K chip multiprocessor
%K co-scheduling
%K HPC application
%X With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units accessing a global shared memory is constantly increasing. Co-scheduling techniques are used to improve application throughput on such architectures, but sharing resources often generates critical interferences. In this article, we focus on the interferences in the last level of cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) How to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) How many cores and cache fractions should be assigned to each application to maximize the platform efficiency? Here, platform efficiency is defined as maximizing the performance either globally, or as guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms.
%B International Journal of High Performance Computing Applications
%V 33
%P 1221-1239
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019846956

%0 Conference Paper
%B 11th International Workshop on Parallel Tools for High Performance Computing
%D 2019
%T Counter Inspection Toolkit: Making Sense out of Hardware Performance Events
%A Anthony Danalis
%A Heike Jagode
%A H Hanumantharayappa
%A Sangamesh Ragate
%A Jack Dongarra
%X Hardware counters play an essential role in understanding the behavior of performance-critical applications, and inform any effort to identify opportunities for performance optimization. However, because modern hardware is becoming increasingly complex, the number of counters that are offered by the vendors increases and, in some cases, so does their complexity. In this paper we present a toolkit that aims to assist application developers invested in performance analysis by automatically categorizing and disambiguating performance counters. We present and discuss the set of microbenchmarks and analyses that we developed as part of our toolkit. We explain why they work and discuss the non-obvious reasons why some of our early benchmarks and analyses did not work in an effort to share with the rest of the community the wisdom we acquired from negative results.
%B 11th International Workshop on Parallel Tools for High Performance Computing
%I Cham, Switzerland: Springer
%C Dresden, Germany
%8 2019-02
%G eng
%R https://doi.org/10.1007/978-3-030-11987-4_2

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T A Customized Precision Format Based on Mantissa Segmentation for Accelerating Sparse Linear Algebra
%A Thomas Gruetzmacher
%A Terry Cojean
%A Goran Flegar
%A Fritz Göbel
%A Hartwig Anzt
%B Concurrency and Computation: Practice and Experience
%V 40319
%8 2019-01
%G eng
%N 262
%R https://doi.org/10.1002/cpe.5418

%0 Conference Paper
%B 5th EAI International Conference on Smart Objects and Technologies for Social Good
%D 2019
%T Data Logistics: Toolkit and Applications
%A Micah Beck
%A Terry Moore
%A Nancy French
%A Ezra Kissel
%A Martin Swany
%B 5th EAI International Conference on Smart Objects and Technologies for Social Good
%C Valencia, Spain
%8 2019-09
%G eng

%0 Generic
%D 2019
%T Design and Implementation for FFT-ECP on Distributed Accelerated Systems
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Daniel Schultz
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng
%9 ECP WBS 2.3.3.09 Milestone Report

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Distributed-Memory Lattice H-Matrix Factorization
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Rio Yokota
%A Jack Dongarra
%X We parallelize the LU factorization of a hierarchical low-rank matrix (ℋ-matrix) on a distributed-memory computer. This is much more difficult than the ℋ-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the ℋ-matrix factorization. In this work, we propose to factorize the matrix using a “lattice ℋ-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the ℋ-matrix format. These blocks stored in the ℋ-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of ℋ-matrix factorization. We first compare factorization performances using the ℋ-matrix, BLR, and lattice ℋ-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the ℋ-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice ℋ-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1046–1063
%8 2019-08
%G eng
%N 5
%R https://doi.org/10.1177/1094342019861139

%0 Generic
%D 2019
%T Does your tool support PAPI SDEs yet?
%A Anthony Danalis
%A Heike Jagode
%A Jack Dongarra
%I 13th Scalable Tools Workshop
%C Tahoe City, CA
%8 2019-07
%G eng

%0 Generic
%D 2019
%T An Empirical View of SLATE Algorithms on Scalable Hybrid System
%A Asim YarKhan
%A Jakub Kurzak
%A Ahmad Abdelfattah
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2019-09
%G eng

%0 Journal Article
%J International Journal of High Performance Computing and Networking
%D 2019
%T Evaluation of Directive-Based Performance Portable Programming Models
%A M. Graham Lopez
%A Wayne Joubert
%A Verónica Larrea
%A Oscar Hernandez
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K OpenACC
%K OpenMP 4
%K performance portability
%K Programming models
%X We present an extended exploration of the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architecture with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, and we document how much tuning might be required and what lessons we can learn from these experiences. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. To better understand fundamental compute vs. bandwidth bound characteristics, we add the compute-bound Level 3 BLAS GEMM kernel to our linear algebra evaluation. We implement the kernels of interest using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both x86_64 and Power8 with attached NVIDIA GPUs, x86_64 multicores, self-hosted Intel Xeon Phi KNL, as well as an x86_64 host system with Intel Xeon Phi coprocessors. We update these evaluations with the newest version of the NVIDIA Pascal architecture (P100), Intel KNL 7230, Power8+, and the newest supporting compiler implementations. Furthermore, we present in detail what factors affected the performance portability, including how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimise and target multiple platforms.
%B International Journal of High Performance Computing and Networking
%V 14
%P 165-182
%8 2019-07
%G eng
%N 2
%R http://dx.doi.org/10.1504/IJHPCN.2017.10009064

%0 Conference Paper
%B PAW-ATM Workshop at SC19
%D 2019
%T Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization
%A Yu Pei
%A George Bosilca
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Jack Dongarra
%B PAW-ATM Workshop at SC19
%I ACM
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng

%0 Generic
%D 2019
%T FFT-ECP Fast Fourier Transform
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Daniel Schultz
%A Jack Dongarra
%I 2019 ECP Annual Meeting (Research Poster)
%C Houston, TX
%8 2019-01
%G eng

%0 Generic
%D 2019
%T FFT-ECP Implementation Optimizations and Features Phase
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Hejer Shaiek
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-10
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frédéric Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%B International Journal of High Performance Computing Applications
%V 33
%P 1255-1274
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019866891

%0 Conference Paper
%B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2019
%T Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC
%A Thomas Herault
%A Yves Robert
%A George Bosilca
%A Jack Dongarra
%B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I IEEE
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B IEEE Cluster
%D 2019
%T Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs
%A Thananon Patinyasakdikul
%A David Eberius
%A George Bosilca
%A Nathan Hjelm
%K communication contention
%K MPI
%K thread
%X The Message Passing Interface (MPI) has been one of the most prominent programming paradigms in high-performance computing (HPC) for the past decade. Lately, with changes in modern hardware leading to a drastic increase in the number of processor cores, developers of parallel applications are moving toward more integrated parallel programming paradigms, where MPI is used along with other, possibly node-level, programming paradigms, or MPI+X. MPI+threads emerged as one of the favorite choices, according to a survey of the HPC community. However, threading support in MPI comes with many compromises to the overall performance delivered, and, therefore, its adoption is compromised. This paper studies in depth the MPI multi-threaded implementation design in one of the leading MPI implementations, Open MPI, and exposes some of the shortcomings of the current design. We propose, implement, and evaluate a new design of the internal handling of communication progress which allows for a significant boost in multi-threading performance, increasing the viability of MPI in the MPI+X programming paradigm.
%B IEEE Cluster
%I IEEE
%C Albuquerque, NM
%8 2019-09
%G eng

%0 Generic
%D 2019
%T GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems
%A Hejer Shaiek
%A Stanimire Tomov
%A Alan Ayala
%A Azzam Haidar
%A Jack Dongarra
%K CUDA-Aware MPI
%K ECP
%K FFT
%K FFT-ECP
%K gpu
%K GPUDirect
%X Fast Fourier transforms (FFTs) are used in applications ranging from molecular dynamics and spectrum estimation to machine learning, fast convolution and correlation, signal modulation, wireless multimedia applications, and others. However, FFTs are memory bound, and therefore, to accelerate them, it is crucial to avoid and optimize the FFTs’ communications. To this end, we present a 3-D FFT design for distributed graphics processing unit (GPU) systems that: (1) efficiently uses GPUs’ high bandwidth, (2) reduces global communications algorithmically, when possible, and (3) employs GPUDirect technologies as well as MPI optimizations in the development of high-performance FFTs for large-scale GPU-accelerated systems. We show that these developments and optimizations lead to very good strong scalability and a performance that is close to 90% of the theoretical peak.
%B EuroMPI'19 Posters, Zurich, Switzerland
%I ICL
%8 2019-09
%G eng
%9 Extended Abstract

%0 Conference Paper
%B ISC High Performance
%D 2019
%T Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments
%A Kwai Wong
%A Stanimire Tomov
%A Jack Dongarra
%B ISC High Performance
%I Springer International Publishing
%C Frankfurt, Germany
%8 2019-06
%G eng

%0 Conference Paper
%B Workshop on Exascale MPI (ExaMPI) at SC19
%D 2019
%T Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation
%A Alan Ayala
%A Stanimire Tomov
%A Xi Luo
%A Hejer Shaiek
%A Azzam Haidar
%A George Bosilca
%A Jack Dongarra
%K Collective MPI
%K Exascale applications
%K FFT
%K Heterogeneous systems
%K scalable
%B Workshop on Exascale MPI (ExaMPI) at SC19
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist
%D 2019
%T Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%X The emergence of deep learning as a leading computational workload for machine learning tasks on large-scale cloud infrastructure installations has led to a plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers on these new platforms makes it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. In order to address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats that are implemented in modern hardware accelerators and are not necessarily compliant with the IEEE half-precision specification. We include the explanation of a broader context and connections to established IEEE floating-point standards and existing high-performance computing (HPC) benchmarks. We also present a new formulation of LU factorization that we call signed square root LU, which produces more numerically balanced L and U factors that directly address the problems of limited range of the low-precision storage formats. The experimental results indicate that it is possible to recover substantial amounts of the accuracy in the system solution that would otherwise be lost. Previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion will also explore the numerical stability issues that are important for robust linear solvers on these new hardware platforms.
%B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist
%I IEEE
%C Waltham, MA
%8 2019-09
%G eng
%1 Best Paper Finalist

%0 Conference Proceedings
%B ACM International Conference on Supercomputing (ICS '19)
%D 2019
%T Least Squares Solvers for Distributed-Memory Machines with GPU Accelerators
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%Y Rudolf Eigenmann
%Y Chen Ding
%Y Sally A. McKee
%B ACM International Conference on Supercomputing (ICS '19)
%I ACM
%C Phoenix, Arizona
%P 117–126
%8 2019-06
%@ 9781450360791
%G eng
%R https://doi.org/10.1145/3330345.3330356

%0 Conference Proceedings
%B Euro-Par 2019: Parallel Processing
%D 2019
%T Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators
%A Kurzak, Jakub
%A Mark Gates
%A Charara, Ali
%A Asim YarKhan
%A Yamazaki, Ichitaro
%A Jack Dongarra
%E Yahyapour, Ramin
%B Euro-Par 2019: Parallel Processing
%I Springer
%V 11725
%P 495–506
%8 2019-08
%@ 978-3-030-29399-4
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-030-29400-7_35
%R https://doi.org/10.1007/978-3-030-29400-7_35

%0 Journal Article
%J Future Generation Computer Systems
%D 2019
%T Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging
%A Nuria Losada
%A George Bosilca
%A Aurelien Bouteiller
%A Patricia González
%A María J. Martín
%K Application-level checkpointing
%K Local rollback
%K Message logging
%K MPI
%K resilience
%X The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface – the last proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard – enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements, and, overall, the resilience impact on the applications.
%B Future Generation Computer Systems
%V 91
%P 450-464
%8 2019-02
%G eng
%R https://doi.org/10.1016/j.future.2018.09.041

%0 Generic
%D 2019
%T MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs
%A Lucien Ng
%A Sihan Chen
%A Alex Gessinger
%A Daniel Nichols
%A Sophia Cheng
%A Anu Meenasorna
%A Kwai Wong
%A Stanimire Tomov
%A Azzam Haidar
%A Eduardo D'Azevedo
%A Jack Dongarra
%I University of Tennessee
%8 2019-01
%G eng
%R 10.13140/RG.2.2.14906.64961

%0 Conference Paper
%B Practice and Experience in Advanced Research Computing (PEARC ’19)
%D 2019
%T MagmaDNN: Accelerated Deep Learning Using MAGMA
%A Daniel Nichols
%A Kwai Wong
%A Stanimire Tomov
%A Lucien Ng
%A Sihan Chen
%A Alex Gessinger
%B Practice and Experience in Advanced Research Computing (PEARC ’19)
%I ACM
%C Chicago, IL
%8 2019-07
%G eng

%0 Conference Paper
%B ISC High Performance
%D 2019
%T MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing
%A Daniel Nichols
%A Natalie-Sofia Tomov
%A Frank Betancourt
%A Stanimire Tomov
%A Kwai Wong
%A Jack Dongarra
%X In this paper, we present work towards the development of a new data analytics and machine learning (ML) framework, called MagmaDNN. Our main goal is to provide scalable, high-performance data analytics and ML solutions for scientific applications running on current and upcoming heterogeneous many-core GPU-accelerated architectures. To this end, since many of the functionalities needed are based on standard linear algebra (LA) routines, we designed MagmaDNN to derive its performance power from the MAGMA library. The close integration provides the fundamental (scalable high-performance) LA routines available in MAGMA as a backend to MagmaDNN. We present some design issues for performance and scalability that are specific to ML using Deep Neural Networks (DNN), as well as the MagmaDNN designs towards overcoming them. In particular, MagmaDNN uses well-established HPC techniques from the area of dense LA, including task-based parallelization, DAG representations, scheduling, mixed-precision algorithms, asynchronous solvers, and autotuned hyperparameter optimization. We illustrate these techniques and their incorporation and use to outperform other currently available frameworks.
%B ISC High Performance
%I Springer International Publishing
%C Frankfurt, Germany
%8 2019-06
%G eng
%R https://doi.org/10.1007/978-3-030-34356-9_37

%0 Conference Paper
%B 48th International Conference on Parallel Processing (ICPP 2019)
%D 2019
%T Massively Parallel Automated Software Tuning
%A Jakub Kurzak
%A Yaohung Tsai
%A Mark Gates
%A Ahmad Abdelfattah
%A Jack Dongarra
%X This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpeN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources, takes advantage of multithreading for parallel compilation, and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined.
%B 48th International Conference on Parallel Processing (ICPP 2019)
%I ACM Press
%C Kyoto, Japan
%8 2019-08
%G eng
%R https://doi.org/10.1145/3337821.3337908

%0 Conference Paper
%B International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
%A Zhaojun Bai
%A Jack Dongarra
%A Ding Lu
%A Ichitaro Yamazaki
%X Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that reduces both the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. Speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.
%B International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng

%0 Generic
%D 2019
%T New Robust ScaLAPACK Routine for Computing the QR Factorization with Column Pivoting
%A Zvonimir Bujanovic
%A Zlatko Drmac
%X In this note we describe two modifications of the ScaLAPACK subroutines PxGEQPF for computing the QR factorization with the Businger-Golub column pivoting. First, we resolve a subtle numerical instability in the same way as we did for the LAPACK subroutines xGEQPF, xGEQP3 in 2006 [LAPACK Working Note 176 (2006); ACM Trans. Math. Softw. 2008]. The problem originates in the first release of LINPACK in the 1970s: due to severe cancellations in the down-dating of partial column norms, the pivoting procedure may be completely in the dark about the true norms of the pivot column candidates. This may cause mis-pivoting and, as a result, loss of the important rank-revealing structure of the computed triangular factor, with severe consequences for other solvers that rely on the rank-revealing pivoting. The instability is so subtle that, e.g., inserting a WRITE statement or changing the process topology can drastically change the result. Secondly, we also correct a programming error in the complex subroutines PCGEQPF, PZGEQPF, which also causes wrong pivoting because of erroneous use of PSCNRM2, PDZNRM2 for the explicit norm computation.
%B LAPACK Working Note
%I University of Tennessee
%8 2019-10
%G eng

%0 Conference Paper
%B Practice and Experience in Advanced Research Computing (PEARC ’19)
%D 2019
%T OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework
%A Frank Betancourt
%A Kwai Wong
%A Efosa Asemota
%A Quindell Marshall
%A Daniel Nichols
%A Stanimire Tomov
%B Practice and Experience in Advanced Research Computing (PEARC ’19)
%I ACM
%C Chicago, IL
%8 2019-07
%G eng

%0 Generic
%D 2019
%T Optimizing Batch HGEMM on Small Sizes Using Tensor Cores
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC)
%C San Jose, CA
%8 2019-03
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T PAPI Software-Defined Events for in-Depth Performance Analysis
%A Heike Jagode
%A Anthony Danalis
%A Hartwig Anzt
%A Jack Dongarra
%X The methodology and standardization layer provided by the Performance Application Programming Interface (PAPI) has played a vital role in application profiling for almost two decades. It has enabled sophisticated performance analysis tool designers and performance-conscious scientists to gain insights into their applications by simply instrumenting their code using a handful of PAPI functions that “just work” across different hardware components. In the past, PAPI development had focused primarily on hardware-specific performance metrics. However, the rapidly increasing complexity of software infrastructure poses new measurement and analysis challenges for the developers of large-scale applications. In particular, acquiring information regarding the behavior of libraries and runtimes—used by scientific applications—requires low-level binary instrumentation, or APIs specific to each library and runtime. No uniform API for monitoring events that originate from inside the software stack has emerged. In this article, we present our efforts to extend PAPI’s role so that it becomes the de facto standard for exposing performance-critical events, which we refer to as software-defined events (SDEs), from different software layers. Upgrading PAPI with SDEs enables monitoring of both types of performance events—hardware- and software-related events—in a uniform way, through the same consistent PAPI. The goal of this article is threefold. First, we motivate the need for SDEs and describe our design decisions regarding the functionality we offer through PAPI’s new SDE interface. Second, we illustrate how SDEs can be utilized by different software packages, specifically, by showcasing their use in the numerical linear algebra library MAGMA-Sparse, the tensor algebra library TAMM that is part of the NWChem suite, and the compiler-based performance analysis tool Byfl. Third, we provide a performance analysis of the overhead that results from monitoring SDEs and discuss the trade-offs between overhead and functionality.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1113-1127
%8 2019-11
%G eng
%U https://doi.org/10.1177/1094342019846287
%N 6

%0 Generic
%D 2019
%T PAPI's new Software-Defined Events for in-depth Performance Analysis
%A Anthony Danalis
%A Heike Jagode
%A Jack Dongarra
%X One of the most recent developments of the Performance API (PAPI) is the addition of Software-Defined Events (SDE). PAPI has successfully served the role of the abstraction and unification layer for hardware performance counters for the past two decades. This talk presents our effort to extend this role to encompass performance critical information that does not originate in hardware, but rather in critical software layers, such as libraries and runtime systems. Our overall objective is to enable monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through one consistent PAPI interface. Performance analysts will be able to form a complete picture of the entire application performance without learning new instrumentation primitives. In this talk, we outline PAPI's new SDE API and showcase the usefulness of SDE through its employment in software layers as diverse as the math library MAGMA, the dataflow runtime PaRSEC, and the state-of-the-art chemistry application NWChem. We outline the process of instrumenting these software packages and highlight the performance information that can be acquired with SDEs.
%I 13th Parallel Tools Workshop
%C Dresden, Germany
%8 2019-09
%G eng

%0 Journal Article
%J Parallel Computing
%D 2019
%T Parallel Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%K approximate selection
%K gpu
%K kth order statistics
%K multiselection
%K parallel selection algorithm
%X We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always leveraging the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – “pleasant” data distributions. At the same time, as the proposed SampleSelect algorithm does not work on the actual element values but on the element ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B Parallel Computing
%V 91
%8 2020-03
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819119301796
%! Parallel Computing
%R https://doi.org/10.1016/j.parco.2019.102588

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T ParILUT – A Parallel Threshold ILU for GPUs
%A Hartwig Anzt
%A Tobias Ribizel
%A Goran Flegar
%A Edmond Chow
%A Jack Dongarra
%X In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored towards GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations, and outperform existing multithreaded CPU implementations.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPS.2019.00033

%0 Conference Paper
%B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19
%D 2019
%T Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools
%A Qinglei Cao
%A Yu Pei
%A Thomas Herault
%A Kadir Akbudak
%A Aleksandr Mikhalev
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19
%I ACM
%C Denver, CO
%8 2019-11
%G eng

%0 Journal Article
%J Parallel Computing
%D 2019
%T Performance of Asynchronous Optimized Schwarz with One-sided Communication
%A Ichitaro Yamazaki
%A Edmond Chow
%A Aurelien Bouteiller
%A Jack Dongarra
%X In asynchronous iterative methods on distributed-memory computers, processes update their local solutions using data from other processes without an implicit or explicit global synchronization that corresponds to advancing the global iteration counter. In this work, we test the asynchronous optimized Schwarz domain-decomposition iterative method using various one-sided (remote direct memory access) communication schemes with passive target completion. The results show that when one-sided communication is well-supported, the asynchronous version of optimized Schwarz can outperform the synchronous version even for perfectly balanced partitionings of the problem on a supercomputer with uniform nodes.
%B Parallel Computing
%V 86
%P 66-81
%8 2019-08
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819118301261
%R https://doi.org/10.1016/j.parco.2019.05.004

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1145/3264491

%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC’19)
%D 2019
%T Progressive Optimization of Batched LU Factorization on GPUs
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%B IEEE High Performance Extreme Computing Conference (HPEC’19)
%I IEEE
%C Waltham, MA
%8 2019-09
%G eng

%0 Journal Article
%J Computing in Science and Engineering
%D 2019
%T Race to Exascale
%A Jack Dongarra
%A Steven Gottlieb
%A William T. Kramer
%X Whether called leadership computing, flagship computing, or just plain exascale, over the next few years, governments around the world are planning to spend over 10 billion dollars on a handful of new computer systems that will strive to reach an exascale level of performance. These systems and projects reflect the widespread and expanding recognition that almost all science and engineering endeavors now are intrinsically reliant on computing power not just for modeling and simulation but for data analysis, big data, and machine learning. Scientists and engineers consider computers as “universal instruments” of insight.
%B Computing in Science and Engineering
%V 21
%P 4-5
%8 2019-03
%G eng
%N 1
%R https://doi.org/10.1109/MCSE.2018.2882574

%0 Conference Paper
%B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19)
%D 2019
%T Replication is More Efficient Than You Think
%A Anne Benoit
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%B The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19)
%I ACM Press
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%D 2019
%T Reservation Strategies for Stochastic Jobs
%A Guillaume Aupy
%A Ana Gainaru
%A Valentin Honoré
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
%I IEEE Computer Society Press
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng

%0 Conference Paper
%B European MPI Users' Group Meeting (EuroMPI '19)
%D 2019
%T Runtime Level Failure Detection and Propagation in HPC Systems
%A Dong Zhong
%A Aurelien Bouteiller
%A Xi Luo
%A George Bosilca
%X As the scale of high-performance computing (HPC) systems continues to grow, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present here the design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable capabilities of fault management to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is at the same time generic and efficient.
%B European MPI Users' Group Meeting (EuroMPI '19)
%I ACM
%C Zürich, Switzerland
%8 2019-09
%@ 978-1-4503-7175-9
%G eng
%R https://doi.org/10.1145/3343211.3343225

%0 Conference Paper
%B IEEE Cluster 2019
%D 2019
%T Scheduling Independent Stochastic Tasks on Heterogeneous Cloud Platforms
%A Yiqin Gao
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%B IEEE Cluster 2019
%I IEEE Computer Society Press
%C Albuquerque, New Mexico
%8 2019-09
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T Scheduling Independent Stochastic Tasks under Deadline and Budget Constraints
%A Louis-Claude Canon
%A Aurélie Kong Win Chang
%A Yves Robert
%A Frederic Vivien
%X This article discusses scheduling strategies for the problem of maximizing the expected number of tasks that can be executed on a cloud platform within a given budget and under a deadline constraint. The execution times of tasks follow independent and identically distributed probability laws. The main questions are how many processors to enroll and whether and when to interrupt tasks that have been executing for some time. We provide complexity results and an asymptotically optimal strategy for the problem instance with discrete probability distributions and without deadline. We extend the latter strategy for the general case with continuous distributions and a deadline and we design an efficient heuristic which is shown to outperform standard approaches when running simulations for a variety of useful distribution laws.
%B International Journal of High Performance Computing Applications
%V 34
%P 246-264
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1177/1094342019852135

%0 Generic
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%I International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%X The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components.
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%R https://doi.org/10.1145/3295500.3356223

%0 Generic
%D 2019
%T SLATE Developers' Guide
%A Ali Charara
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-12
%G eng
%9 SLATE Working Notes

%0 Generic
%D 2019
%T SLATE Mixed Precision Performance Report
%A Ali Charara
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng

%0 Generic
%D 2019
%T SLATE Working Note 12: Implementing Matrix Inversions
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-06
%G eng

%0 Generic
%D 2019
%T SLATE Working Note 13: Implementing Singular Value and Symmetric/Hermitian Eigenvalue Solvers
%A Mark Gates
%A Mohammed Al Farhan
%A Ali Charara
%A Jakub Kurzak
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-09
%G eng
%9 SLATE Working Notes

%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2019
%T Software-Defined Events through PAPI
%A Anthony Danalis
%A Heike Jagode
%A Thomas Herault
%A Piotr Luszczek
%A Jack Dongarra
%X PAPI has been used for almost two decades as an abstraction and standardization layer for profiling hardware-specific performance metrics. However, application developers (and profiling software packages) are quite often interested in information beyond hardware counters, such as the behavior of libraries used by the software that is being profiled. So far, accessing this information has required interfacing directly with the libraries on a case-by-case basis, or low-level binary instrumentation. In this paper, we introduce the new Software-Defined Event (SDE) component of PAPI which aims to enable PAPI to serve as an abstraction and standardization layer for events that originate in software layers as well. Extending PAPI to include SDEs enables monitoring of both types of performance events (hardware- and software-related events) in a uniform way, through the same consistent PAPI interface. Furthermore, implementing SDE as a PAPI component means that the new API is aimed only at the library developers who wish to export events from within their libraries. The API for reading PAPI events (both hardware and software) remains the same, so all legacy codes and tools that use PAPI will not only continue to work, but they will automatically be able to read SDEs wherever those are available. The goal of this paper is threefold. First, we outline our design decisions regarding the functionality we offer through the new SDE interface, and offer simple examples of usage. Second, we illustrate how those events can be utilized by different software packages, specifically, by showcasing their use in the task-based runtime PaRSEC, and the HPCG supercomputing benchmark. Third, we provide a thorough performance analysis of the overhead that results from monitoring different types of SDEs, and showcase the negligible overhead of using PAPI SDE even in cases of extremely heavy use.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPSW.2019.00069

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2019
%T Solving Linear Diophantine Systems on Parallel Architectures
%A Dmitry Zaitsev
%A Stanimire Tomov
%A Jack Dongarra
%K Mathematical model
%K Matrix decomposition
%K Parallel architectures
%K Petri nets
%K Software algorithms
%K Sparse matrices
%K Task analysis
%X Solving linear Diophantine systems of equations is applied in discrete-event systems, model checking, formal languages and automata, logic programming, cryptography, networking, signal processing, and chemistry. For modeling discrete systems with Petri nets, a solution in non-negative integer numbers is required, which represents an intractable problem. For this reason, solving such kinds of tasks with significant speedup is highly appreciated. In this paper we design a new solver of linear Diophantine systems based on the parallel-sequential composition of the system clans. The solver is studied and implemented to run on parallel architectures using a two-level parallelization concept based on MPI and OpenMP. A decomposable system is usually represented by a sparse matrix; a minimal clan size of the decomposition restricts the granulation of the technique. MPI is applied for solving systems for clans using a parallel-sequential composition on distributed-memory computing nodes, while OpenMP is applied in solving a single indecomposable system on a single node using multiple cores. A dynamic task-dispatching subsystem is developed for distributing systems on nodes in the process of compositional solution. Computational speedups are obtained on a series of test examples, e.g., illustrating that the best value constitutes up to 45 times speedup obtained on 5 nodes with 20 cores each.
%B IEEE Transactions on Parallel and Distributed Systems
%V 30
%P 1158-1169
%8 2019-05
%G eng
%U https://ieeexplore.ieee.org/document/8482295
%N 5
%R https://doi.org/10.1109/TPDS.2018.2873354

%0 Book Section
%B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project
%D 2019
%T System Software for Many-Core and Multi-Core Architectures
%A Atsushi Hori
%A Tsujita, Yuichi
%A Shimada, Akio
%A Yoshinaga, Kazumi
%A Namiki, Mitaro
%A Fukazawa, Go
%A Sato, Mikiko
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%E Sato, Mitsuhisa
%X In this project, the software technologies for post-peta scale computing were explored. More specifically, OS technologies for heterogeneous architectures, lightweight threads, scalable I/O, and fault mitigation were investigated. As for the OS technologies, a new parallel execution model, Partitioned Virtual Address Space (PVAS), for the many-core CPU was proposed. For the heterogeneous architectures, where multi-core CPU and many-core CPU are connected with an I/O bus, an extension of PVAS, Multiple-PVAS, to have a unified virtual address space of multi-core and many-core CPUs was proposed. The proposed PVAS was also enhanced to have multiple processes where process context switch can take place at the user level (named User-Level Process: ULP). As for the scalable I/O, EARTH, a set of optimization techniques for MPI collective I/O, was proposed. Lastly, for the fault mitigation, User Level Fault Mitigation (ULFM) was improved to have a faster agreement process, and sliding methods to substitute failed nodes with spare nodes were proposed. The funding of this project ended in 2016; however, many proposed technologies are still being propelled.
%B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project
%I Springer Singapore
%C Singapore
%P 59-75
%@ 978-981-13-1924-2
%G eng
%U https://doi.org/10.1007/978-981-13-1924-2_4
%R https://doi.org/10.1007/978-981-13-1924-2_4

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Toward a Modular Precision Ecosystem for High-Performance Computing
%A Hartwig Anzt
%A Goran Flegar
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%K conjugate gradient
%K GPUs
%K Jacobi method
%K Modular precision
%K multicore processors
%K PageRank
%K parallel numerical linear algebra
%X With the memory bandwidth of current computer architectures being significantly slower than the (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power in today’s high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a “modular precision ecosystem” that allows for more flexibility in terms of customized data access. For memory-bounded scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm—two applications that are popular in the communities and at the same time characteristic representatives for the field of numerical linear algebra and data analytics, respectively.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1069-1078
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019846547

%0 Journal Article
%J Proceedings in Applied Mathematics and Mechanics
%D 2019
%T Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility
%A Hartwig Anzt
%A Terry Cojean
%A Eileen Kuhn
%X In this position paper we argue for implementing an alternative peer review process for scientific computing contributions that promotes high quality scientific software developments as fully‐recognized conference submissions. The idea is based on leveraging the code reviewers' feedback on scientific software contributions to community software developments as a third‐party review involvement. Providing open access to this technical review would complement the scientific review of the contribution, efficiently reduce the workload of the undisclosed reviewers, improve the algorithm implementation quality and software sustainability, and ensure full reproducibility of the reported results. Using this process creates incentives to publish scientific algorithms in open source software – instead of designing prototype algorithms with the unique purpose of publishing a paper. In addition, the comments and suggestions of the community being archived in the versioning control systems ensure that community reviewers also receive credit for their review contributions – unlike reviewers in the traditional peer review process. Finally, it reflects the particularity of the scientific computing community using conferences rather than journals as the main publication venue.
%B Proceedings in Applied Mathematics and Mechanics
%V 19
%8 2019-11
%G eng
%N 1
%R https://doi.org/10.1002/pamm.201900490

%0 Conference Paper
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%D 2019
%T Towards Continuous Benchmarking
%A Hartwig Anzt
%A Yen Chen Chen
%A Terry Cojean
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Enrique S. Quintana-Orti
%A Yuhsiang M. Tsai
%A Weichung Wang
%X We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository and visualize it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics.
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%I ACM Press
%C Zurich, Switzerland
%8 2019-06
%@ 9781450367707
%G eng
%R https://doi.org/10.1145/3324989.3325719

%0 Conference Paper
%B ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2019
%T Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%K Half precision
%K mixed-precision solvers
%K Tensor cores FP16 arithmetic
%B ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I IEEE
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B 2019 European Conference on Parallel Processing (Euro-Par 2019)
%D 2019
%T Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring
%A Shu-Mei Tseng
%A Bogdan Nicolae
%A George Bosilca
%A Emmanuel Jeannot
%A Aparna Chandramowlishwaran
%A Franck Cappello
%X Stealing network bandwidth helps a variety of HPC runtimes and services to run additional operations in the background without negatively affecting the applications. A key ingredient to make this possible is an accurate prediction of the future network utilization, enabling the runtime to plan the background operations in advance, such as to avoid competing with the application for network bandwidth. In this paper, we propose a portable deep learning predictor that only uses the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not have an initial training phase; it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state-of-the-art on two representative applications.
%B 2019 European Conference on Parallel Processing (Euro-Par 2019)
%I Springer
%C Göttingen, Germany
%8 2019-08
%G eng
%R https://doi.org/10.1007/978-3-030-29400-7_4

%0 Generic
%D 2019
%T Understanding Native Event Semantics
%A Anthony Danalis
%A Heike Jagode
%A Daniel Barry
%A Jack Dongarra
%I 9th JLESC Workshop
%C Knoxville, TN
%8 2019-04
%G eng

%0 Conference Paper
%B 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)
%D 2019
%T Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training
%A Jiali Li
%A Bogdan Nicolae
%A Justin M. Wozniak
%A George Bosilca
%X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as number of nodes, number of workers, threads per node, batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction and weight updates mitigates the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.
%I IEEE
%C Denver, CO
%8 2019-11
%G eng
%R https://doi.org/10.1109/MLHPC49564.2019.00006

%0 Journal Article
%J Parallel Computing
%D 2019
%T Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K Batched algorithms
%K Block-Jacobi
%K Gauss–Jordan elimination
%K Graphics processor
%K matrix inversion
%K sparse linear systems
%X In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
%B Parallel Computing
%V 81
%P 131-146
%8 2019-01
%G eng
%R https://doi.org/10.1016/j.parco.2017.12.006

%0 Conference Paper
%B 1st Workshop on Sustainable Scientific Software (CW3S19)
%D 2019
%T What it Takes to keep PAPI Instrumental for the HPC Community
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%C Collegeville, Minnesota
%8 2019-07
%G eng
%U https://collegeville.github.io/CW3S19/WorkshopResources/WhitePapers/JagodeHeike_CW3S19_papi.pdf

%0 Generic
%D 2019
%T What it Takes to keep PAPI Instrumental for the HPC Community
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%I The 2019 Collegeville Workshop on Sustainable Scientific Software (CW3S19)
%C Collegeville, MN
%8 2019-07
%G eng

%0 Generic
%D 2019
%T Is your scheduling good? How would you know?
%A Anthony Danalis
%A Heike Jagode
%A Jack Dongarra
%X Optimal scheduling is a goal that can rarely be achieved, even in purely theoretical contexts where the nuanced behavior of complex hardware and software systems can be abstracted away, and simplified assumptions can be made. In real runtime systems, task schedulers are usually designed based on intuitions about optimal design and heuristics such as minimizing idle time and load imbalance, as well as maximizing data locality and reuse. This harsh reality is due in part to the very crude tools designers of task scheduling systems have at their disposal for assessing the quality of their assumptions. Examining hardware behavior—such as cache reuse—through counters rarely leads to improvement in scheduler design, and quite often the runtime designers are left with total execution time as their only guiding mechanism. In this talk we will discuss new methods for illuminating the dark corners of task scheduling on real hardware. We will present our work on extending PAPI—which has long been the de facto standard for accessing hardware events—so that it can be used to access software events. We will focus specifically on the impact this work can have on runtime systems with dynamic schedulers, and discuss illustrative examples.
%I 14th Scheduling for Large Scale Systems Workshop
%C Bordeaux, France
%8 2019-06
%G eng

%0 Journal Article
%J Computer
%D 2018
%T The 30th Anniversary of the Supercomputing Conference: Bringing the Future Closer—Supercomputing History and the Immortality of Now
%A Jack Dongarra
%A Vladimir Getov
%A Kevin Walsh
%K High-performance computing
%K history of computing
%K SC
%K Scientific computing
%K supercomputing
%K Virtual Roundtable
%X A panel of experts—including Gordon Bell, Jack Dongarra, William E. (Bill) Johnston, Horst Simon, Erich Strohmaier, and Mateo Valero—discuss historical reflections on the past 30 years of the Supercomputing (SC) conference, its leading role for the professional community and some exciting future challenges.
%B Computer
%V 51
%P 74–85
%8 2018-11
%G eng
%N 10
%R 10.1109/MC.2018.3971352

%0 Generic
%D 2018
%T Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision
%A Xiaohe Cheng
%A Anumeena Sorna
%A Eduardo D'Azevedo
%A Kwai Wong
%A Stanimire Tomov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), ACM Student Research Poster
%C Dallas, TX
%8 2018-11
%G eng

%0 Generic
%D 2018
%T Accelerating Linear Algebra with MAGMA
%A Stanimire Tomov
%A Mark Gates
%A Azzam Haidar
%I ECP Annual Meeting 2018, Tutorial
%C Knoxville, TN
%8 2018-02
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Accelerating NWChem Coupled Cluster through dataflow-based Execution
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%K CCSD
%K dag
%K dataflow
%K NWChem
%K parsec
%K ptg
%K tasks
%X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWCHEM, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort of converting the NWCHEM’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PARSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PARSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWCHEM codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWCHEM); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWCHEM, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation.
%B The International Journal of High Performance Computing Applications
%V 32
%P 540–551
%8 2018-07
%G eng
%U http://journals.sagepub.com/doi/10.1177/1094342016672543
%N 4
%9 Journal Article
%& 540
%R 10.1177/1094342016672543

%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs
%A Tingxing Dong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Batched
%K Eigenvalue and singular value problems
%K hardware accelerators
%K numerical linear algebra
%K Two-sided factorization algorithms
%X The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they get targeted for optimization and used in a standard way in applications, calling them directly from highly optimized, standard numerical libraries, like (batched) BLAS and LAPACK. While most of the developments have been for one-sided factorizations and solvers, many important applications – from big data analytics to information retrieval, low-rank approximations for solvers and preconditioners – require two-sided factorizations, and most notably the SVD factorization. To address these needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subroutines (BLAS) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. We propose a device functions-based methodology and big-tile setting techniques in our batched BLAS design. The different optimization techniques result in many software versions that must be tuned, for which we adopt an auto-tuning strategy to automatically derive the optimized instances of the routines. We illustrate our batched BLAS approach to optimize batched SVD bi-diagonalization progressively on GPUs. The progression is illustrated on an NVIDIA K40c GPU, and also, ported and presented on AMD Fiji Nano GPU, using AMD's Heterogeneous-Compute Interface for Portability (HIP) C++ runtime API. We demonstrate achieving 80% of the theoretically achievable peak performance for the overall algorithm, and significant acceleration of the Level-2 BLAS GEMV and Level-3 BLAS GEMM needed compared to vendor-optimized libraries on GPUs and multicore CPUs. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.
%B Journal of Computational Science
%V 26
%P 237–245
%8 2018-05
%G eng
%R https://doi.org/10.1016/j.jocs.2018.01.007

%0 Journal Article
%J Parallel Computing
%D 2018
%T Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K 2-stage
%K accelerator
%K Divide and conquer
%K gpu
%K Singular value decomposition
%K SVD
%X The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms to take full advantage of today’s high performance computers. For dense matrices, the classic algorithm for the singular value decomposition (SVD) uses a one stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two stage reduction to bidiagonal has been gaining popularity. It first reduces the matrix to band form using high performance Level 3 BLAS, then reduces the band matrix to bidiagonal form. As accelerators such as GPUs and co-processors are becoming increasingly widespread in high-performance computing, a question of great interest to many SVD users is how much the employment of a two stage reduction, as well as other current best practices in GPU computing, can accelerate this important routine. To fulfill this interest, we have developed an accelerated SVD employing a two stage reduction to bidiagonal and a number of other algorithms that are highly optimized for GPUs. Notably, we also parallelize and accelerate the divide and conquer algorithm used to solve the subsequent bidiagonal SVD. By accelerating all phases of the SVD algorithm, we provide a significant speedup compared to existing multi-core and GPU-based SVD implementations. In particular, using a P100 GPU, we illustrate a performance of up to 804 Gflop/s in double precision arithmetic to compute the full SVD of a 20k × 20k matrix in 90 seconds, which is 8.9× faster than MKL on two 10 core Intel Haswell E5-2650 v3 CPUs, 3.7× over the multi-core PLASMA two stage version, and 2.6× over the previously accelerated one stage MAGMA version.
%B Parallel Computing
%V 74
%P 3–18
%8 2018-05
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819117301758
%! Parallel Computing
%R 10.1016/j.parco.2017.10.004

%0 Conference Paper
%B The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18)
%D 2018
%T ADAPT: An Event-Based Adaptive Collective Communication Framework
%A Xi Luo
%A Wei Wu
%A George Bosilca
%A Thananon Patinyasakdikul
%A Linnan Wang
%A Jack Dongarra
%X The increase in scale and heterogeneity of high-performance computing (HPC) systems predisposes the performance of Message Passing Interface (MPI) collective communications to be susceptible to noise, and to adapt to a complex mix of hardware capabilities. The designs of state of the art MPI collectives heavily rely on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. Therefore, such design philosophy must be reconsidered to efficiently and robustly run on the large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI, using event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations, while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries of different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance against noise compared to other state of the art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using ADAPT event-based broadcast and reduce operations.
%I ACM Press
%C Tempe, Arizona
%8 2018-06
%@ 9781450357852
%G eng
%R 10.1145/3208040.3208054

%0 Generic
%D 2018
%T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Marc Baboulin
%A Joël Falcou
%A Jack Dongarra
%X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1,600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
%B Innovative Computing Laboratory Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Dense linear solvers
%K energy efficiency
%K GPU computing
%X Graphics Processing Units (GPUs) are widely used in accelerating dense linear solvers. The matrix factorizations, which dominate the runtime for these solvers, are often designed using a hybrid scheme, where GPUs perform trailing matrix updates, while the CPUs perform the panel factorizations. Consequently, hybrid solutions require high-end CPUs and optimized CPU software in order to deliver high performance. Furthermore, they lack the energy efficiency inherent for GPUs due to the use of less energy-efficient CPUs, as well as CPU-GPU communications. This paper presents analysis and design techniques that overcome the shortcomings of the hybrid algorithms, and allow the design of high-performance and energy-efficient dense LU and Cholesky factorizations that use GPUs only. The full GPU solution eliminates the need for a high-end CPU and optimized CPU software, which leads to a better energy efficiency. We discuss different design choices, and introduce optimized GPU kernels for panel factorizations. The developed solutions achieve 90+ percent of the performance of optimized hybrid solutions, while improving the energy efficiency by 50 percent. They outperform the vendor library by 30-50 percent in single precision, and 15-50 percent in double precision. We also show that hybrid designs trail the proposed solutions in performance when optimized CPU software is not available.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 2700–2712
%8 2018-12
%G eng
%N 12
%R 10.1109/TPDS.2018.2842785

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2018
%T Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters
%A Ichitaro Yamazaki
%A Ahmad Abdelfattah
%A Akihiro Ida
%A Satoshi Ohshima
%A Stanimire Tomov
%A Rio Yokota
%A Jack Dongarra
%X ppohBEM is an open-source software package implementing the boundary element method. One of its main software tasks is the solution of the dense linear system of equations, for which, ppohBEM relies on another software package called HACApK. To reduce the cost of solving the linear system, HACApK hierarchically compresses the coefficient matrix using adaptive cross approximation. This hierarchical compression greatly reduces the storage and time complexities of the solver and enables the solution of large-scale boundary value problems. To extend the capability of ppohBEM, in this paper, we carefully port the HACApK’s linear solver onto GPU clusters. Though the potential of the GPUs has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUs for a solver, like HACApK’s, that requires fine-grained computation and global communication. First, to utilize the GPUs, we integrate the batched GPU kernel that was recently released in the MAGMA software package. We discuss several techniques to improve the performance of the batched kernel. We then study various techniques to address the inter-GPU communication and study their effects on state-of-the-art GPU clusters. We believe that the techniques studied in this paper are of interest to a wide range of software packages running on GPUs, especially with the increasingly complex node architectures and the growing costs of the communication. We also hope that our efforts to integrate the GPU kernel or to set up the inter-GPU communication will influence the design of the future-generation batched kernels or the communication layer within a software stack.
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng

%0 Journal Article
%J Proceedings of the IEEE
%D 2018
%T Autotuning in High-Performance Computing Applications
%A Prasanna Balaprakash
%A Jack Dongarra
%A Todd Gamblin
%A Mary Hall
%A Jeffrey Hollingsworth
%A Boyana Norris
%A Richard Vuduc
%K High-performance computing
%K performance tuning programming systems
%X Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications. To date, autotuning has been used primarily in high-performance applications through tunable libraries or previously tuned application code that is integrated directly into the application. This paper draws on the authors' extensive experience applying autotuning to high-performance applications, describing both successes and future challenges. If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures. In particular, tools that configure the application must be integrated into the application build process so that tuning can be reapplied as the application and target architectures evolve.
%B Proceedings of the IEEE
%V 106
%P 2068–2083
%8 2018-11
%G eng
%N 11
%R 10.1109/JPROC.2018.2841200

%0 Journal Article
%J Proceedings of the IEEE
%D 2018
%T Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yaohung Tsai
%K Dense numerical linear algebra
%K performance autotuning
%X Computational problems in engineering and scientific disciplines often rely on the solution of many instances of small systems of linear equations, which are called batched solves. In this paper, we focus on the important variants of both batch Cholesky factorization and subsequent substitution. The former requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of these kernels that implement the factorization and the two substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for solvers of linear systems of equations. Due to their throughput-oriented design, GPUs exhibit the highest processing rates among the available processors. However, without careful design and coding, this speed is mostly restricted to large matrix sizes. We show an automated exploration of the implementation space as well as a new data layout for the batched class of SPD solvers. Our tests involve the solution of many thousands of linear SPD systems of exactly the same size. The primary focus of our techniques is on the individual matrices in the batch that have dimensions ranging from 5-by-5 up to 100-by-100. We compare our autotuned solvers against the state-of-the-art solvers such as those provided through NVIDIA channels and publicly available in the optimized MAGMA library. The observed performance is competitive and many times superior for many practical cases. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architecture platforms.
%B Proceedings of the IEEE
%V 106
%P 2040–2055
%8 2018-11
%G eng
%N 11
%R 10.1109/JPROC.2018.2868961

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2018
%T Autotuning Techniques for Performance-Portable Point Set Registration in 3D
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Vasileios Maroulas
%A Jack Dongarra
%X We present an autotuning approach applied to exhaustive performance engineering of the EM-ICP algorithm for the point set registration problem with a known reference. We were able to achieve progressively higher performance levels through a variety of code transformations and an automated procedure of generating a large number of implementation variants. Furthermore, we managed to exploit code patterns that are not common when only attempting manual optimization but which yielded in our tests better performance for the chosen registration algorithm. Finally, we also show how we maintained high levels of the performance rate in a portable fashion across a wide range of hardware platforms including multicore, manycore coprocessors, and accelerators. Each of these hardware classes is much different from the others and, consequently, cannot reliably be mastered by a single developer in a short time required to deliver a close-to-optimal implementation. We assert in our concluding remarks that our methodology as well as the presented tools provide a valid automation system for software optimization tasks on modern HPC hardware.
%B Supercomputing Frontiers and Innovations
%V 5
%8 2018-12
%G eng
%& 42
%R 10.14529/jsfi180404

%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The extensions beyond the original BLAS standard are considered that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng

%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batch computation
%K GPU computing
%K matrix factorization
%X The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8×.
%B Journal of Computational Science
%V 26
%P 226–236
%8 2018-05
%G eng
%R https://doi.org/10.1016/j.jocs.2018.01.005

%0 Generic
%D 2018
%T Bidiagonal SVD Computation via an Associated Tridiagonal Eigenproblem
%A Osni Marques
%A James Demmel
%A Paulo B. Vasconcelos
%X In this paper, we present an algorithm for the singular value decomposition (SVD) of a bidiagonal matrix by means of the eigenpairs of an associated symmetric tridiagonal matrix. The algorithm is particularly suited for the computation of a subset of singular values and corresponding vectors. We focus on a sequential version of the algorithm, and discuss special cases and implementation details. We use a large set of bidiagonal matrices to assess the accuracy of the implementation in single and double precision, as well as to identify potential shortcomings. We show that the algorithm can be up to three orders of magnitude faster than existing algorithms, which are limited to the computation of a full SVD. We also show time comparisons of an implementation that uses the strategy discussed in the paper as a building block for the computation of the SVD of general matrices.
%B LAPACK Working Note
%I University of Tennessee
%8 2018-04
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1177/1094342018778123

%0 Conference Paper
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2018
%T Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms
%A Yves Caniou
%A Eddy Caron
%A Aurélie Kong Win Chang
%A Yves Robert
%K budget aware algorithm
%K multi criteria scheduling
%K workflow
%X This paper introduces several budget-aware algorithms to deploy scientific workflows on IaaS cloud platforms, where users can request Virtual Machines (VMs) of different types, each with specific cost and speed parameters. We use a realistic application/platform model with stochastic task weights, and VMs communicating through a datacenter. We extend two well-known algorithms, MinMin and HEFT, and make scheduling decisions based upon machine availability and available budget. During the mapping process, the budget-aware algorithms make conservative assumptions to avoid exceeding the initial budget; we further improve our results with refined versions that aim at re-scheduling some tasks onto faster VMs, thereby spending any budget fraction left over by the first allocation. These refined variants are much more time-consuming than the former algorithms, so there is a trade-off to find in terms of scalability. We report an extensive set of simulations with workflows from the Pegasus benchmark suite. Most of the time our budget-aware algorithms succeed in achieving efficient makespans while enforcing the given budget, despite (i) the uncertainty in task weights and (ii) the heterogeneity of VMs in both cost and speed values.
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%R 10.1109/IPDPSW.2018.00014

%0 Journal Article
%J IEEE Transactions on Computers
%D 2018
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
%B IEEE Transactions on Computers
%V 67
%P 1105–1120
%8 2018-08
%G eng
%U http://ieeexplore.ieee.org/document/8279499/
%N 8

%0 Generic
%D 2018
%T A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN
%A James Ahrens
%A Christopher M. Biwer
%A Alexandru Costan
%A Gabriel Antoniu
%A Maria S. Pérez
%A Nenad Stojanovic
%A Rosa Badia
%A Oliver Beckstein
%A Geoffrey Fox
%A Shantenu Jha
%A Micah Beck
%A Terry Moore
%A Sunita Chandrasekaran
%A Carlos Costa
%A Thierry Deutsch
%A Luigi Genovese
%A Tarek El-Ghazawi
%A Ian Foster
%A Dennis Gannon
%A Toshihiro Hanawa
%A Tevfik Kosar
%A William Kramer
%A Madhav V. Marathe
%A Christopher L. Barrett
%A Takemasa Miyoshi
%A Alex Pothen
%A Ariful Azad
%A Judy Qiu
%A Bo Peng
%A Ravi Teja
%A Sahil Tyagi
%A Chathura Widanage
%A Jon Koskey
%A Maryam Rahnemoonfar
%A Umakishore Ramachandran
%A Miles Deegan
%A William Tang
%A Osamu Tatebe
%A Michela Taufer
%A Michel Cuende
%A Ewa Deelman
%A Trilce Estrada
%A Rafael Ferreira Da Silva
%A Harrel Weinstein
%A Rodrigo Vargas
%A Miwako Tsuji
%A Kevin G. Yager
%A Wanling Gao
%A Jianfeng Zhan
%A Lei Wang
%A Chunjie Luo
%A Daoyi Zheng
%A Xu Wen
%A Rui Ren
%A Chen Zheng
%A Xiwen He
%A Hainan Ye
%A Haoning Tang
%A Zheng Cao
%A Shujie Zhang
%A Jiahui Dai
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2018-11
%G eng

%0 Journal Article
%J Journal of Advances in Modeling Earth Systems
%D 2018
%T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling
%A Jian Sun
%A Joshua Fu
%A John Drake
%A Qingzhao Zhu
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K compiler
%K CUDA
%K data transfer
%K gpu
%K hybrid
%K memory layout
%X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulation with regional detail possible, but limited study has been done to explore the potential benefit for the atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to ~11.02× difference in the computational time. In contrast, the GPU version performs the best when using a combination of fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the implementation of the hybrid CPU/GPU version, but rescheduling the workload among the CPU cores is required before the practical CAM4‐Chem simulation.
%B Journal of Advances in Modeling Earth Systems
%V 10
%P 1952–1969
%8 2018-08
%G eng
%N 8
%R https://doi.org/10.1029/2018MS001276

%0 Journal Article
%J Parallel Computing
%D 2018
%T Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors
%A Henri Casanova
%A Julien Herrmann
%A Yves Robert
%K Expected makespan
%K fault-tolerance
%K scheduling
%K Scientific workflows
%K silent errors
%K Task graphs
%X Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many domains, including popular scientific workflows. DAG scheduling has thus received an enormous amount of attention. Many of the popular DAG scheduling heuristics make scheduling decisions based on path lengths. At large scale, compute platforms are subject to various types of failures with non-negligible probabilities of occurrence. Failures that have recently received increased attention are “silent errors,” which cause data corruption. Tolerating silent errors is done by checking the validity of computed results and possibly re-executing a task from scratch. The execution time of a task then becomes a random variable, and so do path lengths in a DAG. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics in this context is challenging. In this work, we propose an algorithm that computes a first order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We find that our proposed approximation outperforms previously proposed approaches both in terms of approximation error and of speed.
%B Parallel Computing
%V 75
%P 41–60
%8 2018-07
%G eng
%R https://doi.org/10.1016/j.parco.2018.03.004

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Franck Cappello
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K fail-stop errors
%K Fault tolerance
%K High-performance computing
%K Replication
%K silent errors
%X This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms. Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detection mechanism. To detect silent errors, many application-specific techniques are available, either based on algorithms (e.g., ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and even to correct some errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as well as when fail-stop errors have struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that nicely corroborates the analytical model.
%B Journal of Parallel and Distributed Computing
%V 122
%P 209–225
%8 2018-12
%G eng
%R https://doi.org/10.1016/j.jpdc.2018.08.002

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2018
%T Co-Scheduling Amdahl Applications on Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Sicheng Dai
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K cache partitioning
%K co-scheduling
%K complexity results
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are as follows: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B International Journal of High Performance Computing Applications
%V 32
%P 123–138
%8 2018-01
%G eng
%N 1
%R https://doi.org/10.1177/1094342017710806

%0 Conference Paper
%B Cluster 2018
%D 2018
%T Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Brice Goglin
%A Loïc Pottier
%A Yves Robert
%B Cluster 2018
%I IEEE Computer Society Press
%C Belfast, UK
%8 2018-09
%G eng

%0 Generic
%D 2018
%T Data Movement Interfaces to Support Dataflow Runtimes
%A Aurelien Bouteiller
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%X This document presents the design study and reports on the implementation of a portable hosted accelerator device support in the PaRSEC Dataflow Tasking at Exascale runtime, undertaken as part of the ECP contract 17-SC-20-SC. The document discusses different technological approaches to transfer data to/from hosted accelerators, issues recommendations for technology providers, and presents the design of an OpenMP-based accelerator support in PaRSEC.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-05
%G eng

%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2018)
%D 2018
%T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Panruo Wu
%A Srikara Pranesh
%A Stanimire Tomov
%A Jack Dongarra
%X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both the hardware features and algorithms is an effective solution to achieve power efficiency, and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most of the energy efficiency approaches aim to reduce the consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency. Compared to highly-optimized linear system solvers, our kernels deliver the same accuracy solution up to 2× faster and reduce the energy consumption up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to 4×, with more than 80% reduction in the energy consumption.
%B International Conference on Computational Science (ICCS 2018)
%I Springer
%C Wuxi, China
%V 10860
%P 586–600
%8 2018-06
%G eng
%U https://rdcu.be/bcKSC
%R https://doi.org/10.1007/978-3-319-93698-7_45

%0 Generic
%D 2018
%T Distributed Termination Detection for HPC Task-Based Environments
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%A Jack Dongarra
%X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng

%0 Conference Paper
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%D 2018
%T Do moldable applications perform better on failure-prone HPC platforms?
%A Valentin Le Fèvre
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Atsushi Hori
%A Yves Robert
%A Jack Dongarra
%X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage.
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%S LNCS
%I Springer Verlag
%C Turin, Italy
%8 2018-08
%G eng

%0 Conference Proceedings
%B OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence
%D 2018
%T Evaluating Contexts in OpenSHMEM-X Reference Implementation
%A Aurelien Bouteiller
%A Pophale, Swaroop
%A Swen Boehm
%A Baker, Matthew B.
%A Manjunath Gorentla Venkata
%E Manjunath Gorentla Venkata
%E Imam, Neena
%E Pophale, Swaroop
%X Many-core processors are now ubiquitous in supercomputing. This evolution pushes toward the adoption of mixed models in which cores are exploited with threading models (and related programming abstractions, such as OpenMP), while communication between distributed memory domains employs a communication Application Programming Interface (API). OpenSHMEM is a partitioned global address space communication specification that exposes one-sided and synchronization operations. As the threaded semantics of OpenSHMEM are being fleshed out by its standardization committee, it is important to assess the soundness of the proposed concepts. This paper implements and evaluates the ``context'' extension in relation to threaded operations. We discuss the implementation challenges of the context and the associated API in OpenSHMEM-X. We then evaluate its performance in threaded situations on the InfiniBand network using micro-benchmarks and the Random Access benchmark and see that adding communication contexts significantly improves the message rate achievable by the executing multi-threaded PEs.
%B OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence
%I Springer International Publishing
%C Cham
%P 50–62
%@ 978-3-319-73814-7
%G eng
%R https://doi.org/10.1007/978-3-319-73814-7_4

%0 Generic
%D 2018
%T Evaluation and Design of FFT for Distributed Accelerated Systems
%A Stanimire Tomov
%A Azzam Haidar
%A Daniel Schultz
%A Jack Dongarra
%B ECP WBS 2.3.3.09 Milestone Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-10
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms
%D 2018
%T Evaluation of Dataflow Programming Models for Electronic Structure Theory
%A Heike Jagode
%A Anthony Danalis
%A Reazul Hoque
%A Mathieu Faverge
%A Jack Dongarra
%K CCSD
%K coupled cluster methods
%K dataflow
%K NWChem
%K OpenMP
%K parsec
%K StarPU
%K task-based runtime
%X Dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. In this paper, we evaluate different dataflow programming models for electronic structure methods and compare them in terms of programmability, resource utilization, and scalability. In particular, we evaluate two programming paradigms for expressing scientific applications in a dataflow form: (1) explicit dataflow, where the dataflow is specified explicitly by the developer, and (2) implicit dataflow, where a task scheduling runtime derives the dataflow using per-task data-access information embedded in a serial program. We discuss our findings and present a thorough experimental analysis using methods from the NWChem quantum chemistry application as our case study, and OpenMP, StarPU, and PaRSEC as the task-based runtimes that enable the different forms of dataflow execution. Furthermore, we derive an abstract model to explore the limits of the different dataflow programming paradigms.
%B Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms
%V 2018
%P 1–20
%8 2018-05
%G eng
%N e4490
%R https://doi.org/10.1002/cpe.4490

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T A Failure Detector for HPC Platforms
%A George Bosilca
%A Aurelien Bouteiller
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Pierre Sens
%A Jack Dongarra
%K failure detection
%K Fault tolerance
%K MPI
%X Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage uses a nonuniform variant of a reliable broadcast over a circulant graph overlay network and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan Oak Ridge National Laboratory supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm.
%B The International Journal of High Performance Computing Applications
%V 32
%P 139–158
%8 2018-01
%G eng
%N 1
%R https://doi.org/10.1177/1094342017711505

%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%X This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, unlike fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CkptAll and CkptNone, for a wide variety of workflows.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Stanimire Tomov
%A Jack Dongarra
%X We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6x for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 973–984
%8 2018-05
%G eng
%N 5
%R 10.1109/TPDS.2017.2783929

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%D 2018
%T Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%X Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide as well as to the improved accuracy over the classical FP16 arithmetic that is obtained because the GEMM accumulation occurs in FP32 arithmetic.
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%I IEEE
%C Dallas, TX
%8 2018-11
%G eng
%R https://doi.org/10.1109/SC.2018.00050

%0 Generic
%D 2018
%T Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC), Poster
%C San Jose, CA
%8 2018-03
%G eng

%0 Conference Paper
%B 8th Workshop on Irregular Applications: Architectures and Algorithms
%D 2018
%T High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation
%A Anzt, Hartwig
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%A Scheidegger, Florian
%B 8th Workshop on Irregular Applications: Architectures and Algorithms
%G eng

%0 Generic
%D 2018
%T Implementation of the C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 07

%0 Journal Article
%J Parallel Computing
%D 2018
%T Incomplete Sparse Approximate Inverses for Parallel Preconditioning
%A Hartwig Anzt
%A Thomas Huckle
%A Jürgen Bräckle
%A Jack Dongarra
%X In this paper, we propose a new preconditioning method that can be seen as a generalization of block-Jacobi methods, or as a simplification of the sparse approximate inverse (SAI) preconditioners. The “Incomplete Sparse Approximate Inverses” (ISAI) method is particularly efficient in the solution of sparse triangular linear systems of equations. Those arise, for example, in the context of incomplete factorization preconditioning. ISAI preconditioners can be generated via an algorithm providing fine-grained parallelism, which makes them attractive for hardware with a high concurrency level. In a study covering a large number of matrices, we identify the ISAI preconditioner as an attractive alternative to exact triangular solves in the context of incomplete factorization preconditioning.
%B Parallel Computing
%V 71
%P 1–22
%8 2018-01
%G eng
%U http://www.sciencedirect.com/science/article/pii/S016781911730176X
%! Parallel Computing
%R 10.1016/j.parco.2017.10.003

%0 Generic
%D 2018
%T Initial Integration and Evaluation of SLATE and STRUMPACK
%A Pieter Ghysels
%A Sherry Li
%A Asim YarKhan
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng

%0 Generic
%D 2018
%T Initial Integration and Evaluation of SLATE Parallel BLAS in LATTE
%A Asim YarKhan
%A Gerald Ragghianti
%A Jack Dongarra
%A Marc Cawkwell
%A Danny Perez
%A Arthur Voter
%B Innovative Computing Laboratory Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2018
%T Investigating Power Capping toward Energy-Efficient Scientific Applications
%A Azzam Haidar
%A Heike Jagode
%A Phil Vaccaro
%A Asim YarKhan
%A Stanimire Tomov
%A Jack Dongarra
%K energy efficiency
%K High Performance Computing
%K Intel Xeon Phi
%K Knights Landing
%K papi
%K performance analysis
%K Performance Counters
%K power efficiency
%X The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. In this paper, we study and investigate the practice of controlling a compute system's power usage, and we explore how different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes. Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms.
%B Concurrency and Computation: Practice and Experience
%V 2018
%P 1-14
%8 2018-04
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4485
%N e4485
%R https://doi.org/10.1002/cpe.4485

%0 Conference Paper
%B SBAC-PAD
%D 2018
%T A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs
%A Anzt, Hartwig
%A Jack Dongarra
%B SBAC-PAD
%I IEEE
%C Lyon, France
%G eng
%U https://ieeexplore.ieee.org/document/8645946

%0 Generic
%D 2018
%T Least Squares Performance Report
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-12
%G eng
%9 SLATE Working Notes
%1 09

%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng
%9 SLATE Working Notes
%1 08

%0 Generic
%D 2018
%T MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster
%C Dallas, TX
%8 2018-11
%G eng

%0 Generic
%D 2018
%T MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR)
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Ichitaro Yamazaki
%A Jack Dongarra
%I NSF PI Meeting, Poster
%C Washington, DC
%8 2018-04
%G eng
%R https://doi.org/10.6084/m9.figshare.6174143.v3

%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Multi-Level Checkpointing and Silent Error Detection for Linear Workflows
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%X We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.
%B Journal of Computational Science
%V 28
%P 398–415
%8 2018-09
%G eng

%0 Conference Paper
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award
%D 2018
%T Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%R 10.1109/IPDPSW.2018.00127

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Optimization and Performance Evaluation of the IDR Iterative Krylov Solver on GPUs
%A Hartwig Anzt
%A Moritz Kreutzer
%A Eduardo Ponce
%A Gregory D. Peterson
%A Gerhard Wellein
%A Jack Dongarra
%K co-design
%K gpu
%K Induced dimension reduction (IDR)
%K kernel fusion
%K kernel overlap
%K roofline performance model
%X In this paper, we present an optimized GPU implementation for the induced dimension reduction algorithm. We improve data locality, combine it with an efficient sparse matrix vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound.
%B The International Journal of High Performance Computing Applications
%V 32
%P 220–230
%8 2018-03
%G eng
%R https://doi.org/10.1177/1094342016646844

%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC’18)
%D 2018
%T Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X This paper introduces several frameworks for the design and implementation of high performance GPU kernels that target batch workloads with irregular sizes. Such workloads are ubiquitous in many scientific applications, including sparse direct solvers, astrophysics, and quantum chemistry. The paper addresses two main categories of frameworks, taking the Cholesky factorization as a case study. The first uses host-side kernel launches, and the second uses device-side launches. Within each category, different design options are introduced, with an emphasis on the advantages and the disadvantages of each approach. Our best performing design outperforms the state-of-the-art CPU implementation, scoring up to 4.7× speedup in double precision on a Pascal P100 GPU.
%B IEEE High Performance Extreme Computing Conference (HPEC’18)
%I IEEE
%C Waltham, MA
%8 2018-09
%G eng

%0 Generic
%D 2018
%T PAPI: Counting outside the Box
%A Anthony Danalis
%A Heike Jagode
%A Jack Dongarra
%I 8th JLESC Meeting
%C Barcelona, Spain
%8 2018-04
%G eng

%0 Generic
%D 2018
%T PAPI's New Software-Defined Events for In-Depth Performance Analysis
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%X One of the most recent developments of the Performance API (PAPI) is the addition of Software-Defined Events (SDE). PAPI has successfully served the role of the abstraction and unification layer for hardware performance counters for over a decade. This talk presents our effort to extend this role to encompass performance critical information that does not originate in hardware, but rather in critical software layers, such as libraries and runtime systems. Our overall objective is to enable monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through one consistent PAPI interface. Performance analysts will be able to form a complete picture of the entire application performance without learning new instrumentation primitives. In this talk, we outline PAPI's new SDE API and showcase the usefulness of SDE through its employment in software layers as diverse as the math library MAGMA, the dataflow runtime PaRSEC, and the state-of-the-art chemistry application NWChem. We outline the process of instrumenting these software packages and highlight the performance information that can be acquired with SDEs.
%I CCDSC 2018: Workshop on Clusters, Clouds, and Data for Scientific Computing
%C Lyon, France
%8 2018-09
%G eng

%0 Generic
%D 2018
%T Parallel BLAS Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I University of Tennessee
%8 2018-04
%G eng
%1 05

%0 Generic
%D 2018
%T Parallel Norms Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 06

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2018
%T ParILUT - A New Parallel Threshold ILU
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X We propose a parallel algorithm for computing a threshold incomplete LU (ILU) factorization. The main idea is to interleave a parallel fixed-point iteration that approximates an incomplete factorization for a given sparsity pattern with a procedure that adjusts the pattern. We describe and test a strategy for identifying nonzeros to be added and nonzeros to be removed from the sparsity pattern. The resulting pattern may be different and more effective than that of existing threshold ILU algorithms. Also in contrast to other parallel threshold ILU algorithms, much of the new algorithm has fine-grained parallelism.
%B SIAM Journal on Scientific Computing
%I SIAM
%V 40
%P C503–C519
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1137/16M1079506

%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures
%A Anne Benoit
%A Swann Perarnau
%A Loïc Pottier
%A Yves Robert
%X This work presents a realistic performance model to execute scientific workflows on high-bandwidth memory architectures such as the Intel Knights Landing. We provide a detailed analysis of the execution time on such platforms, taking into account transfers from both fast and slow memory and their overlap with computations. We discuss several scheduling and mapping strategies: not only must tasks be assigned to computing resources, but one also has to decide which fraction of input and output data will reside in fast memory, and which will have to stay in slow memory. Extensive simulations allow us to assess the impact of the mapping strategies on performance. We also conduct actual experiments for a simple 1D Gauss-Seidel kernel, which assess the accuracy of the model and further demonstrate the importance of a tuned memory management. Altogether, our model and results lay the foundations for further studies and experiments on dual-memory systems.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng

%0 Journal Article
%J Parallel Computing
%D 2018
%T PMIx: Process Management for Exascale Environments
%A Ralph Castain
%A Joshua Hursey
%A Aurelien Bouteiller
%A David Solt
%B Parallel Computing
%V 79
%P 9–29
%8 2018-01
%G eng
%U https://linkinghub.elsevier.com/retrieve/pii/S0167819118302424
%U https://api.elsevier.com/content/article/PII:S0167819118302424?httpAccept=text/xml
%U https://api.elsevier.com/content/article/PII:S0167819118302424?httpAccept=text/plain
%! Parallel Computing
%R 10.1016/j.parco.2018.08.002

%0 Generic
%D 2018
%T Production Implementations of Pipelined & Communication-Avoiding Iterative Linear Solvers
%A Mark Hoemmen
%A Ichitaro Yamazaki
%I SIAM Conference on Parallel Processing for Scientific Computing
%C Tokyo, Japan
%8 2018-03
%G eng

%0 Book Section
%B Topics in Parallel and Distributed Computing
%D 2018
%T Scheduling for Fault-Tolerance: An Introduction
%A Guillaume Aupy
%A Yves Robert
%B Topics in Parallel and Distributed Computing
%I Springer International Publishing
%P 143–170
%@ 978-3-319-93108-1
%G eng
%R 10.1007/978-3-319-93109-8

%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 2018-11
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732

%0 Generic
%D 2018
%T Software-Defined Events (SDEs) in MAGMA-Sparse
%A Heike Jagode
%A Anthony Danalis
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Stanimire Tomov
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng

%0 Generic
%D 2018
%T Software-Defined Events through PAPI for In-Depth Analysis of Application Performance
%A Anthony Danalis
%A Heike Jagode
%A Jack Dongarra
%I 5th Platform for Advanced Scientific Computing Conference (PASC18)
%C Basel, Switzerland
%8 2018-07
%G eng

%0 Generic
%D 2018
%T Solver Interface & Performance on Cori
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2018
%T A Survey of MPI Usage in the US Exascale Computing Project
%A David E. Bernholdt
%A Swen Boehm
%A George Bosilca
%A Manjunath Gorentla Venkata
%A Ryan E. Grant
%A Thomas Naughton
%A Howard P. Pritchard
%A Martin Schulz
%A Geoffroy R. Vallee
%K exascale
%K MPI
%X The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing “exascale” levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use the Message Passing Interface (MPI), and help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. This paper reports the results of that survey for the benefit of the broader community of MPI developers.
%B Concurrency and Computation: Practice and Experience
%8 2018-09
%G eng
%9 Special Issue
%R https://doi.org/10.1002/cpe.4851

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Panruo Wu
%A Mawussi Zounon
%A Jack Dongarra
%K linear algebra
%K multithreading
%K runtime
%K symmetric indefinite matrices
%X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention, not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures (including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8) demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 1879–1892
%8 2018-08
%G eng
%N 8
%R 10.1109/TPDS.2018.2808964

%0 Journal Article
%J International Journal of Computational Science and Engineering (IJCSE)
%D 2018
%T Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP
%A Joseph Dorris
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X The increasing number of computational cores in modern many-core processors, as represented by the Intel Xeon Phi architectures, has created the need for an open-source, high performance and scalable task-based dense linear algebra package that can efficiently use this type of many-core hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra library, to run effectively on two generations of Intel's Xeon Phi architecture, known as Knights Corner (KNC) and Knights Landing (KNL). First, we modified PLASMA's tiled Cholesky decomposition to use OpenMP tasks for its scheduling mechanism to enable Xeon Phi compatibility. We then compared the performance of our modified code to that of the original dynamic scheduler running on an Intel Xeon Sandy Bridge CPU. Finally, we looked at the performance of the OpenMP tiled Cholesky decomposition on Knights Corner and Knights Landing processors. We detail the optimisations required to obtain performance on these platforms and compare with the highly tuned Intel MKL math library.
%B International Journal of Computational Science and Engineering (IJCSE)
%V 17
%8 2018-10
%G eng
%R http://dx.doi.org/10.1504/IJCSE.2018.095851

%0 Generic
%D 2018
%T Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System
%A George Bosilca
%A Damien Genet
%A Robert Harrison
%A Thomas Herault
%A Mohammad Mahdi Javanmard
%A Chong Peng
%A Edward Valeev
%X The need for predictive simulation of electronic structure in chemistry and materials science calls for fast/reduced-scaling formulations of quantum n-body methods that replace the traditional dense tensors with element-, block-, rank-, and block-rank-sparse (data-sparse) tensors. The resulting, highly irregular data structures are a poor match to the imperative, bulk-synchronous parallel programming style due to the dynamic nature of the problem and to the lack of clear domain decomposition to guarantee a fair load-balance. The TESSE runtime and the associated programming model aim to support performance-portable composition of applications involving irregular and dynamically changing data. In this paper we report an implementation of irregular dense tensor contraction in a paradigmatic electronic structure application based on the TESSE extension of PaRSEC, a distributed hybrid task runtime system, and analyze the resulting performance on a distributed memory cluster of multi-GPU nodes. Unprecedented strong scaling and promising efficiency indicate a viable future for task-based programming of complete production-quality reduced scaling models of electronic structure.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng

%0 Generic
%D 2018
%T Tensor Contractions using Optimized Batch GEMM Routines
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC), Poster
%C San Jose, CA
%8 2018-03
%G eng

%0 Conference Paper
%B ISC High Performance (ISC'18), Best Poster
%D 2018
%T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Jack Dongarra
%B ISC High Performance (ISC'18), Best Poster
%C Frankfurt, Germany
%8 2018-06
%G eng

%0 Generic
%D 2018
%T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Jack Dongarra
%I ISC High Performance (ISC18), Best Poster Award
%C Frankfurt, Germany
%8 2018-06
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning
%A Edmond Chow
%A Hartwig Anzt
%A Jennifer Scott
%A Jack Dongarra
%X When using incomplete factorization preconditioners with an iterative method to solve large sparse linear systems, each application of the preconditioner involves solving two sparse triangular systems. These triangular systems are challenging to solve efficiently on computers with high levels of concurrency. On such computers, it has recently been proposed to use Jacobi iterations, which are highly parallel, to approximately solve the triangular systems from incomplete factorizations. The effectiveness of this approach, however, is problem-dependent: the Jacobi iterations may not always converge quickly enough for all problems. Thus, as a necessary and important step to evaluate this approach, we experimentally test the approach on a large number of realistic symmetric positive definite problems. We also show that by using block Jacobi iterations, we can extend the range of problems for which such an approach can be effective. For block Jacobi iterations, it is essential for the blocking to be cognizant of the matrix structure.
%B Journal of Parallel and Distributed Computing
%V 119
%P 219–230
%8 2018-11
%G eng
%R https://doi.org/10.1016/j.jpdc.2018.04.017

%0 Conference Paper
%B SBAC-PAD
%D 2018
%T Variable-Size Batched Condition Number Calculation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Thomas Gruetzmacher
%B SBAC-PAD
%C Lyon, France
%8 2018-09
%G eng
%U https://ieeexplore.ieee.org/document/8645907

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2017
%T Accelerating NWChem Coupled Cluster through Dataflow-Based Execution
%A Heike Jagode
%A Anthony Danalis
%A Jack Dongarra
%K CCSD
%K dag
%K dataflow
%K NWChem
%K parsec
%K ptg
%K tasks
%X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort of converting NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation.
%B The International Journal of High Performance Computing Applications
%P 1–13
%8 2017-01
%G eng
%U http://journals.sagepub.com/doi/10.1177/1094342016672543
%R 10.1177/1094342016672543

%0 Generic
%D 2017
%T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation
%C Atlanta, GA
%8 2017-03
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2017
%T Argobots: A Lightweight Low-Level Threading and Tasking Framework
%A Sangmin Seo
%A Abdelhalim Amer
%A Pavan Balaji
%A Cyril Bordage
%A George Bosilca
%A Alex Brooks
%A Philip Carns
%A Adrian Castello
%A Damien Genet
%A Thomas Herault
%A Shintaro Iwasaki
%A Prateek Jindal
%A Sanjay Kale
%A Sriram Krishnamoorthy
%A Jonathan Lifflander
%A Huiwei Lu
%A Esteban Meneses
%A Marc Snir
%A Yanhua Sun
%A Kenjiro Taura
%A Pete Beckman
%K Argobots
%K context switch
%K I/O
%K interoperability
%K lightweight
%K MPI
%K OpenMP
%K stackable scheduler
%K tasklet
%K user-level thread
%X In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures or are not as powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by the user or high-level programming model. We describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.
%B IEEE Transactions on Parallel and Distributed Systems
%8 2017-10
%G eng
%U http://ieeexplore.ieee.org/document/8082139/
%R 10.1109/TPDS.2017.2766062

%0 Conference Paper
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%D 2017
%T Assuming failure independence: are we right to be wrong?
%A Guillaume Aupy
%A Yves Robert
%A Frederic Vivien
%X This paper revisits the failure temporal independence hypothesis which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, there are a few logs that contain cascades, but we show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng

%0 Conference Paper
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2017
%T Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yu Pei
%A Jack Dongarra
%K batch computation
%K Cholesky Factorization
%K data layout
%K GPU computing
%K numerical linear algebra
%X Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Orlando, FL
%8 2017-06
%G eng
%R 10.1109/IPDPSW.2017.18

%0 Conference Proceedings
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%D 2017
%T Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K block-Jacobi preconditioner
%K Gauss-Jordan elimination
%K graphics processing units (GPUs)
%K iterative methods
%K matrix inversion
%K sparse linear systems
%X In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%S PMAM'17
%I ACM
%C New York, NY, USA
%P 1–10
%8 2017-02
%@ 978-1-4503-4883-6
%G eng
%U http://doi.acm.org/10.1145/3026937.3026940
%R 10.1145/3026937.3026940

%0 Generic
%D 2017
%T BDEC Pathways to Convergence: Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%E Terry Moore
%E Mark Asch
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-11
%G eng
%U http://www.exascale.org/bdec/report

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2017
%T Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
%A Mathieu Faverge
%A Julien Langou
%A Yves Robert
%A Jack Dongarra
%K Algorithm design and analysis
%K Approximation algorithms
%K Kernel
%K Multicore processing
%K Shape
%K Software algorithms
%K Transforms
%X We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPS.2017.46

%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4

%0 Generic
%D 2017
%T C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Konstantin Arturov
%A Cris Cecka
%A Jack Dongarra
%A Chip Freitag
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Panruo Wu
%B SLATE Working Notes
%I University of Tennessee
%8 2017-12
%G eng
%1 04

%0 Generic
%D 2017
%T C++ API for BLAS and LAPACK
%A Mark Gates
%A Piotr Luszczek
%A Ahmad Abdelfattah
%A Jakub Kurzak
%A Jack Dongarra
%A Konstantin Arturov
%A Cris Cecka
%A Chip Freitag
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%1 02

%0 Generic
%D 2017
%T The Case for Directive Programming for Accelerator Autotuner Optimization
%A Diana Fayad
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Jack Dongarra
%X In this work, we present the use of compiler pragma directives for parallelizing autotuning of specialized compute kernels for hardware accelerators. We describe a set of constructs for parallelizing source code that prunes a generated search space with a large number of constraints for an autotuning infrastructure. For better performance, we studied optimizations aimed at minimizing the run time. We also studied the behavior of the parallel load balance and the speedup on four different machines: x86, Xeon Phi, ARMv8, and POWER8.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-10
%G eng

%0 Conference Paper
%B IEEE Cluster
%D 2017
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs). It turns out that many real-world workflow applications are naturally structured as M-SPGs. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter.
%B IEEE Cluster
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng

%0 Generic
%D 2017
%T Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%I SIAM Annual Meeting
%C Pittsburgh, Pennsylvania
%8 2017-07
%G eng

%0 Conference Paper
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%D 2017
%T Co-Scheduling Algorithms for Cache-Partitioned Systems
%A Guillaume Aupy
%A Anne Benoit
%A Loïc Pottier
%A Padma Raghavan
%A Yves Robert
%A Manu Shantharam
%K Computational modeling
%K Degradation
%K Interference
%K Mathematical model
%K Program processors
%K Supercomputers
%K Throughput
%X Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
%B 19th Workshop on Advances in Parallel and Distributed Computational Models
%I IEEE Computer Society Press
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPSW.2017.60

%0 Generic
%D 2017
%T Dataflow Programming Paradigms for Computational Chemistry Methods
%A Heike Jagode
%X The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability. This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster methods (CC), which are the "gold standard" to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains. Two programming paradigms for expressing CC methods in a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, the programming model where the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, where a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%C Knoxville, TN
%8 2017-05
%U http://trace.tennessee.edu/utk_graddiss/4469/
%9 PhD Dissertation (Computer Science)

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Pedro Valero-Lara
%A Mawussi Zounon
%K Batched BLAS
%K BLAS
%K High-performance computing
%K Memory management
%K Parallel processing
%K Scientific computing
%X A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems, that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their effect on the performance. In particular we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications our code outperforms both MKL and CuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
%B International Conference on Computational Science (ICCS 2017)
%I Elsevier
%C Zürich, Switzerland
%8 2017-06
%G eng
%R 10.1016/j.procs.2017.05.138

%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%1 03

%0 Conference Proceedings
%B ScalA17
%D 2017
%T Dynamic Task Discovery in PaRSEC- A data-flow task-based Runtime
%A Reazul Hoque
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%K data-flow
%K dynamic task-graph
%K parsec
%K task-based runtime
%X Successfully exploiting distributed collections of heterogeneous many-cores architectures with complex memory hierarchy through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intranode features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute to both local and distributed data dependencies management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD) in terms of capabilities, overhead and potential benefits.
%B ScalA17
%I ACM
%C Denver
%8 2017-09
%@ 978-1-4503-5125-6
%G eng
%U https://dl.acm.org/citation.cfm?doid=3148226.3148233
%R 10.1145/3148226.3148233

%0 Conference Paper
%B ACM MultiMedia Workshop 2017
%D 2017
%T Efficient Communications in Training Large Scale Neural Networks
%A Yiyang Zhao
%A Linnan Wang
%A Wei Wu
%A George Bosilca
%A Richard Vuduc
%A Jinmian Ye
%A Wenqi Tang
%A Zenglin Xu
%X We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominate overall execution time and limit parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) scales like O(log P). LP also demonstrates up to 2x higher bandwidth than Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
%B ACM MultiMedia Workshop 2017
%I ACM
%C Mountain View, CA
%8 2017-10
%G eng

%0 Journal Article
%J Procedia Computer Science
%D 2017
%T Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X This paper presents new algorithmic approaches and optimization techniques for LU factorization and matrix inversion of millions of very small matrices using GPUs. These problems appear in many scientific applications including astrophysics and generation of block-Jacobi preconditioners. We show that, for very small problem sizes, design and optimization of GPU kernels require a mindset different from the one usually used when designing LAPACK algorithms for GPUs. Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. We also take advantage of the small matrix sizes to eliminate the intermediate row interchanges in both the factorization and inversion kernels. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6× for the factorization, and 14× for the inversion, using double precision arithmetic on a Pascal P100 GPU.
%B Procedia Computer Science
%V 108
%P 606–615
%8 2017-06
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.250

%0 Journal Article
%J Journal of Computational Science
%D 2017
%T Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K GPU computing
%K Cholesky factorization
%K Batched execution
%X This paper presents a GPU-accelerated Cholesky factorization for two different modes of operation. The first one is the batch mode, where many independent factorizations on small matrices can be performed concurrently. This mode supports fixed size and variable size problems, and is found in many scientific applications. The second mode is the native mode, where one factorization is performed on a large matrix without any CPU involvement, which allows the CPU to do other useful work. We show that, despite the different workloads, both modes of operation share a common code-base that uses the GPU only. We also show that the developed routines achieve significant speedups against a multicore CPU using the MKL library, and against a GPU implementation by cuSOLVER. This work is part of the MAGMA library.
%B Journal of Computational Science
%V 20
%P 85–93
%8 2017-05
%G eng
%R https://doi.org/10.1016/j.jocs.2016.12.009

%0 Generic
%D 2017
%T Flexible Batched Sparse Matrix Vector Product on GPUs
%A Hartwig Anzt
%A Gary Collins
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%I ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C Denver, Colorado
%8 2017-11
%G eng

%0 Conference Paper
%B 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%D 2017
%T Flexible Batched Sparse Matrix-Vector Product on GPUs
%A Hartwig Anzt
%A Gary Collins
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%X We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support the three most commonly used sparse storage formats: CSR, COO and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.
%B 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%I ACM Press
%C Denver, CO
%8 2017-11
%G eng
%R http://dx.doi.org/10.1145/3148226.3148230

%0 Journal Article
%J ISC High Performance 2017
%D 2017
%T A Framework for Out of Memory SVD Algorithms
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Aurelien Bouteiller
%A Jack Dongarra
%X Many important applications – from big data analytics to information retrieval, gene expression analysis, and numerical weather prediction – require the solution of large dense singular value decompositions (SVD). In many cases the problems are too large to fit into the computer’s main memory, and thus require specialized out-of-core algorithms that use disk storage. In this paper, we analyze the SVD communications, as related to hierarchical memories, and design a class of algorithms that minimizes them. This class includes out-of-core SVDs but can also be applied between other consecutive levels of the memory hierarchy, e.g., GPU SVD using the CPU memory for large problems. We call these out-of-memory (OOM) algorithms. To design OOM SVDs, we first study the communications for both classical one-stage blocked SVD and two-stage tiled SVD. We present the theoretical analysis and strategies to design, as well as implement, these communication avoiding OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models.
%B ISC High Performance 2017
%P 158–178
%8 2017-06
%G eng
%R https://doi.org/10.1007/978-3-319-58667-0_9

%0 Conference Paper
%B Proceedings of the General Purpose GPUs (GPGPU-10)
%D 2017
%T High-performance Cholesky Factorization for GPU-only Execution
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8× faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
%B Proceedings of the General Purpose GPUs (GPGPU-10)
%I ACM
%C Austin, TX
%8 2017-02
%G eng
%R https://doi.org/10.1145/3038228.3038237

%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale
%A Anne Benoit
%A Franck Cappello
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%X This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborates the analytical model.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I ACM
%C Washington, DC
%8 2017-06
%G eng
%R 10.1145/3086157.3086162

%0 Conference Proceedings
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%D 2017
%T Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%X We compare the performance of pipelined and s-step GMRES, respectively referred to as l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of two algorithms, we propose another variant, (l, t)-GMRES, that not only does fewer global all-reduces than standard GMRES, but also overlaps those all-reduces with other work. We implemented the thread-parallelism and communication-overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, though we only used 50 nodes, when the latency cost became significant, our variant performed up to 1.22× better than s-GMRES by hiding all-reduces.
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%C Orlando, FL
%8 2017-06
%G eng
%R https://doi.org/10.1109/IPDPSW.2017.65

%0 Conference Paper
%B ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2017
%T Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers
%A Azzam Haidar
%A Panruo Wu
%A Stanimire Tomov
%A Jack Dongarra
%X The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today’s powerful manycore GPU accelerators, e.g., the NVIDIA V100, which can provide 120 TeraFLOPS in FP16 alone. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique – we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly-tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for the first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance and limitations of the approach.
%B ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I ACM
%C Denver, CO
%8 2017-11
%G eng

%0 Generic
%D 2017
%T LAWN 294: Aasen's Symmetric Indefinite Linear Solvers in LAPACK
%A Ichitaro Yamazaki
%A Jack Dongarra
%X Recently, we released two LAPACK subroutines that implement Aasen's algorithms for solving a symmetric indefinite linear system of equations. The first implementation is based on a partitioned right-looking variant of Aasen's algorithm (the column-wise left-looking panel factorization, followed by the right-looking trailing submatrix update using the panel). The second implements the two-stage left-looking variant of the algorithm (the block-wise left-looking algorithm that reduces the matrix to the symmetric band form, followed by the band LU factorization). In this report, we discuss our implementations and present our experimental results to compare the stability and performance of these two new solvers with those of the other two symmetric indefinite solvers in LAPACK (i.e., the Bunch-Kaufman and rook pivoting algorithms).
%B LAPACK Working Note
%I University of Tennessee
%8 2017-12
%G eng

%0 Journal Article
%J International Journal of High Performance Computing and Networking
%D 2017
%T A Look Back on 30 Years of the Gordon Bell Prize
%A Gordon Bell
%A David Bailey
%A Alan H. Karp
%A Jack Dongarra
%A Kevin Walsh
%K benchmarks
%K Computational Science
%K Gordon Bell Prize
%K High Performance Computing
%K HPC Cost-Performance
%K HPC Progress
%K HPC Recognition
%K HPC special hardware
%K HPPC Award
%K HPC Prize
%K Technical Computing
%X The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.
%B International Journal of High Performance Computing Applications
%V 31
%P 469–484
%G eng
%U http://journals.sagepub.com/doi/10.1177/1094342017738610
%N 6

%0 Generic
%D 2017
%T MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs
%A Stanimire Tomov
%A Azzam Haidar
%I GPU Technology Conference (GTC17), Presentation in Session S7728
%C San Jose, CA
%8 2017-05
%G eng

%0 Generic
%D 2017
%T MagmaDNN – High-Performance Data Analytics for Manycore GPUs and CPUs
%A Lucien Ng
%A Kwai Wong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I 2017 Summer Research Experiences for Undergraduates (REU), Presentation
%C Knoxville, TN
%8 2017-12
%G eng

%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as a third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as a static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report

%0 Conference Paper
%B International Conference on Supercomputing (ICS '17)
%D 2017
%T Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%B International Conference on Supercomputing (ICS '17)
%I ACM
%C Chicago, Illinois
%8 2017-06
%G eng
%U http://dl.acm.org/citation.cfm?id=3079103
%R 10.1145/3079079.3079103

%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Optimal Checkpointing Period with replicated execution on heterogeneous platforms
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%X In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try and execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I IEEE Computer Society Press
%C Washington, DC
%8 2017-06
%G eng
%R 10.1145/3086157.3086165

%0 Conference Paper
%B Euro-Par 2017
%D 2017
%T Optimized Batched Linear Algebra for Modern Architectures
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Mawussi Zounon
%X Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, 2 × 2 for example, this design exhibits two main issues: it fails to fully utilize the vector units and the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from 2 × 2 to 32 × 32. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
%B Euro-Par 2017
%I Springer
%C Santiago de Compostela, Spain
%8 2017-08
%G eng
%R https://doi.org/10.1007/978-3-319-64203-1_37

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices
%A Tingxing Dong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X A challenging class of problems arising in many GPU applications, called batched problems, involves linear algebra operations on many small-sized matrices. We designed batched BLAS (Basic Linear Algebra Subroutines) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. We proposed device functions and big-tile settings in our batched BLAS design. We adopted auto-tuning to optimize different instances of GEMV routines. We illustrated our batched BLAS approach to optimize batched bi-diagonalization progressively on a K40c GPU. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%8 2017-06
%G eng
%U http://www.sciencedirect.com/science/article/pii/S1877050917308645
%R https://doi.org/10.1016/j.procs.2017.05.237

%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17)
%D 2017
%T Out of Memory SVD Solver for Big Data
%A Azzam Haidar
%A Khairul Kabir
%A Diana Fayad
%A Stanimire Tomov
%A Jack Dongarra
%X Many applications – from data compression to numerical weather prediction and information retrieval – need to compute large dense singular value decompositions (SVD). When the problems are too large to fit into the computer’s main memory, specialized out-of-core algorithms that use disk storage are required. A typical example is when trying to analyze a large data set through tools like MATLAB or Octave, but the data is just too large to be loaded. To overcome this, we designed a class of out-of-memory (OOM) algorithms to reduce, as well as overlap, communication with computation. Of particular interest are OOM algorithms for matrices of size m×n, where m >> n or m << n, e.g., corresponding to cases of too many variables, or too many observations. To design OOM SVDs, we first study the communication cost for the SVD techniques as well as for the QR/LQ factorization followed by SVD. We present a theoretical analysis of the data movement cost and strategies to design OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models. Moreover, our experimental results show the feasibility and superiority of the OOM SVD.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17)
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng

%0 Book Section
%B Exascale Scientific Applications: Scalability and Performance Portability
%D 2017
%T Performance Analysis and Debugging Tools at Scale
%A Scott Parker
%A John Mellor-Crummey
%A Dong H. Ahn
%A Heike Jagode
%A Holger Brunst
%A Sameer Shende
%A Allen D. Malony
%A David DelSignore
%A Ronny Tschuter
%A Ralph Castain
%A Kevin Harms
%A Philip Carns
%A Ray Loy
%A Kalyan Kumaran
%X This chapter explores present-day challenges and those likely to arise as new hardware and software technologies are introduced on the path to exascale. It covers some of the underlying hardware, software, and techniques that enable tools and debuggers. Performance tools and debuggers are critical components that enable computational scientists to fully exploit the computing power of high-performance computing systems. Instrumentation is the insertion of code to perform measurement in a program. It is a vital step in performance analysis, especially for parallel programs. The essence of a debugging tool is enabling observation, exploration, and control of program state, such that a developer can, for example, verify that what is currently occurring correlates to what is intended. The increased complexity and volume of performance and debugging data likely to be seen on exascale systems risks overwhelming tool users. Tools and debuggers may need to develop advanced techniques such as automated filtering and analysis to reduce the complexity seen by the user.
%B Exascale Scientific Applications: Scalability and Performance Portability
%I Chapman & Hall / CRC Press
%P 17-50
%8 2017-11
%@ 9781315277400
%G eng
%& 2
%R https://doi.org/10.1201/b21930

%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng

%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng

%0 Conference Proceedings
%B Proceedings of the 24th European MPI Users' Group Meeting
%D 2017
%T PMIx: Process Management for Exascale Environments
%A Castain, Ralph H.
%A David Solt
%A Joshua Hursey
%A Aurelien Bouteiller
%X High-Performance Computing (HPC) applications have historically executed in static resource allocations, using programming models that ran independently from the resident system management stack (SMS). Achieving exascale performance that is both cost-effective and fits within site-level environmental constraints will, however, require that the application and SMS collaboratively orchestrate the flow of work to optimize resource utilization and compensate for on-the-fly faults. The Process Management Interface - Exascale (PMIx) community is committed to establishing scalable workflow orchestration by defining an abstract set of interfaces by which not only applications and tools can interact with the resident SMS, but also the various SMS components can interact with each other. This paper presents a high-level overview of the goals and current state of the PMIx standard, and lays out a roadmap for future directions.
%B Proceedings of the 24th European MPI Users' Group Meeting
%S EuroMPI '17
%I ACM
%C New York, NY, USA
%P 14:1–14:10
%@ 978-1-4503-4849-2
%G eng
%U http://doi.acm.org/10.1145/3127024.3127027
%R 10.1145/3127024.3127027

%0 Generic
%D 2017
%T POMPEI: Programming with OpenMP4 for Exascale Investigations
%A Jack Dongarra
%A Azzam Haidar
%A Oscar Hernandez
%A Stanimire Tomov
%A Manjunath Gorentla Venkata
%X The objective of the Programming with OpenMP4 for Exascale Investigations (POMPEI) project is to explore new task-based programming techniques together with data structure centric programming for scientific applications to harness the potential of extreme-scale systems. Tasking is by now a well-established approach on such systems, as it has been used successfully to handle their large-scale parallelism and heterogeneity, which are leading challenges on the way to exascale computing. The approach is to harness the latest features of OpenMP4.5 and OpenACC2.5 to design abstractions shared among tasks and mapped efficiently to data-structure driven programming paradigms. This technical report describes the approach, along with its reference implementation and results for dense linear algebra algorithms.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-12
%G eng

%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist
%D 2017
%T Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi
%A Azzam Haidar
%A Heike Jagode
%A Asim YarKhan
%A Phil Vaccaro
%A Stanimire Tomov
%A Jack Dongarra
%X The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, in particular for peta- and exa-scale systems. Understanding and improving the energy efficiency of numerical simulation becomes crucial. We present a detailed study and investigation toward controlling power usage and exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and determine the impact and correlation with performance of scientific applications. Our analysis is performed using a set of representative kernels, as well as many highly used scientific benchmarks. We quantify a number of power and performance measurements, and draw observations and conclusions that can be viewed as a roadmap toward achieving energy-efficient computing algorithms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091085

%0 Generic
%D 2017
%T Power-Aware HPC on Intel Xeon Phi KNL Processors
%A Azzam Haidar
%A Heike Jagode
%A Asim YarKhan
%A Phil Vaccaro
%A Stanimire Tomov
%A Jack Dongarra
%I ISC High Performance (ISC17), Intel Booth Presentation
%C Frankfurt, Germany
%8 2017-06
%G eng

%0 Journal Article
%J Parallel Computing
%D 2017
%T Preconditioned Krylov Solvers on GPUs
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K gpu
%K ILU
%K Jacobi
%K Krylov solvers
%K Preconditioning
%X In this paper, we study the effect of enhancing GPU-accelerated Krylov solvers with preconditioners. We consider the BiCGSTAB, CGS, QMR, and IDR(s) Krylov solvers. For a large set of test matrices, we assess the impact of Jacobi and incomplete factorization preconditioning on the solvers’ numerical stability and time-to-solution performance. We also analyze how the use of a preconditioner impacts the choice of the fastest solver.
%B Parallel Computing
%8 2017-06
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819117300777
%! Parallel Computing
%R 10.1016/j.parco.2017.05.006

%0 Generic
%D 2017
%T Report on the TianHe-2A System
%A Jack Dongarra
%X The TianHe-2A (TH-2A) compute system, designed by China’s National University of Defense Technology (NUDT), is an upgrade of the TianHe-2 (TH-2) system. TianHe is sometimes referred to as “Milkyway,” and the latest iteration of this system is currently undergoing assembly and testing at China’s National Supercomputer Center in Guangzhou (NSCC-GZ). At the time of this report, the system is 25% complete and should be fully functional by November 2017. The most significant enhancement to the system is the upgrade to the TianHe-2 nodes; the old Intel Xeon Phi Knights Corner (KNC) accelerators will be replaced with a proprietary accelerator called the Matrix-2000. In addition, the network has been enhanced, the memory increased, and the number of cabinets expanded. The completed system, when fully integrated with 4,981,760 cores and 3.4 PB of primary memory, will have a theoretical peak performance of 94.97 Pflop/s, which is roughly double the performance of the existing TianHe-2 system. NUDT also developed the heterogeneous programming environment for the Matrix-2000 with support for OpenMP and OpenCL.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-09
%G eng

%0 Conference Paper
%B International Conference on Parallel Processing (ICPP)
%D 2017
%T Resilience for Stencil Computations with Latent Errors
%A Aiman Fang
%A Aurelien Cavelan
%A Yves Robert
%A Andrew Chien
%X Projections and measurements of error rates in near-exascale and exascale systems suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such a growth makes resilience a critical concern, and may increase the incidence of errors that "escape," silently corrupting application state. Such errors can often be revealed by application software tests but with long latencies, and thus are known as latent errors. We explore how to efficiently recover from latent errors, with an approach called application-based focused recovery (ABFR). Specifically we present a case study of stencil computations, a widely useful computational structure, showing how ABFR focuses recovery effort where needed, using intelligent testing and pruning to reduce recovery effort, and enables recovery effort to be overlapped with application computation. We analyze and characterize the ABFR approach on stencils, creating a performance model parameterized by error rate and detection interval (latency). We compare projections from the model to experimental results with the Chombo stencil application, validating the model and showing that ABFR on stencils can achieve significant reductions in error recovery cost (up to 400x) and recovery latency (up to 4x). Such reductions enable efficient execution at scale with high latent error rates.
%B International Conference on Parallel Processing (ICPP)
%I IEEE Computer Society Press
%C Bristol, UK
%8 2017-08
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (IJHPCA)
%D 2017
%T Resilient Co-Scheduling of Malleable Applications
%A Anne Benoit
%A Loïc Pottier
%A Yves Robert
%K co-scheduling
%K complexity results
%K heuristics
%K Redistribution
%K resilience
%K simulations
%X Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications and significantly degrade performance. In this article, we aim at minimizing the expected completion time of a set of co-scheduled applications. We propose to redistribute the resources assigned to each application upon the occurrence of failures, and upon the completion of some applications, in order to achieve this goal. First, we introduce a formal model and establish complexity results. The problem is NP-complete for malleable applications, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.
%B International Journal of High Performance Computing Applications (IJHPCA)
%8 2017-05
%G eng
%R 10.1177/1094342017704979

%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%1 01

%0 Conference Paper
%B IEEE International Conference on Big Data
%D 2017
%T Sampling Algorithms to Update Truncated SVD
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B IEEE International Conference on Big Data
%I IEEE
%C Boston, MA
%8 2017-12
%G eng

%0 Conference Paper
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%D 2017
%T Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Jack Dongarra
%X In this article, we present an autotuning approach applied to systematic performance engineering of the EM-ICP (Expectation-Maximization Iterative Closest Point) algorithm for the point set registration problem. We show how we were able to exceed the performance achieved by the reference code through multiple dependence transformations and an automated procedure of generating and evaluating numerous implementation variants. Furthermore, we also managed to exploit code transformations that are not that common during manual optimization but yielded better performance in our tests for the EM-ICP algorithm. Finally, we maintained high levels of performance in a portable fashion across a wide range of HPC hardware platforms including multicore, many-core, and GPU-based accelerators. More importantly, the results indicate a consistently high performance level and the ability to move the task of data analysis through point-set registration to any modern compute platform without the concern of inferior asymptotic efficiency.
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%I IEEE
%C Boston, MA
%8 2017-12
%G eng
%R https://doi.org/10.1109/BigData.2017.8258258

%0 Generic
%D 2017
%T Small Tensor Operations on Advanced Architectures for High-Order Applications
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%B University of Tennessee Computer Science Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-04
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2017
%T Solving Dense Symmetric Indefinite Systems using GPUs
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%X This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent synchronizations and irregular data accesses. Until recently, there has not been any implementation of these algorithms on a hybrid CPU/GPU architecture. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive data transfer and synchronization between the CPU and GPU, or on the GPU (e.g., factorizing the matrix entirely on the GPU or in a communication-avoiding fashion). We also study the performance of the solver using iterative refinements along with the factorization without pivoting combined with the preprocessing technique based on random butterfly transformations, or with the mixed-precision algorithm where the matrix is factorized in single precision. This randomization algorithm only has a probabilistic proof on the numerical stability, and for this paper, we only focused on the mixed-precision algorithm without pivoting. However, they demonstrate that we can obtain good performance on the GPU by avoiding the pivoting and using the lower precision arithmetics, respectively. As illustrated with the application in acoustics studied in this paper, in many practical cases, the matrices can be factorized without pivoting. Because the componentwise backward error computed in the iterative refinement signals when the algorithm failed to obtain the desired accuracy, the user can use these potentially unstable but efficient algorithms in most of the cases and fall back to a more stable algorithm with pivoting only in the case of the failure.
%B Concurrency and Computation: Practice and Experience
%V 29
%8 2017-03
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.4055/full
%N 9
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.4055

%0 Journal Article
%J IEEE Embedded Systems Letters
%D 2017
%T Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K Karush Kuhn Tucker (KKT)
%K Realtime embedded convex optimization solver
%X With the increasing sophistication in the use of optimization algorithms such as deep learning on embedded systems, the convex optimization solvers on embedded systems have found widespread use. This letter presents a novel linear solver technique to reduce the run-time of a convex optimization solver by using the property that some parameters are fixed during the solution iterations of a solve instance. Our experimental results show that the run-time can be reduced by two orders of magnitude.
%B IEEE Embedded Systems Letters
%V 9
%P 61–64
%8 2017-05
%G eng
%U http://ieeexplore.ieee.org/document/7917357/
%N 3
%R 10.1109/LES.2017.2700401

%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2017
%T Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A Jack Dongarra
%X With NVIDIA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091031

%0 Journal Article
%J IEEE Transactions on Computers
%D 2017
%T Towards Optimal Multi-Level Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K Dynamic programming
%K Error analysis
%K Heuristic algorithms
%K Optimized production technology
%K protocols
%K Shape
%B IEEE Transactions on Computers
%V 66
%P 1212–1226
%8 2017-07
%G eng
%N 7
%R 10.1109/TC.2016.2643660

%0 Conference Paper
%B EuroMPI
%D 2017
%T Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information
%A David Eberius
%A Thananon Patinyasakdikul
%A George Bosilca
%K MPI
%K Performance Counters
%K Profiling
%K Tools
%X This paper details the implementation and usage of software-based performance counters to understand the performance of a particular implementation of the MPI standard, Open MPI. Such counters can expose intrinsic features of the software stack that are not available otherwise in a generic and portable way. The PMPI interface is useful for instrumenting MPI applications at a user level; however, it is insufficient for providing meaningful internal MPI performance details. While the Peruse interface provides more detailed information on state changes within Open MPI, it has not seen widespread adoption. We introduce a simple low-level approach that instruments the Open MPI code at key locations to provide fine-grained MPI performance metrics. We evaluate the overhead associated with adding these counters to Open MPI as well as their use in determining bottlenecks and areas for improvement both in user code and the MPI implementation itself.
%B EuroMPI
%I ACM
%C Chicago, IL
%8 2017-09
%@ 978-1-4503-4849-2/17/09
%G eng
%U https://dl.acm.org/citation.cfm?id=3127024
%R https://doi.org/10.1145/3127024.3127039

%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%A Andres E. Thomas
%X In this work we present new kernels for the generation and application of block-Jacobi preconditioners that accelerate the iterative solution of sparse linear systems on graphics processing units (GPUs). Our approach departs from the conventional LU factorization and decomposes the diagonal blocks of the matrix using the Gauss-Huard method. When enhanced with column pivoting, this method is as stable as LU with partial/row pivoting. Due to extensive use of GPU registers and integration of implicit pivoting, our variable size batched Gauss-Huard implementation outperforms the batched version of LU factorization. In addition, the application kernel combines the conventional two-stage triangular solve procedure, consisting of a backward solve followed by a forward solve, into a single stage that performs both operations simultaneously.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%V 108
%P 1783-1792
%8 2017-06
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.186

%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 2017-08
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18

%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48

%0 Generic
%D 2016
%T 2016 Dense Linear Algebra Software Packages Survey
%A Jack Dongarra
%A Jim Demmel
%A Julien Langou
%A Julie Langou
%X The 2016 Dense Linear Algebra Software Packages Survey was administered from January 1st 2016 to April 12 2016. 234 respondents answered the survey. The survey was advertised directly to the Linear Algebra community via our LAPACK/ScaLAPACK forum and NA Digest, and we also directly contacted vendors and linear algebra experts. The breakdown of respondents was: 74% researchers or scientists, 25% were Principal Investigators and 25% Software maintainers or System administrators. The goal of the survey was to get the Linear Algebra community's opinion and provide input on dense linear algebra software packages, in particular LAPACK, ScaLAPACK, PLASMA and MAGMA. The ultimate purpose of the survey was to improve these libraries to benefit our user community. The survey would allow the team to prioritize the many possible improvements that could be made. We also asked for input from users accessing these libraries via 3rd party interfaces, for example MATLAB, Intel’s MKL, Python’s NumPy, AMD's ACML, and many others.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-09
%G eng

%0 Generic
%D 2016
%T Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Veselin Dobrev
%A Ian Karlin
%A Tzanio Kolev
%A Stanimire Tomov
%A Jack Dongarra
%I Smoky Mountains Computational Sciences and Engineering Conference (SMC16), Poster
%C Gatlinburg, TN
%8 2016-09
%G eng

%0 Journal Article
%J VECPAR
%D 2016
%T Accelerating the Conjugate Gradient Algorithm with GPU in CFD Simulations
%A Hartwig Anzt
%A Marc Baboulin
%A Jack Dongarra
%A Yvan Fournier
%A Frank Hulsemann
%A Amal Khabou
%A Yushan Wang
%X This paper illustrates how GPU computing can be used to accelerate computational fluid dynamics (CFD) simulations. For sparse linear systems arising from finite volume discretization, we evaluate and optimize the performance of Conjugate Gradient (CG) routines designed for manycore accelerators and compare against an industrial CPU-based implementation. We also investigate how the recent advances in preconditioning, such as iterative Incomplete Cholesky (IC, as symmetric case of ILU) preconditioning, match the requirements for solving real world problems.
%B VECPAR
%G eng
%U http://hgpu.org/?p=16264

%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2016
%T Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K checkpoint
%K fail-stop error
%K failure
%K HPC
%K resilience
%K silent data corruption
%K silent error
%K verification
%X In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). We determine in this scenario the optimal checkpointing and verification locations, as well as the optimal speed pairs. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
%B ACM Transactions on Parallel Computing
%8 2016-08
%G eng
%R 10.1145/2897189

%0 Journal Article
%J Parallel Computing
%D 2016
%T Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results
%A Julien Herrmann
%A George Bosilca
%A Thomas Herault
%A Loris Marchal
%A Yves Robert
%A Jack Dongarra
%K Data partition
%K linear algebra
%K parsec
%K QR factorization
%K Redistribution
%K Stencil
%X The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimize some objective for the algorithmic kernel under study (good computational balance or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the choice of a distribution minimizing the target objective is not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal, and the operation remains as efficient. This paper studies the complexity of this generalized problem. We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show the NP-hardness of the problem to find the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms is conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine.
%B Parallel Computing
%V 52
%P 22-41
%8 2016-02
%G eng
%R 10.1016/j.parco.2015.09.005

%0 Conference Proceedings
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2016
%T Batched Generation of Incomplete Sparse Approximate Inverses on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Thomas Huckle
%A Jack Dongarra
%X Incomplete Sparse Approximate Inverses (ISAI) have recently been shown to be an attractive alternative to exact sparse triangular solves in the context of incomplete factorization preconditioning. In this paper we propose a batched GPU-kernel for the efficient generation of ISAI matrices. Utilizing only thread-local memory allows for computing the ISAI matrix with very small memory footprint. We demonstrate that this strategy is faster than the existing strategy for generating ISAI matrices, and use a large number of test matrices to assess the algorithm's efficiency in an iterative solver setting.
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%S ScalA '16
%P 49–56
%8 2016-11
%@ 978-1-5090-5222-6
%G eng
%R 10.1109/ScalA.2016.11

%0 Generic
%D 2016
%T On block-asynchronous execution on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X This paper experimentally investigates how GPUs execute instructions when used for general purpose computing (GPGPU). We use a light-weight kernel realizing a vector operation to analyze which vector entries are updated subsequently, and identify regions where parallel execution can be expected. The results help us to understand how GPUs operate, and map this operation mode to the mathematical concept of asynchronism. In particular it helps to understand the effects that can occur when implementing a fixed-point method using in-place updates on GPU hardware.
%B LAPACK Working Note
%8 2016-11
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn291.pdf

%0 Generic
%D 2016
%T Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC16), Poster
%C San Jose, CA
%8 2016-04
%G eng

%0 Generic
%D 2016
%T Context Identifier Allocation in Open MPI
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%X The concept of communicators is a central notion in Message Passing Interface, allowing on one side the MPI implementation to specialize its matching and deliver messages in the right context, and on the other side the library developers to contextualize their message exchanges, and scope different algorithms to well-defined groups of processes. More precisely, all communication objects in MPI are derived from a communicator at some point. All MPI functions allowing the creation of new communicators have a collective meaning, either over the group of processes from the parent communicator or those from the target communicator. Thus, the performance of the communicator creation is tied to the collective communication performance, as well as the amount of data needed to be exchanged in order to consistently create this new communicator. We introduce several communicator creation algorithms, and present their implementation in the context of Open MPI. We explore the performance of these new algorithms and compare them with state-of-the-art algorithms available in other MPI implementations.
%B University of Tennessee Computer Science Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2016-01
%G eng

%0 Book Section
%B Lecture Notes in Computer Science
%D 2016
%T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%E Roman Wyrzykowski
%E Ewa Deelman
%E Konrad Karczewski
%E Jacek Kitowski
%E Kazimierz Wiatr
%K Communication-avoiding
%K Dense symmetric indefinite factorization
%K gpu computation
%K randomization
%X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen’s algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with the preprocessing technique based on Random Butterfly Transformations. Though such transformations only have probabilistic results on the numerical stability, they avoid the pivoting and obtain a great performance on the GPU.
%B Lecture Notes in Computer Science
%S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I
%I Springer International Publishing
%V 9573
%P 86-95
%8 2015-09
%@ 978-3-319-32149-3
%G eng
%& Parallel Processing and Applied Mathematics
%R 10.1007/978-3-319-32149-3_9

%0 Conference Paper
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%D 2016
%T On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K GPUs
%K variable small sizes
%X Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required. This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. According to the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8-core each) running Intel MKL library.
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Proceedings
%B Software for Exascale Computing - SPPEXA
%D 2016
%T Domain Overlap for Iterative Sparse Triangular Solves on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%E Hans-Joachim Bungartz
%E Philipp Neumann
%E Wolfgang E. Nagel
%X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
%B Software for Exascale Computing - SPPEXA
%S Lecture Notes in Computer Science and Engineering
%I Springer International Publishing
%V 113
%P 527–545
%8 2016-09
%G eng
%R 10.1007/978-3-319-40528-5_24

%0 Conference Proceedings
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2016
%T Efficiency of General Krylov Methods on GPUs – An Experimental Study
%A Hartwig Anzt
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K algorithmic bombardment
%K BiCGSTAB
%K CGS
%K Convergence
%K Electric breakdown
%K gpu
%K graphics processing units
%K Hardware
%K IDR(s)
%K Krylov solver
%K Libraries
%K linear systems
%K QMR
%K Sparse matrices
%X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods' performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are "orthogonal" in terms of problem suitability. We propose best practices for choosing methods in a "black box" scenario, where no information about the optimal solver is available.
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%P 683-691
%8 2016-05
%G eng
%R 10.1109/IPDPSW.2016.45

%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%D 2016
%T Efficiency of General Krylov Methods on GPUs – An Experimental Study
%A Hartwig Anzt
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K algorithmic bombardment
%K BiCGSTAB
%K CGS
%K gpu
%K IDR(s)
%K Krylov solver
%K QMR
%X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods’ performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are “orthogonal” in terms of problem suitability. We propose best practices for choosing methods in a “black box” scenario, where no information about the optimal solver is available.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%R 10.1109/IPDPSW.2016.45

%0 Conference Proceedings
%B Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
%D 2016
%T Failure Detection and Propagation in HPC Systems
%A George Bosilca
%A Aurelien Bouteiller
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Pierre Sens
%A Jack Dongarra
%K failure detection
%K fault-tolerance
%K MPI
%B Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
%I IEEE Press
%C Salt Lake City, Utah
%P 27:1-27:11
%8 2016-11
%@ 978-1-4673-8815-3
%G eng
%U http://dl.acm.org/citation.cfm?id=3014904.3014941

%0 Journal Article
%J Journal of Computational Science
%D 2016
%T Fine-grained Bit-Flip Protection for Relaxation Methods
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Orti
%K Bit flips
%K Fault tolerance
%K High Performance Computing
%K iterative solvers
%K Jacobi method
%K sparse linear systems
%X Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
%B Journal of Computational Science
%8 2016-11
%G eng
%R https://doi.org/10.1016/j.jocs.2016.11.013

%0 Conference Paper
%B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16)
%D 2016
%T GPU-Aware Non-contiguous Data Movement In Open MPI
%A Wei Wu
%A George Bosilca
%A Rolf vandeVaart
%A Sylvain Jeaugey
%A Jack Dongarra
%K datatype
%K gpu
%K hybrid architecture
%K MPI
%K non-contiguous data
%X Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance. To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
%B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16)
%I ACM
%C Kyoto, Japan
%8 2016-06
%G eng
%R http://dx.doi.org/10.1145/2907294.2907317

%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%K plasma
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2016
%T High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 30
%P 3-10
%8 2016-02
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342015593158
%N 1
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342015593158

%0 Generic
%D 2016
%T High Performance Realtime Convex Solver for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K KKT
%K Realtime embedded convex optimization solver
%X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique to reduce the run-time of decomposition of KKT matrix for the convex optimization solver for an embedded system, by two orders of magnitude. We use the property that although the KKT matrix changes, some of its block sub-matrices are fixed during the solution iterations and the associated solving instances.
%B University of Tennessee Computer Science Technical Report
%8 2016-10
%G eng

%0 Conference Paper
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)
%D 2016
%T High-performance Matrix-matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Joël Falcou
%A Jack Dongarra
%X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)
%I Springer International Publishing
%C Grenoble, France
%8 2016-08
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%K Applications
%K Batched linear algebra
%K FEM
%K gpu
%K Tensor contractions
%K Tensor HPC
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng

%0 Generic
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-01
%G eng

%0 Generic
%D 2016
%T The HPL Benchmark: Past, Present & Future
%A Jack Dongarra
%C ISC High Performance, Frankfurt, Germany
%8 2016-07
%G eng
%9 Conference Presentation

%0 Journal Article
%J Acta Numerica
%D 2016
%T Linear Algebra Software for Large-Scale Accelerated Multicore Computing
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
%B Acta Numerica
%V 25
%P 1-160
%8 2016-05
%G eng
%R 10.1017/S0962492916000015

%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC'16)
%D 2016
%T LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi
%A Azzam Haidar
%A Stanimire Tomov
%A Konstantin Arturov
%A Murat Guney
%A Shane Story
%A Jack Dongarra
%X A wide variety of heterogeneous compute resources, ranging from multicore CPUs to GPUs and coprocessors, are available to modern computers, making it challenging to design unified numerical libraries that efficiently and productively use all these varied resources. For example, in order to efficiently use Intel’s Knights Landing (KNL) processor, the next-generation of Xeon Phi architectures, one must design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance. We propose a productive and portable programming model that allows us to write a serial-looking code, which, however, achieves parallelism and scalability by using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and the parallel execution. This is done through multiple techniques ranging from multi-level data partitioning to adaptive task grain sizes, and dynamic task scheduling. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. Finally, we outline the strengths and the effectiveness of this approach – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate current work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B IEEE High Performance Extreme Computing Conference (HPEC'16)
%I IEEE
%C Waltham, MA
%8 2016-09
%G eng

%0 Generic
%D 2016
%T MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
%A Tingxing Dong
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Jack Dongarra
%X A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how batched GEMV and GEMM can assist batched advanced factorizations (e.g., bi-diagonalization) and other BLAS routines (e.g., triangular solve) to achieve optimal performance on GPUs. Our solutions achieved up to 2.8-3× speedups compared to CUBLAS and MKL solutions, wherever possible. We illustrated the batched methodology on a real-world Hydrodynamic application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5× speedup and a 1.4× greenup are obtained by changing 10% of the code. We accelerated and scaled it on the Titan supercomputer to 4096 nodes.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2016-08
%G eng

%0 Journal Article
%J National Science Review
%D 2016
%T A New Metric for Ranking High-Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%B National Science Review
%V 3
%P 30-35
%8 2016-01
%G eng
%N 1
%R 10.1093/nsr/nwv084

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Non-GPU-resident Dense Symmetric Indefinite Factorization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of the data movement into the memory: one needed for selecting and applying pivots and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance of such an algorithm when the pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry, which ensures the stability, may need to be selected as a pivot among the remaining diagonals, and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, the data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would require updating each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization compared with an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires swapping the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization by the symmetric factorization). To reduce the amount of the data transfer, in this paper we use the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require loading the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization. Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms.
%B Concurrency and Computation: Practice and Experience
%8 2016-11
%G eng
%R 10.1002/cpe.4012

%0 Conference Paper
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2016
%T Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors
%A Anne Benoit
%A Aurelien Cavelan
%A Yves Robert
%A Hongyang Sun
%K fail-stop errors
%K multilevel checkpoint
%K optimal pattern
%K resilience
%K silent errors
%K verification
%X This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.
%B 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%R 10.1109/IPDPS.2016.39

%0 Conference Paper
%B 2016 IEEE High Performance Extreme Computing Conference (HPEC'16)
%D 2016
%T Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations
%A Azzam Haidar
%A Benjamin Brock
%A Stanimire Tomov
%A Michael Guidry
%A Jay Jay Billings
%A Daniel Shyles
%A Jack Dongarra
%X We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms that efficiently solve N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, which we call a batched routine, in order to saturate the hardware with enough work.
%B 2016 IEEE High Performance Extreme Computing Conference (HPEC'16)
%I IEEE
%C Waltham, MA
%8 2016-09
%G eng

%0 Thesis
%B Department of Electrical Engineering and Computer Science
%D 2016
%T Performance Analysis and Modeling of Task-Based Runtimes
%A Blake Haugen
%X The shift toward multicore processors has transformed the software and hardware landscape in the last decade. As a result, software developers must adopt parallelism in order to efficiently make use of multicore CPUs. Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. Although task-based scheduling has been around for many years, the inclusion of task dependencies in OpenMP 4.0 suggests the paradigm will be around for the foreseeable future. While task-based schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. Additionally, they can create a challenge for users to analyze the performance of their software and tune algorithmic parameters accordingly. We will present the basic principles of task-based runtimes as well as two new tools developed to assist engineers developing these runtimes and users employing them to parallelize their workloads. The first is a tool allowing users to simulate the execution of their algorithm. The second is an extension to the common execution trace which includes information about task dependencies.
%B Department of Electrical Engineering and Computer Science
%I University of Tennessee
%C Knoxville
%V PhD
%8 2016-05
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2016
%T On the performance and energy efficiency of sparse linear algebra on GPUs
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.
%B International Journal of High Performance Computing Applications
%8 2016-10
%G eng
%U http://hpc.sagepub.com/content/early/2016/10/05/1094342016672081.abstract
%R 10.1177/1094342016672081

%0 Generic
%D 2016
%T Performance, Design, and Autotuning of Batched GEMM for GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K GEMM
%K GPU computing
%K HPC
%X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for a batch of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-02
%G eng

%0 Conference Paper
%B The International Supercomputing Conference (ISC High Performance 2016)
%D 2016
%T Performance, Design, and Autotuning of Batched GEMM for GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K GEMM
%K GPU computing
%K HPC
%X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
%B The International Supercomputing Conference (ISC High Performance 2016)
%C Frankfurt, Germany
%8 2016-06
%G eng

%0 Book Section
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings
%D 2016
%T Performance, Design, and Autotuning of Batched GEMM for GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%E Julian M. Kunkel
%E Pavan Balaji
%E Jack Dongarra
%X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings
%I Springer International Publishing
%P 21–38
%@ 978-3-319-41321-1
%G eng
%U http://dx.doi.org/10.1007/978-3-319-41321-1_2
%R 10.1007/978-3-319-41321-1_2

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
%A Ahmad Abdelfattah
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications.
%B Concurrency and Computation: Practice and Experience
%V 28
%P 3447-3465
%8 2016-05
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full
%N 12
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3874

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K Cholesky Factorization
%K GPUs
%K Tuning
%X Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually. This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng

%0 Journal Article
%J International Journal of Parallel Programming
%D 2016
%T Porting the PLASMA Numerical Library to the OpenMP Standard
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which allows PLASMA to transition from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such a transition.
%B International Journal of Parallel Programming
%8 2016-06
%G eng
%U http://link.springer.com/10.1007/s10766-016-0441-6
%U http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6
%U http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6.pdf
%U http://link.springer.com/article/10.1007/s10766-016-0441-6/fulltext.html
%! Int J Parallel Prog
%R 10.1007/s10766-016-0441-6

%0 Conference Proceedings
%B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany
%D 2016
%T Power Management and Event Verification in PAPI
%A Heike Jagode
%A Asim YarKhan
%A Anthony Danalis
%A Jack Dongarra
%X For more than a decade, the PAPI performance monitoring library has helped to implement the familiar maxim attributed to Lord Kelvin: “If you cannot measure it, you cannot improve it.” Widely deployed and widely used, PAPI provides a generic, portable interface for the hardware performance counters available on all modern CPUs and some other components of interest that are scattered across the chip and system. Recent and radical changes in processor and system design (systems that combine multicore CPUs and accelerators, shared and distributed memory, PCI-express and other interconnects), as well as the emergence of power efficiency as a primary design constraint, and reduced data movement as a primary programming goal, pose new challenges and bring new opportunities to PAPI. We discuss new developments of PAPI that allow for multiple sources of performance data to be measured simultaneously via a common software interface. Specifically, a new PAPI component that controls power is discussed. We explore the challenges of shared hardware counters that include system-wide measurements in existing multicore architectures. We conclude with an exploration of future directions for the PAPI interface.
%B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany
%I Springer International Publishing
%C Dresden, Germany
%P 41-51
%@ 978-3-319-39589-0
%G eng
%R 10.1007/978-3-319-39589-0_4

%0 Generic
%D 2016
%T Report on the Sunway TaihuLight System
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-06
%G eng
%U http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

%0 Journal Article
%J International Journal of Networking and Computing
%D 2016
%T Scheduling Computational Workflows on Failure-prone Platforms
%A Guillaume Aupy
%A Anne Benoit
%A Henri Casanova
%A Yves Robert
%K checkpointing
%K fault-tolerance
%K reliability
%K scheduling
%K workflow
%X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
%B International Journal of Networking and Computing
%V 6
%P 2-26
%G eng

%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Search Space Generation and Pruning System for Autotuners
%A Piotr Luszczek
%A Mark Gates
%A Jakub Kurzak
%A Anthony Danalis
%A Jack Dongarra
%X This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space, while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to a standard C code for fast and multithreaded, as necessary, evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by the user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix multiplication kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves application of 10 complex pruning constraints. The speed of evaluation is compared against generators created using imperative programming style in various scripting and compiled languages.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2016
%T Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X To orthonormalize a set of dense vectors, Singular Value QR (SVQR) requires only one global reduction between the parallel processing units, and uses BLAS-3 kernels to perform most of its local computation. As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many of the current computers. In this paper, we study the stability and performance of various SVQR implementations on multicore CPUs with a GPU, focusing on the dense triangular solve, which performs half of the total floating-point operations in SVQR. As a part of this study, we examine its adaptive mixed-precision variant that decides if a lower-precision arithmetic can be used for the triangular solution at runtime without increasing the order of its orthogonality error. Since the backward error of this adaptive mixed-precision variant is significantly greater than that of the standard SVQR, we study its effects on the solution convergence of several subspace projection methods for solving a linear system of equations and for computing singular values or eigenvalues of a sparse matrix. Our experimental results indicate that in some cases, the convergence rate of the solver may not be affected by the larger backward errors, while reducing the time to solution.
%B ACM Transactions on Mathematical Software (TOMS)
%V 43
%8 2016-10
%G eng
%N 2

%0 Generic
%D 2016
%T A Standard for Batched BLAS Routines
%A Pedro Valero-Lara
%A Jack Dongarra
%A Azzam Haidar
%A Samuel D. Relton
%A Stanimire Tomov
%A Mawussi Zounon
%I 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16)
%C Paris, France
%8 2016-04
%G eng

%0 Journal Article
%J National Science Review
%D 2016
%T Sunway TaihuLight Supercomputer Makes Its Appearance
%A Jack Dongarra
%B National Science Review
%V 3
%P 256-266
%8 2016-09
%G eng
%N 3
%R https://doi.org/10.1093/nsr/nww044

%0 Conference Proceedings
%B OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments
%D 2016
%T Surviving Errors with OpenSHMEM
%A Aurelien Bouteiller
%A George Bosilca
%A Manjunath Gorentla Venkata
%E Manjunath Gorentla Venkata
%E Imam, Neena
%E Pophale, Swaroop
%E Mintz, Tiffany M.
%X Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, or program errors. As the scale of HPC systems continues to grow, so does the probability of encountering a condition that causes a failure; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are beginning to feature advanced error management. As a result of these developments, it becomes increasingly desirable to gracefully handle error conditions in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that can (1) notify user code of unexpected erroneous conditions, (2) permit customized user response to errors without incurring overhead on an error-free execution path, (3) propagate the occurrence of an error condition to all Processing Elements, and (4) consistently close the erroneous epoch in order to resume the application.
%B OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments
%I Springer International Publishing
%C Baltimore, MD, USA
%P 66–81
%@ 978-3-319-50995-2
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)
%D 2016
%T Towards Achieving Performance Portability Using Directives for Accelerators
%A M. Graham Lopez
%A Larrea, V
%A Joubert, W
%A Hernandez, O
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including X86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, as well as an X86_64 host system with Intel Xeon Phi coprocessors. In this paper, we explain what factors affected the performance portability such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target multiple platforms.
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)
%I Innovative Computing Laboratory, University of Tennessee
%C Salt Lake City, Utah
%8 2016-11
%G eng

%0 Journal Article
%J Numerical Algorithms
%D 2016
%T Updating Incomplete Factorization Preconditioners for Model Order Reduction
%A Hartwig Anzt
%A Edmond Chow
%A Jens Saak
%A Jack Dongarra
%K key publication
%X When solving a sequence of related linear systems by iterative methods, it is common to reuse the preconditioner for several systems, and then to recompute the preconditioner when the matrix has changed significantly. Rather than recomputing the preconditioner from scratch, it is potentially more efficient to update the previous preconditioner. Unfortunately, it is not always known how to update a preconditioner, for example, when the preconditioner is an incomplete factorization. A recently proposed iterative algorithm for computing incomplete factorizations, however, is able to exploit an initial guess, unlike existing algorithms for incomplete factorizations. By treating a previous factorization as an initial guess to this algorithm, an incomplete factorization may thus be updated. We use a sequence of problems from model order reduction. Experimental results using an optimized GPU implementation show that updating a previous factorization can be inexpensive and effective, making solving sequences of linear systems a potential niche problem for the iterative incomplete factorization algorithm.
%B Numerical Algorithms
%V 73
%P 611–630
%8 2016-02
%G eng
%N 3
%R 10.1007/s11075-016-0110-2

%0 Conference Paper
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%D 2015
%T Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs
%A Mark Gates
%A Hartwig Anzt
%A Jakub Kurzak
%A Jack Dongarra
%X In this paper we accelerate the Alternating Least Squares (ALS) algorithm used for generating product recommendations on the basis of implicit feedback datasets. We approach the algorithm with concepts proven to be successful in High Performance Computing. This includes the formulation of the algorithm as a mix of cache-optimized algorithm-specific kernels and standard BLAS routines, acceleration via graphics processing units (GPUs), use of parallel batched kernels, and autotuning to identify performance winners. For benchmark datasets, the multi-threaded CPU implementation we propose achieves more than a 10 times speedup over the implementations available in the GraphLab and Spark MLlib software packages. For the GPU implementation, the parameters of an algorithm-specific kernel were optimized using a comprehensive autotuning sweep. This results in an additional 2 times speedup over our CPU implementation.
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%I IEEE
%C Santa Clara, CA
%8 2015-11
%G eng

%0 Conference Paper
%B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015)
%D 2015
%T Accelerating NWChem Coupled Cluster through dataflow-based Execution
%A Heike Jagode
%A Anthony Danalis
%A George Bosilca
%A Jack Dongarra
%K CCSD
%K dag
%K dataflow
%K NWChem
%K parsec
%K ptg
%K tasks
%X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelised in coarse chunks. In this paper, we present our effort of converting NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller) – a software package designed to enable high performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer grained tasks (compared to the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation.
%B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015)
%I Springer International Publishing
%C Krakow, Poland
%8 2015-09
%G eng

%0 Conference Paper
%B Spring Simulation Multi-Conference 2015 (SpringSim'15)
%D 2015
%T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver, the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modified sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously accelerates the computation of a set of consecutive SpMVs significantly. Comparing the performance against similar routines from Intel's MKL and NVIDIA's cuSPARSE library we identify appealing performance improvements. We integrate it into the highly optimized LOBPCG implementation. Compared to the BLOPEX CPU implementation running on two eight-core Intel Xeon E5-2690s, we accelerate the computation of a small set of eigenvectors using NVIDIA's K40 GPU by typically more than an order of magnitude.
%B Spring Simulation Multi-Conference 2015 (SpringSim'15)
%I SCS
%C Alexandria, VA
%8 2015-04
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction
%A Hartwig Anzt
%A William Sawyer
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%B International Journal of High Performance Computing Applications
%G eng

%0 Conference Paper
%B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15)
%D 2015
%T Adaptive Precision Solvers for Sparse Linear Systems
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Orti
%B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2015
%T Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Peng Du
%A Jack Dongarra
%E Phillip B. Gibbons
%K ABFT
%K algorithms
%K fault-tolerance
%K High Performance Computing
%K linear algebra
%X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely to the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. Applicability to tolerate multiple failures and accuracy after multiple recovery is also considered.
%B ACM Transactions on Parallel Computing
%V 1
%P 10:1-10:28
%8 2015-01
%G eng
%N 2
%R 10.1145/2686892

%0 Conference Paper
%B International Supercomputing Conference (ISC 2015)
%D 2015
%T Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs
%A Edmond Chow
%A Hartwig Anzt
%A Jack Dongarra
%B International Supercomputing Conference (ISC 2015)
%C Frankfurt, Germany
%8 2015-07
%G eng

%0 Conference Paper
%B EuroMPI/Asia 2015 Workshop
%D 2015
%T Batched Matrix Computations on Hardware Accelerators
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared to the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU.
%B EuroMPI/Asia 2015 Workshop
%C Bordeaux, France
%8 2015-09
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Batched Matrix Computations on Hardware Accelerators Based on GPUs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Batched matrix computations on hardware accelerators based on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU.
%B International Journal of High Performance Computing Applications
%8 2015-02
%G eng
%R 10.1177/1094342014567546

%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%D 2015
%T Cholesky Across Accelerators
%A Asim YarKhan
%A Azzam Haidar
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%I IEEE
%C Elizabeth, NJ
%8 2015-08
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra
%A Mark Gates
%A Stanimire Tomov
%A Azzam Haidar
%X Accelerating dense linear algebra using GPUs admits two models: hybrid CPU-GPU and GPU-only. The hybrid model factors the panel on the CPU while updating the trailing matrix on the GPU, concentrating the GPU on high-performance matrix multiplies. The GPU-only model performs the entire computation on the GPU, avoiding costly data transfers to the CPU. We compare these two approaches for three QR-based algorithms: QR factorization, rank revealing QR, and reduction to Hessenberg.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Journal Article
%J International Journal of Networking and Computing
%D 2015
%T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K model
%K performance evaluation
%K resilience
%X Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol, that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
%B International Journal of Networking and Computing
%V 5
%P 2-15
%8 2015-01
%G eng

%0 Journal Article
%J Scientific Programming
%D 2015
%T Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X Low-rank matrices arise in many scientific and engineering computations. Both computational and storage costs of manipulating such matrices may be reduced by taking advantage of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently developed software StruMF which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%-50% using the GPU.
%B Scientific Programming
%G eng

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T A Data Flow Divide and Conquer Algorithm for Multicore Architecture
%A Azzam Haidar
%A Jakub Kurzak
%A Gregoire Pichon
%A Mathieu Faverge
%K Eigensolver
%K lapack
%K Multicore
%K plasma
%K task-based programming
%X Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-elements computation for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to adjust task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Generic
%D 2015
%T On the Design, Autotuning, and Optimization of GPU Kernels for Kinetic Network Simulations Using Fast Explicit Integration and GPU Batched Computation
%A Michael Guidry
%A Azzam Haidar
%I Joint Institute for Computational Sciences Seminar Series, Presentation
%C Oak Ridge, TN
%8 2015-09
%G eng

%0 Conference Paper
%B ISC High Performance 2015
%D 2015
%T On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X The dramatic change in computer architecture due to the manycore paradigm shift has made the development of optimal numerical routines extremely challenging. In this work, we target the development of numerical algorithms and implementations for Xeon Phi coprocessor architecture designs. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are some of the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach on how to address the challenges for this problem, starting from our algorithm design, performance analysis and programming model, to kernel optimization. Our goal, by targeting low-level, easy to understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere for use on manycore coprocessors, and to show significant performance improvements compared to the existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we finally present the significance of using these routines/strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD) that by themselves are foundational for many important applications.
%B ISC High Performance 2015
%C Frankfurt, Germany
%8 2015-07
%G eng

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T Design for a Soft Error Resilient Dynamic Task-based Runtime
%A Chongxiao Cao
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired feature of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Journal Article
%J International Journal on High Performance Computing Applications
%D 2015
%T Efficient Checkpoint/Verification Patterns
%A Anne Benoit
%A Saurabh K. Raina
%A Yves Robert
%K checkpointing
%K Fault tolerance
%K High Performance Computing
%K silent data corruption
%K silent error
%K verification
%X Errors have become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large compared to the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B International Journal on High Performance Computing Applications
%8 2015-07
%G eng
%R 10.1177/1094342015594531

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Efficient Eigensolver Algorithms on Accelerator Based Architectures
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges, starting from our algorithm design, kernel optimization and tuning, to our programming model, in the development of a scalable high-performance symmetric eigenvalue and singular value solver.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems
%A Raffaele Solcà
%A Anton Kozhevnikov
%A Azzam Haidar
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%X We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multicore CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multicore CPU only systems for such complex applications.
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15)
%D 2015
%T Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a SpM-matrix product (SpMM), that achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
%B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15)
%I ACM
%C San Francisco, CA
%8 2015-02
%@ 978-1-4503-3404-4
%G eng
%R 10.1145/2712386.2712387

%0 Generic
%D 2015
%T Exascale Computing and Big Data
%A Dan Reed
%A Jack Dongarra
%X Scientific discovery and engineering innovation require unifying traditionally separated high-performance computing and big data analytics.
%B Communications of the ACM
%I ACM
%V 58
%P 56-68
%8 2015-07
%G eng
%9 Magazine Article
%R 10.1145/2699414

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K Autotuning
%K energy efficiency
%K hardware accelerators
%K matrix multiplication
%K power
%X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 5096-5113
%8 2015-12
%G eng
%U http://doi.wiley.com/10.1002/cpe.3516
%N 17
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3516

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T Experiences in autotuning matrix multiplication for energy minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 27
%P 5096-5113
%8 2015-12
%G eng
%N 17
%R 10.1002/cpe.3516

%0 Generic
%D 2015
%T Fault Tolerance Techniques for High-performance Computing
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%X This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de-facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption).
%B University of Tennessee Computer Science Technical Report (also LAWN 289)
%I University of Tennessee
%8 2015-05
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn289.pdf

%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications
%D 2015
%T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization
%A Azzam Haidar
%A Asim YarKhan
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with a degree of parallelism between that of GPUs and that of multicore-CPUs. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriately sized workload-grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.
%B 17th IEEE International Conference on High Performance Computing and Communications
%C Newark, NJ
%8 2015-08
%G eng

%0 Conference Paper
%B ISC High Performance
%D 2015
%T Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations
%A Azzam Haidar
%A Tingxing Dong
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%B ISC High Performance
%I Springer
%C Frankfurt, Germany
%8 2015-07
%G eng

%0 Conference Proceedings
%B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies
%D 2015
%T From MPI to OpenSHMEM: Porting LAMMPS
%A Tang, Chunyan
%A Aurelien Bouteiller
%A Thomas Herault
%A Manjunath Gorentla Venkata
%A George Bosilca
%E Manjunath Gorentla Venkata
%E Shamis, Pavel
%E Imam, Neena
%E M. Graham Lopez
%X This work details the opportunities and challenges of porting a Petascale, MPI-based application, LAMMPS, to OpenSHMEM. We investigate the major programming challenges stemming from the differences in communication semantics, address space organization, and synchronization operations between the two programming models. This work provides several approaches to solve those challenges for representative communication patterns in LAMMPS, e.g., by considering group synchronization, peer's buffer status tracking, and unpacked direct transfer of scattered data. The performance of LAMMPS is evaluated on the Titan HPC system at ORNL. The OpenSHMEM implementations are compared with MPI versions in terms of both strong and weak scaling. The results outline that OpenSHMEM provides a rich semantic to implement scalable scientific applications. In addition, the experiments demonstrate that OpenSHMEM can compete with, and often improve on, the optimized MPI implementation.
%B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies
%I Springer International Publishing
%C Annapolis, MD, USA
%P 121–137
%@ 978-3-319-26428-8
%G eng
%R 10.1007/978-3-319-26428-8_8

%0 Conference Paper
%B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing
%D 2015
%T GPU-accelerated Co-design of Induced Dimension Reduction: Algorithmic Fusion and Kernel Overlap
%A Hartwig Anzt
%A Eduardo Ponce
%A Gregory D. Peterson
%A Jack Dongarra
%X In this paper we present an optimized GPU co-design of the Induced Dimension Reduction (IDR) algorithm for solving linear systems. Starting from a baseline implementation based on the generic BLAS routines from the MAGMA software library, we apply optimizations that are based on kernel fusion and kernel overlap. Runtime experiments are used to investigate the benefit of the distinct optimization techniques for different variants of the IDR algorithm. A comparison to the reference implementation reveals that the interplay between them can succeed in cutting the overall runtime by up to about one third.
%B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T Hierarchical DAG scheduling for Hybrid Distributed Systems
%A Wei Wu
%A Aurelien Bouteiller
%A George Bosilca
%A Mathieu Faverge
%A Jack Dongarra
%K dense linear algebra
%K gpu
%K heterogeneous architecture
%K PaRSEC runtime
%X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle mapping the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e. Cholesky and QR, a carefully written algorithm, supported by an adaptive tasks-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Book Section
%B The Princeton Companion to Applied Mathematics
%D 2015
%T High-Performance Computing
%A Jack Dongarra
%A Nicholas J. Higham
%A Mark R. Dennis
%A Paul Glendinning
%A Paul A. Martin
%A Fadil Santosa
%A Jared Tanner
%B The Princeton Companion to Applied Mathematics
%I Princeton University Press
%C Princeton, New Jersey
%P 839-842
%@ 9781400874477
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2015
%T High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B The International Journal of High Performance Computing Applications
%G eng
%R 10.1177/1094342015593158

%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B Scientific Programming
%V 23
%8 2015-01
%G eng
%N 1
%R 10.3233/SPR-140404

%0 Generic
%D 2015
%T HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2015-01
%G eng
%U http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2015
%T Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs
%A Jakub Kurzak
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%8 2015-11
%G eng

%0 Conference Paper
%B EuroPar 2015
%D 2015
%T Iterative Sparse Triangular Solves for Preconditioning
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X Sparse triangular solvers are typically parallelized using level scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations.
%B EuroPar 2015
%I Springer Berlin
%C Vienna, Austria
%8 2015-08
%G eng
%U http://dx.doi.org/10.1007/978-3-662-48096-0_50
%R 10.1007/978-3-662-48096-0_50

%0 Generic
%D 2015
%T Linear Algebra Software for High-Performance Computing (Part 2: Software for Hardware Accelerators and Coprocessors)
%A Stanimire Tomov
%I ISC High Performance (ISC18), Tutorial Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng

%0 Conference Paper
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award)
%D 2015
%T MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
%A Azzam Haidar
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%X Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constraints for low power consumption and small sizes, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded system aware algorithms that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, e.g., featuring 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries.
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award)
%I IEEE
%C Waltham, MA
%8 2015-09
%G eng

%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng

%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2015
%T Mixed-precision Block Gram Schmidt Orthogonalization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%A Jesse Barlow
%X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which could significantly increase its computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as a hybrid CPU/GPU cluster, demonstrate that compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7:1 while maintaining about the same order of the numerical errors.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2015
%T Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many of the current computer architectures, where the communication is becoming increasingly expensive compared to the arithmetic operations. This is especially true when the input matrix is tall-skinny. Unfortunately, the orthogonality error of CholQR depends quadratically on the condition number of the input matrix, and it is numerically unstable when the matrix is ill-conditioned. To enhance the stability of CholQR, we recently used mixed-precision arithmetic; the input and output matrices are in the working precision, but some of its intermediate results are accumulated in the doubled precision. In this paper, we analyze the numerical properties of this mixed-precision CholQR. Our analysis shows that by selectively using the doubled precision, the orthogonality error of the mixed-precision CholQR only depends linearly on the condition number of the input matrix. We provide numerical results to demonstrate the improved numerical stability of the mixed-precision CholQR in practice. We then study its performance. When the target hardware does not support the desired higher precision, software emulation is needed. For example, using software-emulated double-double precision for the working 64-bit double precision, the mixed-precision CholQR requires about 8.5x more floating-point instructions than that required by the standard CholQR. On the other hand, the increase in the communication cost using the double-double precision is less significant, and our performance results on multicore CPU with a different graphics processing unit (GPU) demonstrate that the overhead of using the double-double arithmetic is decreasing on a newer architecture, where the computation is becoming less expensive compared to the communication. As a result, with a latest NVIDIA GPU, the mixed-precision CholQR was only 1.4x slower than the standard CholQR. Finally, we present case studies of using the mixed-precision CholQR within communication-avoiding variants of Krylov subspace projection methods for solving a nonsymmetric linear system of equations and for solving a symmetric eigenvalue problem, on a multicore CPU with multiple GPUs. These case studies demonstrate that by using the higher precision for this small but critical segment of the Krylov methods, we can improve not only the overall numerical stability of the solvers but also, in some cases, their performance.
%B SIAM Journal on Scientific Computing
%V 37
%P C307-C330
%8 2015-05
%G eng
%R 10.1137/14M0973773

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Mixed-precision orthogonalization process: Performance on multicore CPUs with GPUs
%A Ichitaro Yamazaki
%A Jesse Barlow
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2015
%T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%K lu factorization
%K Numerical algorithms
%K QR factorization
%K Stability
%K Performance
%X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap, in terms of floating-point operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performance.
%B Journal of Parallel and Distributed Computing
%V 85
%P 32-46
%8 2015-11
%G eng
%R 10.1016/j.jpdc.2015.06.007

%0 Conference Paper
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8)
%D 2015
%T Optimization for Performance and Energy for Batched Matrix Computations on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5x speedup on the K40 GPU.
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8)
%I ACM
%C San Francisco, CA
%8 2015-02
%G eng
%R 10.1145/2716282.2716288

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K plasma
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 2015-10
%G eng
%R 10.14529/jsfi1504

%0 Conference Paper
%B 2015 IEEE International Conference on Cluster Computing
%D 2015
%T PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution
%A Anthony Danalis
%A Heike Jagode
%A George Bosilca
%A Jack Dongarra
%K dag
%K parsec
%K ptg
%K tasks
%X Task-based execution has been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. The Parallel Runtime Scheduling and Execution Control (PARSEC) framework is a task-based runtime system that we designed to achieve high performance computing at scale. PARSEC offers a programming paradigm that is different than what has been traditionally used to develop large scale parallel scientific applications. In this paper, we discuss the use of PARSEC to convert a part of the Coupled Cluster (CC) component of the Quantum Chemistry package NWCHEM into a task-based form. We explain how we organized the computation of the CC methods in individual tasks with explicitly defined data dependencies between them and re-integrated the modified code into NWCHEM. We present a thorough performance evaluation and demonstrate that the modified code outperforms the original by more than a factor of two. We also compare the performance of different variants of the modified code and explain the different behaviors that lead to the differences in performance.
%B 2015 IEEE International Conference on Cluster Computing
%I IEEE
%C Chicago, IL
%8 2015-09
%G eng

%0 Conference Paper
%B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award
%D 2015
%T Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Eigenvalues problem
%K Hessenberg reduction
%K Multi/Many-core
%K Stabilized Elementary Transformations
%X The solution of nonsymmetric eigenvalue problems, Ax = λx, can be accelerated substantially by first reducing A to an upper Hessenberg matrix H that has the same eigenvalues as A. This can be done using Householder orthogonal transformations, which is a well established standard, or stabilized elementary transformations. The latter approach, although having half the flops of the former, has been used less in practice, e.g., on computer architectures with well developed hierarchical memories, because of its memory-bound operations and the complexity in stabilizing it. In this paper we revisit the stabilized elementary transformations approach in the context of new architectures – both multicore CPUs and Xeon Phi coprocessors. We derive for the first time a blocked version of the algorithm. The blocked version reduces the memory-bound operations and we analyze its performance. A performance model is developed that shows the limitations of both approaches. The competitiveness of using stabilized elementary transformations has been quantified, highlighting that it can be 20 to 30% faster on current high-end multicore CPUs and Xeon Phi coprocessors.
%B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award
%C Alexandria, VA
%8 2015-04
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2015)
%D 2015
%T Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%B International Conference on Computational Science (ICCS 2015)
%C Reykjavík, Iceland
%8 2015-06
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
%A Theo Mary
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B 22nd European MPI Users' Group Meeting
%D 2015
%T Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery
%A Aurelien Bouteiller
%A George Bosilca
%A Jack Dongarra
%X Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation of the Revoke MPI operation. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods.
%B 22nd European MPI Users' Group Meeting
%I ACM
%C Bordeaux, France
%8 2015-09
%G eng
%R 10.1145/2802658.2802668

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems
%A Thomas Herault
%A Aurelien Bouteiller
%A George Bosilca
%A Marc Gamell
%A Keita Teranishi
%A Manish Parashar
%A Jack Dongarra
%X The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault tolerant scientific applications.
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Generic
%D 2015
%T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof
%A Thomas Herault
%A Aurelien Bouteiller
%A George Bosilca
%A Marc Gamell
%A Keita Teranishi
%A Manish Parashar
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%8 2015-04
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Random-Order Alternating Schwarz for Sparse Triangular Solves
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%X Block-asynchronous Jacobi is an iteration method where a locally synchronous iteration is embedded in an asynchronous global iteration. The unknowns are partitioned into small subsets, and while the components within the same subset are iterated in Jacobi fashion, no update order between the subsets is enforced. The values of the non-local entries remain constant during the local iterations, which can result in slow inter-subset information propagation and slow convergence. Interpreting the subsets as subdomains allows the concept of domain overlap, which typically enhances information propagation, to be transferred to block-asynchronous solvers. In this talk we explore the impact of overlapping domains on the convergence and performance of block-asynchronous Jacobi iterations, and present results obtained by running this solver class on state-of-the-art HPC systems.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems
%A Fengguang Song
%A Jack Dongarra
%K dense linear algebra
%K distributed dataflow scheduling
%K heterogeneous HPC systems
%K runtime systems
%X Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 3702-3723
%8 2015-09
%G eng
%N 14
%R 10.1002/cpe.3403

%0 Generic
%D 2015
%T Scheduling for fault-tolerance: an introduction
%A Guillaume Aupy
%A Yves Robert
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2015-01
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K plasma
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 2015-04
%G eng
%N 5
%R 10.1002/cpe.3306

%0 Journal Article
%J IEEE Computer
%D 2015
%T The TOP500 List and Progress in High-Performance Computing
%A Erich Strohmaier
%A Hans Meuer
%A Jack Dongarra
%A Horst D. Simon
%K application performance
%K Benchmark testing
%K benchmarks
%K Computer architecture
%K High Performance Computing
%K High-performance computing
%K Linpack
%K Market research
%K Parallel computing
%K Program processors
%K Scientific computing
%K Supercomputers
%K top500
%X For more than two decades, the TOP500 list has enjoyed incredible success as a metric for supercomputing performance and as a source of data for identifying technological trends. The project's editors reflect on its usefulness and limitations for guiding large-scale scientific computing into the exascale era.
%B IEEE Computer
%V 48
%P 42-49
%8 2015-11
%G eng
%N 11
%R 10.1109/MC.2015.338

%0 Generic
%D 2015
%T Towards a High-Performance Tensor Algebra Package for Accelerators
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I Smoky Mountains Computational Sciences and Engineering Conference (SMC15)
%C Gatlinburg, TN
%8 2015-09
%G eng

%0 Conference Paper
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015
%D 2015
%T Towards Batched Linear Solvers on Accelerated Hardware Platforms
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU’s symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to a batched LU factorization featured in NVIDIA’s CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU.
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015
%I ACM
%C San Francisco, CA
%8 2015-02
%G eng

%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15)
%D 2015
%T Tuning Stationary Iterative Solvers for Fault Resilience
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Orti
%X As the transistor’s feature size decreases following Moore’s Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications, and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error resilient iterative solvers for sparse linear systems. Contrary to most previous approaches, based on Krylov subspace methods, for this purpose we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Proceedings
%B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
%D 2015
%T UCX: An Open Source Framework for HPC Network APIs and Beyond
%A P. Shamis
%A Manjunath Gorentla Venkata
%A M. Graham Lopez
%A M. B. Baker
%A O. Hernandez
%A Y. Itigin
%A M. Dubman
%A G. Shainer
%A R. L. Graham
%A L. Liss
%A Y. Shahar
%A S. Potluri
%A D. Rossetti
%A D. Becker
%A D. Poole
%A C. Lamb
%A S. Kumar
%A C. Stunkel
%A George Bosilca
%A Aurelien Bouteiller
%K application program interfaces
%K Bandwidth
%K Electronics packaging
%K Hardware
%K high throughput computing
%K highly-scalable network stack
%K HPC
%K HPC network APIs
%K I/O bound applications
%K Infiniband
%K input-output programs
%K Libraries
%K Memory management
%K message passing
%K message passing interface
%K Middleware
%K MPI
%K open source framework
%K OpenSHMEM
%K parallel programming
%K parallel programming models
%K partitioned global address space languages
%K PGAS
%K PGAS languages
%K Programming
%K protocols
%K public domain software
%K RDMA
%K system libraries
%K task-based paradigms
%K UCX
%K Unified Communication X
%X This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype is very close to that of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 us, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any network stack (publicly known) on this hardware.
%B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
%I IEEE
%C Santa Clara, CA, USA
%P 40-43
%8 2015-08
%@ 978-1-4673-9160-3
%G eng
%M 15573048
%R 10.1109/HOTI.2015.13

%0 Conference Paper
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%D 2015
%T Visualizing Execution Traces with Task Dependencies
%A Blake Haugen
%A Stephen Richmond
%A Jakub Kurzak
%A Chad A. Steed
%A Jack Dongarra
%X Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule the task in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads. This paper presents a brief description of task-based schedulers and the performance data they produce. It will then describe an interactive extension to the current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler.
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Proceedings
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%D 2015
%T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
%A Azzam Haidar
%A Yulu Jia
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%A Jack Dongarra
%K dataflow scheduling
%K hardware accelerators
%K multi-grain parallelism
%X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%I ACM
%C Austin, TX
%V No. 5
%8 2015-11
%G eng

%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem
%A Mark Gates
%A Azzam Haidar
%A Jack Dongarra
%X In the nonsymmetric eigenvalue problem, work has focused on the Hessenberg reduction and QR iteration, using efficient algorithms and fast, Level 3 BLAS routines. Comparatively, computation of eigenvectors performs poorly, limited to slow, Level 2 BLAS performance with little speedup on multi-core systems. It has thus become a dominant cost in the eigenvalue problem. To address this, we present improvements for the eigenvector computation to use Level 3 BLAS where applicable and parallelize the remaining triangular solves, achieving good parallel scaling and accelerating the overall eigenvalue problem more than three-fold.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng

%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1

%0 Generic
%D 2014
%T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone for not only block-Krylov, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM vs. SpMV-based algorithms is up to six on GPUs like NVIDIA’s K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly-optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5x faster on a K40 GPU. For comparison to other computational loads, the same GPU to CPU performance acceleration is observed for the SpMV product, as well as dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5x. In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2014-10
%G eng

%0 Conference Paper
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%D 2014
%T Access-averse Framework for Computing Low-rank Matrix Approximations
%A Ichitaro Yamazaki
%A Theo Mary
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%C Washington, DC
%8 2014-10
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%K factorization
%K parallel linear algebra
%K plasma
%K recursion
%K shared memory synchronization
%K threaded parallelism
%X The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS.
%B Concurrency and Computation: Practice and Experience
%V 26
%P 1408-1431
%8 2014-05
%G eng
%U http://doi.wiley.com/10.1002/cpe.3110
%N 7
%! Concurrency Computat.: Pract. Exper.
%& 1408
%R 10.1002/cpe.3110

%0 Journal Article
%J VMWare Technical Journal
%D 2014
%T Analyzing PAPI Performance on Virtual Machines
%A John Nelson
%X Performance Application Programming Interface (PAPI) aims to provide a consistent interface for measuring performance events using the performance counter hardware available on the CPU as well as available software performance events and off-chip hardware. Without PAPI, a user may be forced to search through specific processor documentation to discover the name of processor performance events. These names can change from model to model and vendor to vendor. PAPI simplifies this process by providing a consistent interface and a set of processor-agnostic preset events. Software engineers can use data collected through source-code instrumentation using the PAPI interface to examine the relation between software performance and performance events. PAPI can also be used within many high-level performance-monitoring utilities such as TAU, Vampir, and Score-P. VMware® ESXi™ and KVM have both added support within the last year for virtualizing performance counters. This article compares results measuring the performance of five real-world applications included in the Mantevo Benchmarking Suite in a VMware virtual machine, a KVM virtual machine, and on bare metal. By examining these results, it will be shown that PAPI provides accurate performance counts in a virtual machine environment.
%B VMWare Technical Journal
%V Winter 2013
%8 2014-01
%G eng
%U https://labs.vmware.com/vmtj/analyzing-papi-performance-on-virtual-machines

%0 Conference Paper
%B Euro-Par 2014
%D 2014
%T Assembly Operations for Multicore Architectures using Task-Based Runtime Systems
%A Damien Genet
%A Abdou Guermouche
%A George Bosilca
%X Traditionally, numerical simulations based on finite element methods consider the algorithm as being divided into three major steps: the generation of a set of blocks and vectors, the assembly of these blocks in a matrix and a big vector, and the inversion of the matrix. In this paper we tackle the second step, the block assembly, where no parallel algorithm is widely available. Several strategies are proposed to decompose the assembly problem while relying on a scheduling middleware to maximize the overlap between stages and increase the parallelism and thus the performance. These strategies are quantified using examples covering two extremes in the field: a large number of non-overlapping small blocks for CFD-like problems, and a smaller number of larger blocks with significant overlap, which can be met in sparse linear algebra solvers.
%B Euro-Par 2014
%I Springer International Publishing
%C Porto, Portugal
%8 2014-08
%G eng

%0 Conference Paper
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
%D 2014
%T Assessing the Impact of ABFT and Checkpoint Composite Strategies
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K resilience
%X Algorithm-specific fault-tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data needed to be checkpointed.
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Conference Paper
%B International Workshop on OpenCL
%D 2014
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B International Workshop on OpenCL
%C Bristol University, England
%8 2014-05
%G eng

%0 Journal Article
%J SIAM Journal on Matrix Analysis and Applications
%D 2014
%T Communication-Avoiding Symmetric-Indefinite Factorization
%A Grey Ballard
%A Dulceneia Becker
%A James Demmel
%A Jack Dongarra
%A Alex Druinsky
%A I Peled
%A Oded Schwartz
%A Sivan Toledo
%A Ichitaro Yamazaki
%K plasma
%X We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen’s triangular tridiagonalization. It factors a dense symmetric matrix A as the product $A = P L T L^T P^T$, where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.
%B SIAM Journal on Matrix Analysis and Applications
%V 35
%P 1364-1406
%8 2014-07
%G eng
%N 4

%0 Conference Paper
%B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS)
%D 2014
%T Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems
%A Marc Baboulin
%A Jack Dongarra
%A Remi Lacroix
%X This paper presents an efficient computation for least squares conditioning or estimates of it. We present performance results using new routines on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix for which, to our knowledge, there is no implementation in the current public domain libraries LAPACK and ScaLAPACK.
%B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS)
%C Waterloo, Ontario, CA
%8 2014-08
%G eng

%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2014
%T Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C New Orleans, LA
%8 2014-11
%G eng

%0 Conference Paper
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%D 2014
%T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K dataflow
%K message-passing
%K multithreading
%K QR decomposition
%K runtime
%K systolic array
%X A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Generic
%D 2014
%T Design for a Soft Error Resilient Dynamic Task-based Runtime
%A Chongxiao Cao
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
%B ICL Technical Report
%I University of Tennessee
%8 2014-11
%G eng

%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Designing LU-QR Hybrid Solvers for Performance and Stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%K plasma
%X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and require half as many operations as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%@ 978-1-4799-3800-1
%G eng
%R 10.1109/IPDPS.2014.108

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%D 2014
%T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Sivasankaran Rajamanickam
%A Eric G. Boman
%A Mark Hoemmen
%A Michael A. Heroux
%A Stanimire Tomov
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng

%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer’s characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication-avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU up to 2.4x compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%8 2014-05
%G eng

%0 Generic
%D 2014
%T Efficient checkpoint/verification patterns for silent error detection
%A Anne Benoit
%A Yves Robert
%A Saurabh K. Raina
%X Resilience has become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their particularity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large compared with the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2014-05
%G eng
%9 LAWN 287

%0 Journal Article
%J Parallel Computing
%D 2014
%T An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems
%A Marc Baboulin
%A Dulceneia Becker
%A George Bosilca
%A Anthony Danalis
%A Jack Dongarra
%K Distributed linear algebra solvers
%K LDLT factorization
%K PaRSEC runtime
%K plasma
%K Randomized algorithms
%K Symmetric indefinite systems
%X Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, but requires only half the execution time and half the amount of data storage of a general dense solver.
%B Parallel Computing
%V 40
%P 213-223
%8 2014-07
%G eng
%N 7
%R 10.1016/j.parco.2013.12.003

%0 Conference Paper
%B International Conference on Parallel Processing (ICPP-2014)
%D 2014
%T A Fast Batched Cholesky Factorization on a GPU
%A Tingxing Dong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X Currently, state-of-the-art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, which is usually referred to as batched problems, is not given adequate attention. In this paper, we propose a batched Cholesky factorization on a GPU. Three algorithms – nonblocked, blocked, and recursive blocked – were examined. The left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking Cholesky version is used to update the trailing matrix in the recursive blocked algorithm. Our batched Cholesky achieves up to 1.8x speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver which targets large matrix sizes. Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. Such an implementation does not depend on the speed of the CPU. Compared to the MAGMA library, our full GPU solution achieves 85% of the hybrid MAGMA performance which uses 16 Sandy Bridge cores, in addition to a K40 Nvidia GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%, and finally, in terms of energy consumption, we outperform MAGMA by 1.5x in performance-per-watt for large matrices.
%B International Conference on Parallel Processing (ICPP-2014)
%C Minneapolis, MN
%8 2014-09
%G eng

%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K factorization
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given for the algorithms that form the basis for solving linear systems – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. The execution of the tasks is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng

%0 Conference Paper
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%D 2014
%T Hybrid Multi-Elimination ILU Preconditioners on GPUs
%A Dimitar Lukarski
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems.
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Generic
%D 2014
%T Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X Numerical methods in sparse linear algebra typically rely on a fast and efficient matrix vector product, as this usually is the backbone of iterative algorithms for solving eigenvalue problems or linear systems. Against the background of a large diversity in the characteristics of high performance computer architectures, it is a challenge to derive a cross-platform efficient storage format along with fast matrix vector kernels. Recently, attention focused on the SELL-C-σ format, a sliced ELLPACK format enhanced by row-sorting to reduce the fill-in when padding rows with zeros. In this paper we propose an additional modification resulting in the padded sliced ELLPACK (SELLP) format, for which we develop a sparse matrix vector CUDA kernel that is able to efficiently exploit the computing power of NVIDIA GPUs. We show that the kernel we developed outperforms straightforward implementations for the widespread CSR and ELLPACK formats, and is highly competitive with the implementations in the highly optimized CUSPARSE library.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2014-04
%G eng

%0 Journal Article
%J Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences
%D 2014
%T Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems
%A Hartwig Anzt
%A Enrique S. Quintana-Orti
%K energy efficiency
%K graphics processing units
%K High Performance Computing
%K iterative solvers
%K multicore processors
%K sparse linear systems
%X While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draw and energy spent for this purpose are increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draw as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyze the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance.
%B Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences
%V 372
%8 2014-07
%G eng
%N 2018
%R 10.1098/rsta.2013.0279

%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Improving the performance of CA-GMRES on multicores with multiple GPUs
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Hoemmen
%A Jack Dongarra
%X The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming a crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NVIDIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2014
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%K decompositional approach
%K dense linear algebra
%K parallel algorithms
%X Over the years, computational physics and chemistry served as an ongoing source of problems that demanded ever-increasing performance from hardware as well as the software that ran on top of it. Most of these problems could be translated into solutions for systems of linear equations: the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We argue that dramatic changes in hardware designs precipitated by the shifting nature of the marketplace of computer hardware had a continuous effect on the software for numerical linear algebra. The extraction of high percentages of peak performance continues to require adaptation of software. If the past history of this adaptive nature of linear algebra software is any guide, then the future theme will feature changes as well: changes aimed at harnessing the incredible advances of the evolving hardware infrastructure.
%B Journal of Parallel and Distributed Computing
%V 74
%P 2548-2560
%8 2014-07
%G eng
%N 7
%& 2548
%R 10.1016/j.jpdc.2013.10.005

%0 Conference Paper
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC)
%D 2014
%T LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU
%A Tingxing Dong
%A Azzam Haidar
%A Piotr Luszczek
%A James Harris
%A Stanimire Tomov
%A Jack Dongarra
%X Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of a few large linear systems. The size of each of these small linear systems depends, for example, on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve the numerical stability of the Gaussian Elimination, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it brings in thread divergence and non-coalesced memory accesses. The state-of-the-art libraries for linear algebra that target GPUs, such as MAGMA, focus on large matrix sizes. They change the data layout by transposing the matrix to avoid these divergence and non-coalescing penalties. However, the data movement associated with transposition is very expensive for small matrices. In this paper, we propose a batched LU factorization for GPUs by using a multi-level blocked right-looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction network simulation.
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC)
%I IEEE
%C Paris, France
%8 2014-08
%G eng

%0 Conference Paper
%B ISPASS-2014
%D 2014
%T MIAMI: A Framework for Application Performance Diagnosis
%A Gabriel Marin
%A Jack Dongarra
%A Dan Terpstra
%X A typical application tuning cycle repeats the following three steps in a loop: performance measurement, analysis of results, and code refactoring. While performance measurement is well covered by existing tools, analysis of results to understand the main sources of inefficiency and to identify opportunities for optimization is generally left to the user. Today's state of the art performance analysis tools use instrumentation or hardware counter sampling to measure the performance of interactions between code and the target architecture during execution. Such measurements are useful to identify hotspots in applications, places where execution time is spent or where cache misses are incurred. However, explanatory understanding of tuning opportunities requires a more detailed, mechanistic modeling approach. This paper presents MIAMI (Machine Independent Application Models for performance Insight), a set of tools for automatic performance diagnosis. MIAMI uses application characterization and models of target architectures to reason about an application's performance. MIAMI uses a modeling approach based on first-order principles to identify performance bottlenecks, pinpoint optimization opportunities, and compute bounds on the potential for improvement.
%B ISPASS-2014
%I IEEE
%C Monterey, CA
%8 2014-03
%@ 978-1-4799-3604-5
%G eng
%R 10.1109/ISPASS.2014.6844480

%0 Conference Paper
%B VECPAR 2014 (Best Paper)
%D 2014
%T Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Tingxing Dong
%A Jack Dongarra
%X We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32- or 64-bit floating-point precision, but uses higher-precision arithmetic to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetic, and requires about 20x more computation but about the same amount of communication as the standard orthogonalization scheme. Since the computation is becoming less expensive compared to the communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed-precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance.
%B VECPAR 2014 (Best Paper)
%C Eugene, OR
%8 2014-06
%G eng

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2014
%T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems
%A Jack Dongarra
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%K dense linear algebra
%K hardware accelerators
%K task superscalar scheduling
%X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead has become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly, in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware.
%B Supercomputing Frontiers and Innovations
%V 1
%G eng
%N 1
%R http://dx.doi.org/10.14529/jsfi1401

%0 Conference Paper
%B 8th International Conference on Partitioned Global Address Space Programming Models (PGAS)
%D 2014
%T A Multithreaded Communication Substrate for OpenSHMEM
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%X OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an acceptable cost and can improve efficiency for some operations, compared to serializing communication.
%B 8th International Conference on Partitioned Global Address Space Programming Models (PGAS)
%C Eugene, OR
%8 2014-10
%G eng

%0 Conference Paper
%B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper)
%D 2014
%T New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem
%A Azzam Haidar
%A Piotr Luszczek
%A Jack Dongarra
%X We describe a design and implementation of a multi-stage algorithm for computing eigenvectors of a dense symmetric matrix. We show that reformulating the existing algorithms is beneficial in terms of performance even if that doubles the computational complexity. Through detailed analysis, we show that the effect of the increase in the asymptotic operation count may be compensated by a much improved performance rate. Our performance results indicate that using our approach achieves very good speedup and scalability even when directly compared with the existing state-of-the-art software.
%B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper)
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%R 10.1109/IPDPSW.2014.130

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2014
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Azzam Haidar
%A Raffaele Solcà
%A Mark Gates
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%K Eigensolver
%K electronic structure calculations
%K generalized eigensolver
%K gpu
%K high performance
%K hybrid
%K Multicore
%K two-stage
%X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code.
%B International Journal of High Performance Computing Applications
%V 28
%P 196-209
%8 2014-05
%G eng
%N 2
%& 196
%R 10.1177/1094342013502097

%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Optimizing Krylov Subspace Solvers on Graphics Processing Units
%A Stanimire Tomov
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Hartwig Anzt
%A William Sawyer
%X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPU-host communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of high-level sparse linear algebra libraries.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Generic
%D 2014
%T Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI
%A Gabriel Marin
%K MIAMI
%B ICL Technical Report
%I University of Tennessee
%8 2014-02
%G eng

%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%D 2014
%T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors
%A Azzam Haidar
%A Chongxiao Cao
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%R 10.1109/ScalA.2014.8

%0 Journal Article
%J International Journal of Networking and Computing
%D 2014
%T Performance and Reliability Trade-offs for the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%K communication contention
%K in-memory checkpoint
%K performance
%K resilience
%K risk
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [23], with the non-blocking algorithm of Ni, Meneses and Kalé [15] in terms of both performance and risk. We also extend the model proposed in [23, 15] to assess the impact of the overhead associated with non-blocking communications. In addition, we deal with arbitrary failure distributions (as opposed to uniform distributions in [23]). We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work without additional memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B International Journal of Networking and Computing
%V 4
%P 32-41
%8 2014
%G eng
%& 32

%0 Generic
%D 2014
%T Performance of Various Computers Using Standard Linear Equations Software, (Linpack Benchmark Report)
%A Jack Dongarra
%X This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2014-06
%G eng

%0 Conference Paper
%B 2014 IEEE International Conference on Cluster Computing
%D 2014
%T Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models
%A Heike McCraw
%A James Ralph
%A Anthony Danalis
%A Jack Dongarra
%X For more than a decade, the PAPI performance-monitoring library has provided a clear, portable interface to the hardware performance counters available on all modern CPUs and other components of interest (e.g., GPUs, network, and I/O systems). Most major end-user tools that application developers use to analyze the performance of their applications rely on PAPI to gain access to these performance counters. One of the critical road-blockers on the way to larger, more complex high performance systems, has been widely identified as being the energy efficiency constraints. With modern extreme scale machines having hundreds of thousands of cores, the ability to reduce power consumption for each CPU at the software level becomes critically important, both for economic and environmental reasons. In order for PAPI to continue playing its well established role in HPC, it is pressing to provide valuable performance data that not only originates from within the processing cores but also delivers insight into the power consumption of the system as a whole. An extensive effort has been made to extend the Performance API to support power monitoring capabilities for various platforms. This paper provides detailed information about three components that allow power monitoring on the Intel Xeon Phi and Blue Gene/Q. Furthermore, we discuss the integration of PAPI in PARSEC – a task-based dataflow-driven execution engine – enabling hardware performance counter and power monitoring at true task granularity.
%B 2014 IEEE International Conference on Cluster Computing
%I IEEE
%C Madrid, Spain
%8 2014-09
%G eng
%R 10.1109/CLUSTER.2014.6968672

%0 Conference Paper
%B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC)
%D 2014
%T PTG: An Abstraction for Unhindered Parallelism
%A Anthony Danalis
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Jack Dongarra
%K dte
%K parsec
%K plasma
%X Increased parallelism and use of heterogeneous computing resources is now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice was invariably derived from Coarse Grain Parallelism with explicit data movements. We argue that message passing has remained the de facto standard in HPC because, until now, the ever increasing challenges that application developers had to address to create efficient portable applications remained manageable for expert programmers. Data-flow based programming is an alternative approach with significant potential. In this paper, we discuss the Parameterized Task Graph (PTG) abstraction and present the specialized input language that we use to specify PTGs in our data-flow task-based runtime system, PaRSEC. This language and the corresponding execution model are in contrast with the execution model of explicit message passing as well as the model of alternative task based runtime systems. The Parameterized Task Graph language decouples the expression of the parallelism in the algorithm from the control-flow ordering, load balance, and data distribution. Thus, programs are more adaptable and map more efficiently on challenging hardware, as well as maintain portability across diverse architectures. To support these claims, we discuss the different challenges of HPC programming and how PaRSEC can address them, and we demonstrate that in today’s large scale supercomputers, PaRSEC can significantly outperform state-of-the-art MPI applications and libraries, a trend that will increase with future architectural evolution.
%B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC)
%I IEEE Press
%C New Orleans, LA
%8 2014-11
%G eng

%0 Generic
%D 2014
%T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson.
%B University of Tennessee EECS Technical Report
%I University of Tennessee
%8 2014-11
%G eng

%0 Conference Proceedings
%B International conference on Supercomputing
%D 2014
%T Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores
%A Fengguang Song
%A Jack Dongarra
%X While the growing number of cores per chip allows researchers to solve larger scientific and engineering problems, the parallel efficiency of the deployed parallel software starts to decrease. This unscalability problem happens to both vendor-provided and open-source software and wastes CPU cycles and energy. By expecting CPUs with hundreds of cores to be imminent, we have designed a new framework to perform matrix computations for massively many cores. Our performance analysis on manycore systems shows that the unscalability bottleneck is related to Non-Uniform Memory Access (NUMA): memory bus contention and remote memory access latency. To overcome the bottleneck, we have designed NUMA-aware tile algorithms with the help of a dynamic scheduling runtime system to minimize NUMA memory accesses. The main idea is to identify the data that is either read a number of times or written once by a thread resident on a remote NUMA node, then utilize the runtime system to conduct data caching and movement between different NUMA nodes. Based on the experiments with QR factorizations, we demonstrate that our framework is able to achieve great scalability on a 48-core AMD Opteron system (e.g., parallel efficiency drops only 3% from one core to 48 cores). We also deploy our framework to an extreme-scale shared-memory SGI machine which has 1024 CPU cores and runs a single Linux operating system image. Our framework continues to scale well, and can outperform the vendor-optimized Intel MKL library by up to 750%.
%B International conference on Supercomputing
%I ACM
%C Munich, Germany
%P 333-342
%8 2014-06
%@ 978-1-4503-2642-1
%G eng
%R 10.1145/2597652.2597670

%0 Conference Paper
%B VISSOFT'14: 2nd IEEE Working Conference on Software Visualization
%D 2014
%T Search Space Pruning Constraints Visualization
%A Blake Haugen
%A Jakub Kurzak
%X The field of software optimization, among others, is interested in finding an optimal solution in a large search space. These search spaces are often large, complex, non-linear and even non-continuous at times. The size of the search space makes a brute force solution intractable. As a result, one or more search space pruning constraints are often used to reduce the number of candidate configurations that must be evaluated in order to solve the optimization problem. If more than one pruning constraint is employed, it can be challenging to understand how the pruning constraints interact and overlap. This work presents a visualization technique based on a radial, space-filling technique that allows the user to gain a better understanding of how the pruning constraints remove candidates from the search space. The technique is then demonstrated using a search space pruning data set derived from the optimization of a matrix multiplication code for NVIDIA CUDA accelerators.
%B VISSOFT'14: 2nd IEEE Working Conference on Software Visualization
%I IEEE
%C Victoria, BC, Canada
%8 2014-09
%G eng

%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures
%A Hartwig Anzt
%A Dimitar Lukarski
%A Stanimire Tomov
%A Jack Dongarra
%X Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this twofold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-colored Gauss-Seidel, ILU(0) and multi-colored ILU(0), shows that the twofold goal of a numerically stable cross-platform performant algorithm is achieved.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng

%0 Conference Paper
%B IPDPS 2014
%D 2014
%T A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K CUDA
%K FEM
%K Finite element method
%K linear algebra
%K nVidia
%K Tesla K20
%X Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs [10, 1]. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, so that they substantially exceed the vendor’s library routines in performance. We proposed an autotuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5x overall speedup and 1.42x energy efficiency (greenup) using 4th order (Q4) finite elements, and a 1.9x speedup and 1.27x greenup using 2nd order (Q2) finite elements.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Conference Paper
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%D 2014
%T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes
%A Xavier Lacoste
%A Mathieu Faverge
%A Pierre Ramet
%A Samuel Thibault
%A George Bosilca
%K DAG based runtime
%K gpu
%K Multicore
%K Sparse linear solver
%X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces the application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers (native PaStiX, StarPU, and PaRSEC schedulers) on different execution contexts is performed. The analysis highlights the similarities from a performance point of view between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments, and are, therefore, a sustainable solution for hybrid environments.
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Conference Paper
%B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC)
%D 2014
%T Task-Based Programming for Seismic Imaging: Preliminary Results
%A Lionel Boillot
%A George Bosilca
%A Emmanuel Agullo
%A Henri Calandra
%K plasma
%X The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it is yet to be demonstrated that industrial parallel codes, which rely on the classical Message Passing Interface (MPI) standard and accumulate dozens of years of expertise (and countless lines of code), may be revisited to turn them into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for Seismic Imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node, and can more efficiently exploit complex hardware such as a cache coherent Non Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator.
%B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC)
%I IEEE
%C Paris, France
%8 2014-08
%G eng

%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment
%A Azzam Haidar
%A Chongxiao Cao
%A Jack Dongarra
%A Piotr Luszczek
%A Stanimire Tomov
%K algorithms
%K Computer science
%K CUDA
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%K nVidia
%K Tesla K20
%K Tesla M2090
%X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors
%A José I. Aliaga
%A Hartwig Anzt
%A Maribel Castillo
%A Juan C. Fernández
%A Germán León
%A Joaquín Pérez
%A Enrique S. Quintana-Orti
%K CG
%K CPUs
%K energy efficiency
%K GPUs
%K low-power architectures
%X In this paper, we analyze the interactions occurring in the triangle performance-power-energy for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to building faster processors. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights about the benefits of hands-on optimizations, we include runtime and energy efficiency results for both out-of-the-box usage relying exclusively on compiler optimizations, and implementations manually optimized for target architectures that range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 885-904
%8 2014-09
%G eng
%U http://dx.doi.org/10.1002/cpe.3341
%N 4
%& 885
%R 10.1002/cpe.3341

%0 Conference Paper
%B 2014 IEEE International Conference on Cluster Computing
%D 2014
%T Utilizing Dataflow-based Execution for Coupled Cluster Methods
%A Heike McCraw
%A Anthony Danalis
%A George Bosilca
%A Jack Dongarra
%A Karol Kowalski
%A Theresa Windus
%X Computational chemistry comprises one of the driving forces of High Performance Computing. In particular, many-body methods, such as Coupled Cluster (CC) methods of the quantum chemistry package NWCHEM, are of particular interest for the applied chemistry community. Harnessing large fractions of the processing power of modern large scale computing platforms has become increasingly difficult. With the increase in scale, complexity, and heterogeneity of modern platforms, traditional programming models fail to deliver the expected performance scalability. On our way to Exascale and with these extremely hybrid platforms, dataflow-based programming models may be the only viable way for achieving and maintaining computation at scale. In this paper, we discuss a dataflow-based programming model and its applicability to NWCHEM’s CC methods. Our dataflow version of the CC kernels breaks down the algorithm into fine-grained tasks with explicitly defined data dependencies. As a result, many of the traditional synchronization points can be eliminated, allowing for a dynamic reshaping of the execution based on the ongoing availability of computational resources. We build this experiment using PARSEC – a task-based dataflow-driven execution engine – that enables efficient task scheduling on distributed systems, providing a desirable portability layer for application developers.
%B 2014 IEEE International Conference on Cluster Computing
%I IEEE
%C Madrid, Spain
%8 2014-09
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software (also LAWN 246)
%D 2013
%T Accelerating Linear System Solutions Using Randomization Techniques
%A Marc Baboulin
%A Jack Dongarra
%A Julien Herrmann
%A Stanimire Tomov
%K algorithms
%K dense linear algebra
%K experimentation
%K graphics processing units
%K linear systems
%K lu factorization
%K multiplicative preconditioning
%K numerical linear algebra
%K performance
%K plasma
%K randomization
%X We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at a very affordable computational price while providing us with a satisfying accuracy when compared to partial pivoting. This random transformation called Partial Random Butterfly Transformation (PRBT) is optimized in terms of data storage and flops count. We propose a solver where PRBT and the LU factorization with no pivoting take advantage of the current hybrid multicore/GPU machines and we compare its Gflop/s performance with a solver implemented in a current parallel library.
%B ACM Transactions on Mathematical Software (also LAWN 246)
%V 39
%8 2013-02
%G eng
%U http://dl.acm.org/citation.cfm?id=2427025
%N 2
%R 10.1145/2427023.2427025

%0 Generic
%D 2013
%T Analyzing PAPI Performance on Virtual Machines
%A John Nelson
%X Over the last ten years, virtualization techniques have become much more widely popular as a result of fast and cheap processors. Virtualization provides many benefits making it appealing for testing environments. Encapsulating configurations is a huge motivator for wanting to do performance testing on virtual machines. Provisioning, a technique that is used by FutureGrid, is also simplified using virtual machines. Virtual machines enable portability among heterogeneous systems while providing an identical configuration within the guest operating system.    My work in ICL has focused on using PAPI inside of virtual machines. There were two main areas of focus throughout my research. The first originated because of anomalous results of the HPC Challenge Benchmark reported in a paper submitted by ICL [3] in which the order of input sizes tested impacted run time on virtual machines but not on bare metal. A discussion of this anomaly will be given in section II along with a discussion of timers used in virtual machines. The second area of focus was exploring the recently implemented support by KVM (Kernel-based Virtual Machine) and VMware for guest OS level performance counters. A discussion of application tests run to observe the behavior of event counts measured in a virtual machine as well as a discussion of information learned pertinent to event measurement will be given in section III.
%B ICL Technical Report
%8 2013-08
%G eng

%0 Generic
%D 2013
%T Assessing the impact of ABFT and Checkpoint composite strategies
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%K ABFT
%K checkpoint
%K fault-tolerance
%K High-performance computing
%K resilience
%X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We show that this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data needed to be checkpointed.
%B University of Tennessee Computer Science Technical Report
%G eng

%0 Conference Paper
%B International Supercomputing Conference 2013 (ISC'13)
%D 2013
%T Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q
%A Heike McCraw
%A Dan Terpstra
%A Jack Dongarra
%A Kris Davis
%A Roy Musselman
%B International Supercomputing Conference 2013 (ISC'13)
%I Springer
%C Leipzig, Germany
%8 2013-06
%G eng

%0 Journal Article
%J The Computer Journal
%D 2013
%T BlackjackBench: Portable Hardware Characterization with Automated Results Analysis
%A Anthony Danalis
%A Piotr Luszczek
%A Gabriel Marin
%A Jeffrey Vetter
%A Jack Dongarra
%K hardware characterization
%K micro-benchmarks
%K statistical analysis
%X DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes the application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, plus statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We aim at hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and non-uniform memory access characteristics of the system, properties of the processing cores affecting the instruction execution speed and the length of the operating system scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features could potentially interfere with each other resulting in incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation of results. We show how effective hardware metrics from our probes allow guided tuning of computational kernels that outperform an autotuning library further tuned by the hardware vendor.
%B The Computer Journal
%8 2013-03
%G eng
%R 10.1093/comjnl/bxt057

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2013
%T A Block-Asynchronous Relaxation Method for Graphics Processing Units
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%X In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPU's computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes (using multiple iterations within the “subdomain” handled by a GPU thread block), we manage to not only recover the loss of global convergence but often accelerate convergence by up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing.
%B Journal of Parallel and Distributed Computing
%V 73
%P 1613-1626
%8 2013-12
%G eng
%N 12
%R 10.1016/j.jpdc.2013.05.008

%0 Generic
%D 2013
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B University of Tennessee Technical Report (Lawn 275)
%I University of Tennessee
%8 2013-03
%G eng

%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Correlated Set Coordination in Fault Tolerant Message Logging Protocols
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X With our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
%B Concurrency and Computation: Practice and Experience
%V 25
%P 572-585
%8 2013-03
%G eng
%N 4
%R 10.1002/cpe.2859

%0 Conference Proceedings
%B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2013
%T CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience
%A Yulu Jia
%A Piotr Luszczek
%A George Bosilca
%A Jack Dongarra
%X Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur.
%B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C Montpellier, France
%8 2013-11
%G eng

%0 Journal Article
%J Scalable Computing and Communications: Theory and Practice
%D 2013
%T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Piotr Luszczek
%A Jack Dongarra
%E Samee Khan
%E Lin-Wang Wang
%E Albert Zomaya
%B Scalable Computing and Communications: Theory and Practice
%I John Wiley & Sons
%P 699-735
%8 2013-03
%G eng

%0 Generic
%D 2013
%T Designing LU-QR hybrid solvers for performance and stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 282)
%I University of Tennessee
%8 2013-10
%G eng

%0 Conference Paper
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%D 2013
%T Diagnosis and Optimization of Application Prefetching Performance
%A Gabriel Marin
%A Colin McCurdy
%A Jeffrey Vetter
%E Allen D. Malony
%E Mario Nemirovsky
%E Sam Midkiff
%X Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, hardware prefetchers can track only a limited number of data streams due to finite hardware resources. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for understanding the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we try to understand the causes behind poor prefetching performance. We identified four prefetch-unfriendly conditions and we show how to classify an application's memory references based on these conditions. We evaluated our analysis using the SPEC CPU2006 benchmark suite. We selected two benchmarks with unfavorable access patterns and transformed them to improve their prefetching effectiveness. Results show that making applications more prefetcher-friendly can yield meaningful performance gains.
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%I ACM Press
%C Eugene, Oregon, USA
%8 2013-06
%@ 9781450321303
%G eng
%U http://dl.acm.org/citation.cfm?doid=2464996.2465014
%R 10.1145/2464996.2465014

%0 Generic
%D 2013
%T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU computing approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on high-end hybrid CPU/GPU systems show that our dynamically balanced synchronization-avoiding LU is both multicore and GPU scalable. Comparisons with state-of-the-art libraries like MKL (for multicore) and MAGMA (for hybrid systems) are provided, demonstrating significant performance improvements. The approach is applicable to other linear algebra algorithms. The scheduling mechanisms and tuning models can be incorporated into dynamic runtime systems/schedulers and autotuning frameworks, respectively, for hybrid CPU/MIC/GPU architectures.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Conference Paper
%B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems
%D 2013
%T Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures
%A Volodymyr Turchenko
%A George Bosilca
%A Aurelien Bouteiller
%A Jack Dongarra
%K many-core system
%K parallel batch pattern training
%K parallelization efficiency
%K recirculation neural network
%X This paper presents experimental research on the parallel batch pattern back propagation training algorithm, using a recirculation neural network as an example, on many-core high performance computing systems. The choice of the recirculation neural network over the multilayer perceptron, recurrent, and radial basis neural networks is justified. The model of a recirculation neural network and the usual sequential batch pattern algorithm of its training are theoretically described. An algorithmic description of the parallel version of the batch pattern training method is presented. The experiments are carried out using the Open MPI, Mvapich and Intel MPI message passing libraries. The results obtained on a many-core AMD system and Intel MIC are compared with the results obtained on a cluster system. Our results show that the parallelization efficiency is about 95% on 12 cores located inside one physical AMD processor for the considered minimum and maximum scenarios. The parallelization efficiency is about 70-75% on 48 AMD cores for the minimum and maximum scenarios. These results are higher by 15-36% (depending on the version of MPI library) in comparison with the results obtained on 48 cores of a cluster system. The parallelization efficiency obtained on Intel MIC architecture is surprisingly low, calling for deeper analysis.
%B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems
%C Berlin, Germany
%8 2013-09
%G eng

%0 Journal Article
%J Journal of Supercomputing
%D 2013
%T Enabling Workflows in GridSolve: Request Sequencing and Service Trading
%A Yinan Li
%A Asim YarKhan
%A Jack Dongarra
%A Keith Seymour
%A Aurélie Hurault
%K grid computing
%K gridpac
%K netsolve
%K service trading
%K workflow applications
%X GridSolve employs an RPC-based client-agent-server model for solving computational problems. There are two deficiencies associated with GridSolve when a computational problem essentially forms a workflow consisting of a sequence of tasks with data dependencies between them. First, intermediate results are always passed through the client, resulting in unnecessary data transport. Second, since the execution of each individual task is a separate RPC session, it is difficult to enable any potential parallelism among tasks. This paper presents a request sequencing technique that addresses these deficiencies and enables workflow executions. Building on the request sequencing work, one way to generate workflows is by taking higher level service requests and decomposing them into a sequence of simpler service requests using a technique called service trading. A service trading component is added to GridSolve to take advantage of the new dynamic request sequencing. The features described here include automatic DAG construction and data dependency analysis, direct interserver data transfer, parallel task execution capabilities, and a service trading component.
%B Journal of Supercomputing
%V 64
%P 1133-1152
%8 2013-06
%G eng
%N 3
%& 1133
%R 10.1007/s11227-010-0549-1

%0 Journal Article
%J Computing
%D 2013
%T An evaluation of User-Level Failure Mitigation support in MPI
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A Joshua Hursey
%A George Bosilca
%A Jack Dongarra
%K Fault tolerance
%K MPI
%K User-level fault mitigation
%X As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.
%B Computing
%V 95
%P 1171-1184
%8 2013-12
%G eng
%N 12
%R 10.1007/s00607-013-0331-3

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: (i) traditional checkpoint-based approaches incur a steep overhead on failure-free operations and (ii) the dominant programming paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support of software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI Standard, to enable advanced forward recovery techniques, without incurring the overhead of customary periodic checkpointing. With our approach, when failure strikes, applications regain control to make a checkpoint before quitting execution. This checkpoint is in reaction to the failure occurrence rather than periodic. This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example. Published 2013. This article is a US Government work and is in the public domain in the USA.
%B Concurrency and Computation: Practice and Experience
%8 2013-07
%G eng
%U http://doi.wiley.com/10.1002/cpe.3100
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3100

%0 Journal Article
%J Parallel Computing
%D 2013
%T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Mathias Jacquelin
%A Julien Langou
%A Yves Robert
%K Cluster
%K Distributed memory
%K Hierarchical architecture
%K multi-core
%K numerical linear algebra
%K QR factorization
%X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka “communication-avoiding”), it is natural to consider hierarchical trees composed of an “inter-node” tree which acts on top of “intra-node” trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) “TS level” for cache-friendliness, (1) “low-level” for decoupled highly parallel inter-node reductions, (2) “domino level” to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
%B Parallel Computing
%V 39
%P 212-232
%8 2013-05
%G eng
%N 4-5

%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K algorithms
%K bidiagonal reduction
%K bulge chasing
%K data translation layer
%K dynamic scheduling
%K high performance kernels
%K performance
%K tile algorithms
%K two-stage approach
%X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%G eng
%N 3
%R 10.1145/2450153.2450154

%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%K exascale
%K hpc challenge
%K hpcc
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%I Taylor and Francis
%C Boca Raton, FL
%@ 978-1-4665-6834-1
%G eng
%& 2

%0 Generic
%D 2013
%T Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%X The explosion of parallelism and heterogeneity in today's computer architectures has created opportunities as well as challenges for redesigning legacy numerical software to harness the power of new hardware. In this paper we address the main challenges in redesigning BLAST – a numerical library that solves the equations of compressible hydrodynamics using high order finite element methods (FEM) in a moving Lagrangian frame – to support CPU-GPU clusters. We use a hybrid MPI + OpenMP + CUDA programming model that includes two layers: domain decomposed MPI parallelization and OpenMP + CUDA acceleration in a given domain. To optimize the code, we implemented custom linear algebra kernels and introduced an auto-tuning technique to deal with heterogeneity and load balancing at runtime. Our tests show that 12 Intel Xeon cores and two M2050 GPUs deliver a 24x speedup compared to a single core, and a 2.5x speedup compared to 12 MPI tasks in one node. Further, we achieve perfect weak scaling, demonstrated on a cluster with up to 64 GPUs in 32 nodes. Our choice of programming model and proposed solutions, as related to parallelism and load balancing, specifically targets high order FEM discretizations, and can be used equally successfully for applications beyond hydrodynamics. A major accomplishment is that we further establish the appeal of high order FEMs, which despite their better approximation properties, are often avoided due to their high computational cost. GPUs, as we show, have the potential to make them the method of choice, as the increased computational cost is also localized, e.g., cast as Level 3 BLAS, and thus can be done very efficiently (close to "free" relative to the usual overheads inherent in sparse computations).
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Journal Article
%J IPDPS 2013 (submitted)
%D 2013
%T Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures
%A Ichitaro Yamazaki
%A Dulceneia Becker
%A Jack Dongarra
%A Alex Druinsky
%A I. Peled
%A Sivan Toledo
%A Grey Ballard
%A James Demmel
%A Oded Schwartz
%X Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen’s algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
%B IPDPS 2013 (submitted)
%C Boston, MA
%8 2013-00
%G eng

%0 Generic
%D 2013
%T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC
%A Guillaume Aupy
%A Mathieu Faverge
%A Yves Robert
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high-level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
%B Lawn 277
%8 2013-05
%G eng

%0 Conference Paper
%B Supercomputing 2013
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%B Supercomputing 2013
%C Denver, CO
%8 2013-11
%G eng

%0 Generic
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%K lapack
%K plasma
%K scalapack
%B University of Tennessee Computer Science Technical Report (also LAWN 283)
%I University of Tennessee
%8 2013-10
%G eng

%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T Keeneland: Computational Science Using Heterogeneous GPU Computing
%A Jeffrey Vetter
%A Richard Glassbrook
%A Karsten Schwan
%A Sudha Yalamanchili
%A Mitch Horton
%A Ada Gavrilovska
%A Magda Slawinska
%A Jack Dongarra
%A Jeremy Meredith
%A Philip Roth
%A Kyle Spafford
%A Stanimire Tomov
%A John Wynkoop
%X The Keeneland Project is a five year Track 2D grant awarded by the National Science Foundation (NSF) under solicitation NSF 08-573 in August 2009 for the development and deployment of an innovative high performance computing system. The Keeneland project is led by the Georgia Institute of Technology (Georgia Tech) in collaboration with the University of Tennessee at Knoxville, National Institute of Computational Sciences, and Oak Ridge National Laboratory.
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%S CRC Computational Science Series
%I Taylor and Francis
%C Boca Raton, FL
%G eng
%& 7

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2013
%T Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms
%A Teng Ma
%A George Bosilca
%A Aurelien Bouteiller
%A Jack Dongarra
%K Cluster
%K Collective communication
%K Hierarchical
%K HPC
%K MPI
%K Multicore
%X Multicore Clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel assisted mechanisms. However, on distributed environments, a single level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications.    In this work, we present HierKNEM, a kernel-assisted topology-aware collective framework, and the mechanisms deployed by this framework to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. We demonstrate experimentally, by considering three of the most used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to modifications of the underlying process-core binding; (2) it outperforms state-of-art MPI libraries (Open MPI, MPICH2 and MVAPICH2) demonstrating up to a 30x speedup for synthetic benchmarks, and up to a 3x acceleration for a parallel graph application (ASP); (3) it furthermore demonstrates a linear speedup with the increase of the number of cores per compute node, a paramount requirement for scalability on future many-core hardware.
%B Journal of Parallel and Distributed Computing
%V 73
%P 1000-1010
%8 2013-07
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0743731513000166
%N 7
%R 10.1016/j.jpdc.2013.01.015

%0 Book Section
%B Handbook of Linear Algebra
%D 2013
%T LAPACK
%A Zhaojun Bai
%A James Demmel
%A Jack Dongarra
%A Julien Langou
%A Jenny Wang
%X With a substantial amount of new material, the Handbook of Linear Algebra, Second Edition provides comprehensive coverage of linear algebra concepts, applications, and computational software packages in an easy-to-use format. It guides you from the very elementary aspects of the subject to the frontiers of current research. Along with revisions and updates throughout, the second edition of this bestseller includes 20 new chapters.
%B Handbook of Linear Algebra
%7 Second
%I CRC Press
%C Boca Raton, FL
%@ 9781466507289
%G eng

%0 Conference Proceedings
%B International Supercomputing Conference (ISC)
%D 2013
%T Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Raffaele Solcà
%A Thomas C. Schulthess
%X Today’s high computational demands from engineering fields and complex hardware development make it necessary to develop and optimize new algorithms toward achieving high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe and analyze a successful methodology to address the challenges—starting from our algorithm design, kernel optimization and tuning, to our programming model—in the development of a scalable high-performance generalized eigenvalue solver in the context of electronic structure calculations in materials science applications. We developed a set of leading edge dense linear algebra algorithms, as part of a generalized eigensolver, featuring fine grained memory aware kernels, a task based approach and hybrid execution/scheduling. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. We report the performance impact on the generalized eigensolver when different fractions of eigenvectors are needed. The algorithm described provides an enormous performance boost compared to current GPU-based solutions, and performance comparable to state-of-the-art distributed solutions, using a single node with multiple GPUs.
%B International Supercomputing Conference (ISC)
%7 Lecture Notes in Computer Science
%I Springer Berlin Heidelberg
%C Leipzig, Germany
%V 7905
%P 67-80
%8 2013-06
%@ 978-3-642-38750-0
%G eng
%R 10.1007/978-3-642-38750-0_6

%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A José Herrero
%A Julien Langou
%X Four routines called DPOTF3i, i = a,b,c,d, are presented. DPOTF3i are a novel type of level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and LAPACK routine DPOTRF. Performance of routines DPOTF3i are still increasing when the performance of Level-2 routine DPOTF2 of LAPACK starts decreasing. This is our main result and it implies, due to the use of larger block size nb, that DGEMM, DSYRK, and DTRSM performance also increases! The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers. Thus, our four routines have different register blocking sizes.    BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of LAPACK POTRF source codes. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is “identical” to Square Block Packed Format (SBPF). “LAPACK” implementations on multicore processors use SBPF. Lower BPF is less efficient than upper BPF. Vector inplace transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb as well as results for large n comparing DBPTRF versus DPOTRF.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%8 2013-02
%G eng
%N 2
%R 10.1145/2427023.2427026

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2013
%T LU Factorization with Partial Pivoting for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K accelerator
%K Gaussian elimination
%K gpu
%K lu factorization
%K manycore
%K Multicore
%K partial pivoting
%K plasma
%X LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
%B IEEE Transactions on Parallel and Distributed Systems
%V 24
%P 1613-1621
%8 2013-08
%G eng
%N 8
%& 1613
%R 10.1109/TPDS.2012.242

%0 Generic
%D 2013
%T Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization
%A Aurelien Bouteiller
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%X Failures are increasingly threatening the eciency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to generalpurpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing per application completion time is unchanged, while it delivers near-perfect platform efficiency.
%B University of Tennessee Computer Science Technical Report
%8 2013-02
%G eng

%0 Conference Paper
%B Euro-Par 2013
%D 2013
%T Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization
%A Aurelien Bouteiller
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing per application completion time is unchanged, while it delivers near-perfect platform efficiency.
%B Euro-Par 2013
%I Springer
%C Aachen, Germany
%8 2013-08
%G eng

%0 Journal Article
%J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%D 2013
%T Multithreading in the PLASMA Library
%A Jakub Kurzak
%A Piotr Luszczek
%A Asim YarKhan
%A Mathieu Faverge
%A Julien Langou
%A Henricus Bouwmeester
%A Jack Dongarra
%E Mohamed Ahmed
%E Reda Ammar
%E Sanguthevar Rajasekaran
%K plasma
%B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%I Taylor & Francis
%8 2013-00
%G eng

%0 Conference Paper
%B 2013 IEEE International Symposium on Performance Analysis of Systems and Software
%D 2013
%T Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations
%A Vincent Weaver
%A Dan Terpstra
%A Shirley Moore
%B 2013 IEEE International Symposium on Performance Analysis of Systems and Software
%I IEEE
%C Austin, TX
%8 2013-04
%G eng

%0 Generic
%D 2013
%T Optimal Checkpointing Period: Time vs. Energy
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 281)
%I University of Tennessee
%8 2013-10
%G eng

%0 Generic
%D 2013
%T PAPI 5: Measuring Power, Energy, and the Cloud
%A Vincent Weaver
%A Dan Terpstra
%A Heike McCraw
%A Matt Johnson
%A Kiran Kasichayanula
%A James Ralph
%A John Nelson
%A Phil Mucci
%A Tushar Mohan
%A Shirley Moore
%I 2013 IEEE International Symposium on Performance Analysis of Systems and Software
%C Austin, TX
%8 2013-04
%G eng

%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013
%D 2013
%T Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance
%A Yulu Jia
%A George Bosilca
%A Piotr Luszczek
%A Jack Dongarra
%X This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
%B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013
%C Denver, CO
%8 2013-11
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2013)
%D 2013
%T A Parallel Solver for Incompressible Fluid Flows
%A Yushan Wang
%A Marc Baboulin
%A Joël Falcou
%A Yann Fraigneau
%A Olivier Le Maître
%K ADI
%K Navier-Stokes equations
%K Parallel computing
%K Partial diagonalization
%K Prediction-projection
%K SIMD
%X The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlinearity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply the prediction-projection method which transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we apply the Alternating Direction Implicit (ADI) method resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization which transforms the Laplacian operator into a tridiagonal one. We describe an implementation based on MPI where the computations are performed on each subdomain and information is exchanged on the interfaces, and where the tridiagonal system solutions are accelerated using vectorization techniques. We present performance results on a current multicore system.
%B International Conference on Computational Science (ICCS 2013)
%I Elsevier B.V.
%C Barcelona, Spain
%8 2013-06
%G eng
%R 10.1016/j.procs.2013.05.207

%0 Journal Article
%J IEEE Computing in Science and Engineering
%D 2013
%T PaRSEC: Exploiting Heterogeneity to Enhance Scalability
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Thomas Herault
%A Jack Dongarra
%X New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution environment with sufficient intelligence and flexibility to rearrange the execution to improve resource utilization.
%B IEEE Computing in Science and Engineering
%V 15
%P 36-45
%8 2013-11
%G eng
%N 6
%R 10.1109/MCSE.2013.98

%0 Generic
%D 2013
%T Performance of Various Computers Using Standard Linear Equations Software
%A Jack Dongarra
%X This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers.
%B University of Tennessee Computer Science Technical Report
%8 2013-02
%G eng

%0 Conference Paper
%B PPAM 2013
%D 2013
%T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K mic
%K xeon phi
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B PPAM 2013
%C Warsaw, Poland
%8 2013-09
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2013
%T Post-failure recovery of MPI communication capability: Design and rationale
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivation behind the design choices are weighted against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.
%B International Journal of High Performance Computing Applications
%V 27
%P 244-254
%8 2013-01
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238
%N 3
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342013488238

%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium
%D 2013
%T Revisiting the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kale [1], with the non-blocking algorithm of Ni, Meneses and Kale [2] in terms of both performance and risk. We also extend the model proposed in [1], [2] to assess the impact of the overhead associated to non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium
%C Boston, MA
%8 2013-05
%G eng

%0 Generic
%D 2013
%T Revisiting the Double Checkpointing Algorithm
%A Jack Dongarra
%A Thomas Herault
%A Yves Robert
%K checkpoint algorithm
%K communication overlap
%K fault-tolerance
%K performance model
%K resilience
%X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé, with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model that they have proposed to assess the impact of the overhead associated to non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
%B University of Tennessee Computer Science Technical Report (LAWN 274)
%8 2013-01
%G eng

%0 Book Section
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing
%D 2013
%T Scalable Dense Linear Algebra on Heterogeneous Hardware
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. And it is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology of dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand, when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing
%G eng

%0 Journal Article
%J Journal of Computational Science
%D 2013
%T Soft Error Resilient QR Factorization for Hybrid System with GPGPU
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K gpgpu
%K gpu
%K magma
%X The general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. What followed is the fact that fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
%B Journal of Computational Science
%V 4
%P 457-464
%8 2013-11
%G eng
%N 6
%R 10.1016/j.jocs.2013.01.004

%0 Conference Paper
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%D 2013
%T Standards for Graph Algorithm Primitives
%A Tim Mattson
%A David Bader
%A Jon Berry
%A Aydin Buluc
%A Jack Dongarra
%A Christos Faloutsos
%A John Feo
%A John Gilbert
%A Joseph Gonzalez
%A Bruce Hendrickson
%A Jeremy Kepner
%A Charles Leiserson
%A Andrew Lumsdaine
%A David Padua
%A Steve W. Poole
%A Steve Reinhardt
%A Mike Stonebraker
%A Steve Wallach
%A Andrew Yoo
%K algorithms
%K graphs
%K linear algebra
%K software standards
%X It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%I IEEE
%C Waltham, MA
%8 2013-09
%G eng
%R 10.1109/HPEC.2013.6670338

%0 Generic
%D 2013
%T Toward a New Metric for Ranking High Performance Computing Systems
%A Michael A. Heroux
%A Jack Dongarra
%X The High Performance Linpack (HPL), or Top 500, benchmark is the most widely recognized and discussed metric for ranking high performance computing systems. However, HPL is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. In this paper we describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns more commonly found in applications. Using HPCG we strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement.
%B SAND2013 - 4744
%8 2013-06
%G eng
%U http://www.netlib.org/utk/people/JackDongarra/PAPERS/HPCG-Benchmark-utk.pdf

%0 Conference Paper
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%D 2013
%T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%E Allen D. Malony
%E Nemirovsky, Mario
%E Midkiff, Sam
%K eigenvalue
%K gpu communication
%K gpu computation
%K heterogeneous programming model
%K performance
%K reduction to tridiagonal
%K singular value decomposition
%K task parallelism
%X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges---starting from our algorithm design, kernel optimization and tuning, to our programming model---in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores.
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%I ACM Press
%C Eugene, Oregon, USA
%8 2013-06
%@ 9781450321303
%G eng
%U http://dl.acm.org/citation.cfm?doid=2464996.2465438
%R 10.1145/2464996.2465438

%0 Generic
%D 2013
%T Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) are gaining widespread usage in the field of scientific computing owing to the performance boost GPUs bring to computation-intensive applications. The typical configuration is to integrate GPUs and CPUs in the same system, where the CPUs handle the control flow and part of the computation workload, and the GPUs serve as accelerators that carry out the bulk of the data-parallel compute workload. In this paper we design and implement a soft-error-resilient Hessenberg reduction algorithm on GPU-based hybrid platforms. Our design employs an algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation. We detect and correct soft errors on-line without delaying the detection and correction to the end of the factorization. By utilizing idle time of the CPUs and overlapping both host-side and GPU-side workloads, we minimize the observed overhead. Experimental results validate our design philosophy. Our algorithm introduces less than 2% performance overhead compared to the non-fault-tolerant hybrid Hessenberg reduction algorithm.
%B UT-CS-13-712
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Raffaele Solcà
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%X For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher-levels of software stacks, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques.
%B Concurrency and Computation: Practice and Experience
%8 2013-10
%G eng

%0 Conference Paper
%B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%D 2013
%T Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Stanimire Tomov
%A Jack Dongarra
%B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%8 2013-05
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
%B Concurrency and Computation: Practice and Experience
%8 2013-11
%G eng
%R 10.1002/cpe.3173

%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 2013-05
%G eng
%R 10.1109/IPDPS.2013.119

%0 Generic
%D 2012
%T Acceleration of the BLAST Hydro Code on GPU
%A Tingxing Dong
%A Tzanio Kolev
%A Robert Rieben
%A Veselin Dobrev
%A Stanimire Tomov
%A Jack Dongarra
%B Supercomputing '12 (poster)
%I SC12
%C Salt Lake City, Utah
%8 2012-11
%G eng

%0 Conference Proceedings
%B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012
%D 2012
%T Algorithm-Based Fault Tolerance for Dense Matrix Factorization
%A Peng Du
%A Aurelien Bouteiller
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%E J. Ramanujam
%E P. Sadayappan
%K ft-la
%K ftmpi
%X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases with the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors.
%B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012
%I ACM
%C New Orleans, LA, USA
%P 225-234
%8 2012-02
%G eng
%R 10.1145/2145816.2145845

%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Conference Proceedings
%B 2012 IEEE High Performance Extreme Computing Conference
%D 2012
%T Anatomy of a Globally Recursive Embedded LINPACK Benchmark
%A Piotr Luszczek
%A Jack Dongarra
%X We present a complete bottom-up implementation of an embedded LINPACK benchmark on iPad 2. We use a novel formulation of a recursive LU factorization that is recursive and parallel at the global scope. We believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show an assembly API that allows us a much higher level of abstraction and provides rapid code development within the confines of the mobile device SDK. We use performance modeling to help with the limitations of the device and the limited access to the device from a development environment not geared for HPC application tuning.
%B 2012 IEEE High Performance Extreme Computing Conference
%C Waltham, MA
%P 1-6
%8 2012-09
%@ 978-1-4673-1577-7
%G eng
%R 10.1109/HPEC.2012.6408679

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2012
%T Autotuning GEMM Kernels for the Fermi GPU
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%X In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.
%B IEEE Transactions on Parallel and Distributed Systems
%V 23
%8 2012-11
%G eng
%R 10.1109/TPDS.2011.311

%0 Journal Article
%J ICCS 2012
%D 2012
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng

%0 Conference Proceedings
%B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award)
%D 2012
%T A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%E Christos Kaklamanis
%E Theodore Papatheodorou
%E Paul Spirakis
%B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award)
%I Springer-Verlag
%C Rhodes, Greece
%8 2012-08
%G eng

%0 Conference Proceedings
%B Proc. of the International Conference on Computational Science (ICCS)
%D 2012
%T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines
%A Marc Baboulin
%A Simplice Donfack
%A Jack Dongarra
%A Laura Grigori
%A Adrien Remi
%A Stanimire Tomov
%K magma
%B Proc. of the International Conference on Computational Science (ICCS)
%V 9
%P 17-26
%8 2012-06
%G eng

%0 Journal Article
%J IPDPS 2012
%D 2012
%T A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
%A Azzam Haidar
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%B IPDPS 2012
%C Shanghai, China
%8 2012-05
%G eng

%0 Journal Article
%J Parallel Computing
%D 2012
%T DAGuE: A generic distributed DAG Engine for High Performance Computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%K dague
%K parsec
%B Parallel Computing
%I Elsevier
%V 38
%P 27-51
%8 2012-00
%G eng

%0 Journal Article
%J High Performance Scientific Computing: Algorithms and Applications
%D 2012
%T Dense Linear Algebra on Accelerated Multicore Hardware
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%E Michael Berry
%E et al.
%B High Performance Scientific Computing: Algorithms and Applications
%I Springer-Verlag
%C London, UK
%8 2012-00
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2012
%T Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems
%A Christof Voemel
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B SIAM Journal on Scientific Computing
%V 34(2)
%P C70-C82
%8 2012-04
%G eng

%0 Generic
%D 2012
%T Dynamic Task Execution on Shared and Distributed Memory Architectures
%A Asim YarKhan
%X Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. But, standard programming techniques are not well adapted to this change in computer architecture design. In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines. This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution. We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-of-the-art commercial and research libraries. We develop a runtime environment for distributed memory multicore architectures extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment.
%9 Dissertation

%0 Generic
%D 2012
%T An efficient distributed randomized solver with application to large dense linear systems
%A Marc Baboulin
%A Dulceneia Becker
%A George Bosilca
%A Anthony Danalis
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B ICL Technical Report
%8 2012-07
%G eng

%0 Conference Proceedings
%B 26th ACM International Conference on Supercomputing (ICS 2012)
%D 2012
%T Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems
%A Fengguang Song
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B 26th ACM International Conference on Supercomputing (ICS 2012)
%I ACM
%C San Servolo Island, Venice, Italy
%8 2012-06
%G eng

%0 Conference Proceedings
%B 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
%D 2012
%T Enabling Application Resilience With and Without the MPI Standard
%A Wesley Bland
%B 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
%C Ottawa, Canada
%8 2012-05
%G eng

%0 Conference Proceedings
%B The 2nd International Conference on Cloud and Green Computing (submitted)
%D 2012
%T Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture
%A Jack Dongarra
%A Hatem Ltaief
%A Piotr Luszczek
%A Vincent M Weaver
%B The 2nd International Conference on Cloud and Green Computing (submitted)
%C Xiangtan, Hunan, China
%8 2012-11
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2012
%T Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%B Lecture Notes in Computer Science
%V 7203
%P 661-670
%8 2012-09
%G eng

%0 Conference Proceedings
%B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012
%D 2012
%T An Evaluation of User-Level Failure Mitigation Support in MPI
%A Wesley Bland
%A Aurelien Bouteiller
%A Thomas Herault
%A Joshua Hursey
%A George Bosilca
%A Jack Dongarra
%B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012
%I Springer
%C Vienna, Austria
%8 2012-09
%G eng

%0 Generic
%D 2012
%T Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI
%A Wesley Bland
%A Peng Du
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B University of Tennessee Computer Science Technical Report
%8 2012-00
%G eng

%0 Journal Article
%J Parallel Computing
%D 2012
%T From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming
%A Peng Du
%A Rick Weber
%A Piotr Luszczek
%A Stanimire Tomov
%A Gregory D. Peterson
%A Jack Dongarra
%B Parallel Computing
%V 38
%P 391-407
%8 2012-08
%G eng

%0 Conference Paper
%B International European Conference on Parallel and Distributed Computing (Euro-Par '12)
%D 2012
%T From Serial Loops to Parallel Execution on Distributed Systems
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Jack Dongarra
%B International European Conference on Parallel and Distributed Computing (Euro-Par '12)
%C Rhodes, Greece
%8 2012-08
%G eng

%0 Generic
%D 2012
%T The Future of Computing: Software Libraries
%A Stanimire Tomov
%A Jack Dongarra
%I DOD CREATE Developers' Review, Keynote Presentation
%C Savannah, GA
%8 2012-02
%G eng

%0 Journal Article
%J EuroPar 2012 (also LAWN 260)
%D 2012
%T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%A Vincent Heuveline
%B EuroPar 2012 (also LAWN 260)
%C Rhodes Island, Greece
%8 2012-08
%G eng

%0 Conference Proceedings
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%D 2012
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%I IEEE Computer Society Press
%C Shanghai, China
%8 2012-05
%G eng

%0 Journal Article
%J IPDPS 2012 (Best Paper)
%D 2012
%T HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters
%A Teng Ma
%A George Bosilca
%A Aurelien Bouteiller
%A Jack Dongarra
%B IPDPS 2012 (Best Paper)
%C Shanghai, China
%8 2012-05
%G eng

%0 Journal Article
%J Acta Numerica
%D 2012
%T High Performance Computing Systems: Status and Outlook
%A Jack Dongarra
%A Aad J. van der Steen
%B Acta Numerica
%I Cambridge University Press
%C Cambridge, UK
%V 21
%P 379-474
%8 2012-05
%G eng

%0 Journal Article
%J ICCS 2012
%D 2012
%T High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng

%0 Generic
%D 2012
%T How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE
%A Julien Langou
%A Bill Hoffman
%A Brad King
%B University of Tennessee Computer Science Technical Report (also LAWN 270)
%8 2012-07
%G eng

%0 Journal Article
%J On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%D 2012
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%E Jeffrey Vetter
%B On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%I Chapman & Hall/CRC Press
%8 2012-00
%G eng

%0 Journal Article
%J Applied Parallel and Scientific Computing
%D 2012
%T An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs
%A Jakub Kurzak
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%E Kristján Jónasson
%B Applied Parallel and Scientific Computing
%V 7133
%P 248-257
%8 2012-00
%G eng

%0 Journal Article
%J Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%D 2012
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E Viktor K. Prasanna
%E Yves Robert
%E Per Stenström
%B Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%8 2012-00
%G eng

%0 Generic
%D 2012
%T MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
%A Stanimire Tomov
%A Jack Dongarra
%A Azzam Haidar
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Thomas Schulthess
%A Raffaele Solcà
%I GPU Technology Conference (GTC12), Presentation
%C San Jose, CA
%8 2012-05
%G eng

%0 Generic
%D 2012
%T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
%A Jack Dongarra
%A Tingxing Dong
%A Mark Gates
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation
%C Salt Lake City, UT
%8 2012-11
%G eng

%0 Generic
%D 2012
%T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
%A Jack Dongarra
%A Mark Gates
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12)
%C Salt Lake City, UT
%8 2012-11
%G eng

%0 Generic
%D 2012
%T MAGMA Tutorial
%A Mark Gates
%I Keeneland Workshop
%C Atlanta, GA
%8 2012-02
%G eng

%0 Journal Article
%J Supercomputing '12 (poster)
%D 2012
%T Matrices Over Runtime Systems at Exascale
%A Emmanuel Agullo
%A George Bosilca
%A Cedric Castagnède
%A Jack Dongarra
%A Hatem Ltaief
%A Stanimire Tomov
%B Supercomputing '12 (poster)
%C Salt Lake City, Utah
%8 2012-11
%G eng

%0 Conference Proceedings
%B International Workshop on Power-Aware Systems and Architectures
%D 2012
%T Measuring Energy and Power with PAPI
%A Vincent M Weaver
%A Matt Johnson
%A Kiran Kasichayanula
%A James Ralph
%A Piotr Luszczek
%A Dan Terpstra
%A Shirley Moore
%K papi
%X Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure.
%B International Workshop on Power-Aware Systems and Architectures
%C Pittsburgh, PA
%8 2012-09
%G eng
%R 10.1109/ICPPW.2012.39

%0 Journal Article
%J Supercomputing '12 (poster)
%D 2012
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Raffaele Solcà
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%B Supercomputing '12 (poster)
%C Salt Lake City, Utah
%8 2012-11
%G eng

%0 Conference Proceedings
%B The International Conference on Computational Science (ICCS)
%D 2012
%T One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B The International Conference on Computational Science (ICCS)
%8 2012-06
%G eng

%0 Journal Article
%J VECPAR 2012
%D 2012
%T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
%A Ahmad Abdelfattah
%A Jack Dongarra
%A David Keyes
%A Hatem Ltaief
%B VECPAR 2012
%C Kobe, Japan
%8 2012-07
%G eng

%0 Journal Article
%J CloudTech-HPC 2012
%D 2012
%T PAPI-V: Performance Monitoring for Virtual Machines
%A Matt Johnson
%A Heike McCraw
%A Shirley Moore
%A Phil Mucci
%A John Nelson
%A Dan Terpstra
%A Vincent M Weaver
%A Tushar Mohan
%K papi
%X This paper describes extensions to the PAPI hardware counter library for virtual environments, called PAPI-V. The extensions support timing routines, I/O measurements, and processor counters. The PAPI-V extensions will allow application and tool developers to use a familiar interface to obtain relevant hardware performance monitoring information in virtual environments.
%B CloudTech-HPC 2012
%C Pittsburgh, PA
%8 2012-09
%G eng
%R 10.1109/ICPPW.2012.29

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2012
%T Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011
%E Roman Wyrzykowski
%E Jack Dongarra
%E Konrad Karczewski
%E Jerzy Wasniewski
%B Lecture Notes in Computer Science
%C Torun, Poland
%V 7203
%8 2012-00
%G eng

%0 Journal Article
%J IPDPS 2012
%D 2012
%T A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures
%A Marc Baboulin
%A Dulceneia Becker
%A Jack Dongarra
%B IPDPS 2012
%C Shanghai, China
%8 2012-05
%G eng

%0 Generic
%D 2012
%T Performance Counter Monitoring for the Blue Gene/Q Architecture
%A Heike McCraw
%K papi
%B University of Tennessee Computer Science Technical Report
%8 2012-00
%G eng

%0 Generic
%D 2012
%T Performance evaluation of LU factorization through hardware counter measurements
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2012-10
%G eng

%0 Journal Article
%J SAAHPC '12 (Best Paper Award)
%D 2012
%T Power Aware Computing on GPUs
%A Kiran Kasichayanula
%A Dan Terpstra
%A Piotr Luszczek
%A Stanimire Tomov
%A Shirley Moore
%A Gregory D. Peterson
%K magma
%B SAAHPC '12 (Best Paper Award)
%C Argonne, IL
%8 2012-07
%G eng

%0 Conference Proceedings
%B Third International Conference on Energy-Aware High Performance Computing
%D 2012
%T Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
%A George Bosilca
%A Jack Dongarra
%A Hatem Ltaief
%B Third International Conference on Energy-Aware High Performance Computing
%C Hamburg, Germany
%8 2012-09
%G eng

%0 Journal Article
%J LAWN 267
%D 2012
%T Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B LAWN 267
%8 2012-00
%G eng

%0 Conference Proceedings
%B Proceedings of VECPAR’12
%D 2012
%T Programming the LU Factorization for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Mathieu Faverge
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of VECPAR’12
%C Kobe, Japan
%8 2012-04
%G eng

%0 Generic
%D 2012
%T A Proposal for User-Level Failure Mitigation in the MPI-3 Standard
%A Wesley Bland
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Jack Dongarra
%K ftmpi
%B University of Tennessee Electrical Engineering and Computer Science Technical Report
%I University of Tennessee
%8 2012-02
%G eng

%0 Generic
%D 2012
%T Providing GPU Capability to LU and QR within the ScaLAPACK Framework
%A Peng Du
%A Stanimire Tomov
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 272)
%8 2012-09
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2012
%T Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012
%E Jesper Larsson Träff
%E Siegfried Benkner
%E Jack Dongarra
%B Lecture Notes in Computer Science
%C Vienna, Austria
%V 7490
%8 2012-00
%G eng

%0 Journal Article
%J Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011)
%D 2012
%T Reducing the Amount of Pivoting in Symmetric Indefinite Systems
%A Dulceneia Becker
%A Marc Baboulin
%A Jack Dongarra
%E Roman Wyrzykowski
%E Jack Dongarra
%E Konrad Karczewski
%E Jerzy Wasniewski
%B Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011)
%I Springer-Verlag Berlin Heidelberg
%V 7203
%P 133-142
%8 2012-00
%G eng

%0 Conference Proceedings
%B The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012)
%D 2012
%T A Scalable Framework for Heterogeneous GPU-Based Clusters
%A Fengguang Song
%A Jack Dongarra
%K magma
%B The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012)
%I ACM
%C Pittsburgh, PA, USA
%8 2012-06
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing (Accepted)
%D 2012
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%B SIAM Journal on Scientific Computing (Accepted)
%8 2012-07
%G eng

%0 Generic
%D 2012
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%B University of Tennessee Computer Science Technical Report (also LAWN 269)
%8 2012-06
%G eng

%0 Conference Proceedings
%B Euro-Par 2012: Parallel Processing Workshops
%D 2012
%T User Level Failure Mitigation in MPI
%A Wesley Bland
%E Ioannis Caragiannis
%E Michael Alexander
%E Rosa M. Badia
%E Mario Cannataro
%E Alexandru Costan
%E Marco Danelutto
%E Frederic Desprez
%E Bettina Krammer
%E J. Sahuquillo
%E Stephen L. Scott
%E J. Weidendorfer
%K ftmpi
%B Euro-Par 2012: Parallel Processing Workshops
%I Springer Berlin Heidelberg
%C Rhodes Island, Greece
%V 7640
%P 499-504
%8 2012-08
%G eng

%0 Conference Proceedings
%B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper)
%D 2012
%T Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper)
%C Rhodes Island, Greece
%8 2012-08
%G eng

%0 Journal Article
%J SIAM Journal on Computing (submitted)
%D 2012
%T Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems
%A Hartwig Anzt
%A Jack Dongarra
%A Vincent Heuveline
%B SIAM Journal on Computing (submitted)
%8 2012-03
%G eng

%0 Conference Proceedings
%B 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May
%D 2011
%T 3-D parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver
%A Azzam Haidar
%A Luc Giraud
%A Hafedh Ben-Hadj-Ali
%A Florent Sourbier
%A Stéphane Operto
%A Jean Virieux
%B 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May
%8 2011-00
%G eng

%0 Journal Article
%J INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11)
%D 2011
%T Accelerating Linear System Solutions Using Randomization Techniques
%A Marc Baboulin
%A Jack Dongarra
%A Julien Herrmann
%A Stanimire Tomov
%K magma
%B INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11)
%C Waterloo, Ontario, Canada
%8 2011-07
%G eng

%0 Generic
%D 2011
%T Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report (also as a LAWN)
%8 2011-09
%G eng

%0 Conference Proceedings
%B The Twentieth International Conference on Domain Decomposition Methods
%D 2011
%T Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Stephane Lanteri
%A Jean Roman
%B The Twentieth International Conference on Domain Decomposition Methods
%C La Jolla, California
%8 2011-02
%G eng
%U http://hal.inria.fr/inria-00577639

%0 Generic
%D 2011
%T Algorithm-based Fault Tolerance for Dense Matrix Factorizations
%A Peng Du
%A Aurelien Bouteiller
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%K ft-la
%B University of Tennessee Computer Science Technical Report
%C Knoxville, TN
%8 2011-08
%G eng

%0 Generic
%D 2011
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaief
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243)
%8 2011-03
%G eng

%0 Journal Article
%J TeraGrid'11
%D 2011
%T Autotuned Parallel I/O for Highly Scalable Biosequence Analysis
%A Haihang You
%A Bhanu Rekapalli
%A Qing Liu
%A Shirley Moore
%B TeraGrid'11
%C Salt Lake City, Utah
%8 2011-07
%G eng

%0 Generic
%D 2011
%T Autotuning GEMMs for Fermi
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245)
%8 2011-04
%G eng

%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results
%A Anthony Danalis
%A Piotr Luszczek
%A Gabriel Marin
%A Jeffrey Vetter
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 2011-05
%G eng

%0 Journal Article
%D 2011
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%8 2011-12
%G eng

%0 Generic
%D 2011
%T A Block-Asynchronous Relaxation Method for Graphics Processing Units
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%B University of Tennessee Computer Science Technical Report
%8 2011-11
%G eng

%0 Journal Article
%J in Solving the Schrödinger Equation: Has everything been tried? (to appear)
%D 2011
%T Changes in Dense Linear Algebra Kernels - Decades Long Perspective
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E P. Popelier
%B in Solving the Schrödinger Equation: Has everything been tried? (to appear)
%I Imperial College Press
%8 2011-00
%G eng

%0 Conference Proceedings
%B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11)
%D 2011
%T A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
%A Mitch Horton
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%K quark
%B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11)
%C Knoxville, TN
%8 2011-07
%G eng

%0 Conference Proceedings
%B Proceedings of 17th International Conference, Euro-Par 2011, Part II
%D 2011
%T Correlated Set Coordination in Fault Tolerant Message Logging Protocols
%A Aurelien Bouteiller
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%E Emmanuel Jeannot
%E Raymond Namyst
%E Jean Roman
%K ftmpi
%B Proceedings of 17th International Conference, Euro-Par 2011, Part II
%I Springer
%C Bordeaux, France
%V 6853
%P 51-64
%8 2011-08
%G eng

%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T DAGuE: A Generic Distributed DAG Engine for High Performance Computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%K dague
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1151-1158
%8 2011-00
%G eng

%0 Conference Proceedings
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%D 2011
%T The Design of an Auto-tuning I/O Framework on Cray XT5 System
%A Haihang You
%A Qing Liu
%A Zhiqiang Li
%A Shirley Moore
%K gco
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%C Fairbanks, Alaska
%8 2011-05
%G eng

%0 Generic
%D 2011
%T Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures
%A Fengguang Song
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250)
%8 2011-06
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2011
%T Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems
%A Charles Lively
%A Xingfu Wu
%A Valerie Taylor
%A Shirley Moore
%A Hung-Ching Chang
%A Kirk Cameron
%K mumi
%B International Journal of High Performance Computing Applications
%V 25
%P 342-350
%8 2011-00
%G eng

%0 Conference Proceedings
%B 6th Workshop on Virtualization in High-Performance Cloud Computing
%D 2011
%T Evaluation of the HPC Challenge Benchmarks in Virtualized Environments
%A Piotr Luszczek
%A Eric Meek
%A Shirley Moore
%A Dan Terpstra
%A Vincent M Weaver
%A Jack Dongarra
%K hpcc
%B 6th Workshop on Virtualization in High-Performance Cloud Computing
%C Bordeaux, France
%8 2011-08
%G eng

%0 Conference Proceedings
%B Proceedings of PARCO'11
%D 2011
%T Exploiting Fine-Grain Parallelism in Recursive LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%K plasma
%B Proceedings of PARCO'11
%C Gent, Belgium
%8 2011-04
%G eng

%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 2011-05
%G eng

%0 Generic
%D 2011
%T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%B University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260)
%8 2011-12
%G eng

%0 Generic
%D 2011
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report (also Lawn 257)
%8 2011-10
%G eng

%0 Generic
%D 2011
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247)
%8 2011-05
%G eng

%0 Journal Article
%J IEEE Cluster 2011
%D 2011
%T High Performance Dense Linear System Solver with Soft Error Resilience
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%K ft-la
%B IEEE Cluster 2011
%C Austin, TX
%8 2011-09
%G eng

%0 Conference Proceedings
%B Proceedings of MTAGS11
%D 2011
%T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%B Proceedings of MTAGS11
%C Seattle, WA
%8 2011-11
%G eng

%0 Journal Article
%J Journal of Computational Physics
%D 2011
%T High-Performance High-Resolution Semi-Lagrangian Tracer Transport on a Sphere
%A James B. White
%A Jack Dongarra
%K cubed sphere
%K high resolution
%K High-performance computing
%K semi-Lagrangian
%K spherical geometry
%K tracer transport
%X Current climate models have a limited ability to increase spatial resolution because numerical stability requires the time step to decrease. We describe a semi-Lagrangian method for tracer transport that is stable for arbitrary Courant numbers, and we test a parallel implementation discretized on the cubed sphere. The method includes a fixer that conserves mass and constrains tracers to a physical range of values. The method shows third-order convergence and maintains nonlinear tracer correlations to second order. It shows optimal accuracy at Courant numbers of 10–20, more than an order of magnitude higher than explicit methods. We present parallel performance in terms of strong scaling, weak scaling, and spatial scaling (where the time step stays constant while the resolution increases). For a 0.2° test with 100 tracers, the implementation scales efficiently to 10,000 MPI tasks.
%B Journal of Computational Physics
%V 230
%P 6778-6799
%8 2011-07
%G eng
%N 17
%R https://doi.org/10.1016/j.jcp.2011.05.008

%0 Journal Article
%J in GPU Computing Gems, Jade Edition
%D 2011
%T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaief
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%E Wen-mei W. Hwu
%K magma
%K morse
%B in GPU Computing Gems, Jade Edition
%I Elsevier
%V 2
%P 473-484
%8 2011-00
%G eng

%0 Journal Article
%J 18th EuroMPI
%D 2011
%T Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW
%A Teng Ma
%A Aurelien Bouteiller
%A George Bosilca
%A Jack Dongarra
%E Yiannis Cotronis
%E Anthony Danalis
%E Dimitrios S. Nikolopoulos
%E Jack Dongarra
%K dague
%B 18th EuroMPI
%I Springer
%C Santorini, Greece
%P 247-254
%8 2011-09
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2011
%T The International Exascale Software Project Roadmap
%A Jack Dongarra
%A Pete Beckman
%A Terry Moore
%A Patrick Aerts
%A Giovanni Aloisio
%A Jean-Claude Andre
%A David Barkai
%A Jean-Yves Berthou
%A Taisuke Boku
%A Bertrand Braunschweig
%A Franck Cappello
%A Barbara Chapman
%A Xuebin Chi
%A Alok Choudhary
%A Sudip Dosanjh
%A Thom Dunning
%A Sandro Fiore
%A Al Geist
%A Bill Gropp
%A Robert Harrison
%A Mark Hereld
%A Michael Heroux
%A Adolfy Hoisie
%A Koh Hotta
%A Zhong Jin
%A Yutaka Ishikawa
%A Fred Johnson
%A Sanjay Kale
%A Richard Kenway
%A David Keyes
%A Bill Kramer
%A Jesus Labarta
%A Alain Lichnewsky
%A Thomas Lippert
%A Bob Lucas
%A Barney MacCabe
%A Satoshi Matsuoka
%A Paul Messina
%A Peter Michielse
%A Bernd Mohr
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%A Hiroshi Nakashima
%A Michael E. Papka
%A Dan Reed
%A Mitsuhisa Sato
%A Ed Seidel
%A John Shalf
%A David Skinner
%A Marc Snir
%A Thomas Sterling
%A Rick Stevens
%A Fred Streitz
%A Bob Sugar
%A Shinji Sumimoto
%A William Tang
%A John Taylor
%A Rajeev Thakur
%A Anne Trefethen
%A Mateo Valero
%A Aad van der Steen
%A Jeffrey Vetter
%A Peg Williams
%A Robert Wisniewski
%A Kathy Yelick
%X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
%B International Journal of High Performance Computing Applications
%V 25
%P 3-60
%8 2011-01
%G eng
%R https://doi.org/10.1177/1094342010391989

%0 Journal Article
%J IEEE Computing in Science & Engineering
%D 2011
%T Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community
%A Jeffrey Vetter
%A Richard Glassbrook
%A Jack Dongarra
%A Karsten Schwan
%A Bruce Loftis
%A Stephen McNally
%A Jeremy Meredith
%A James Rogers
%A Philip Roth
%A Kyle Spafford
%A Sudhakar Yalamanchili
%K Benchmark testing
%K Computational modeling
%K Computer architecture
%K Graphics processing unit
%K Hardware
%K Random access memory
%K Scientific computing
%X The Keeneland project's goal is to develop and deploy an innovative, GPU-based high-performance computing system for the NSF computational science community.
%B IEEE Computing in Science & Engineering
%V 13
%P 90-95
%8 2011-08
%G eng
%N 5
%R https://doi.org/10.1109/MCSE.2011.83

%0 Conference Proceedings
%B Int'l Conference on Parallel Processing (ICPP '11)
%D 2011
%T Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs
%A Teng Ma
%A George Bosilca
%A Aurelien Bouteiller
%A Brice Goglin
%A J. Squyres
%A Jack Dongarra
%B Int'l Conference on Parallel Processing (ICPP '11)
%C Taipei, Taiwan
%8 2011-09
%G eng

%0 Journal Article
%J IEEE/ACS AICCSA 2011
%D 2011
%T LU Factorization for Accelerator-Based Systems
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Julien Langou
%A Hatem Ltaief
%A Stanimire Tomov
%K magma
%K morse
%B IEEE/ACS AICCSA 2011
%C Sharm-El-Sheikh, Egypt
%8 2011-12
%G eng

%0 Generic
%D 2011
%T MAGMA - LAPACK for GPUs
%A Stanimire Tomov
%I Keeneland GPU Tutorial
%C Atlanta, GA
%8 2011-04
%G eng

%0 Generic
%D 2011
%T MAGMA - LAPACK for HPC on Heterogeneous Architectures
%A Stanimire Tomov
%A Jack Dongarra
%I Titan Summit at Oak Ridge National Laboratory, Presentation
%C Oak Ridge, TN
%8 2011-08
%G eng

%0 Generic
%D 2011
%T Matrix Algebra on GPU and Multicore Architectures
%A Stanimire Tomov
%I Workshop on GPU-enabled Numerical Libraries, Presentation
%C Basel, Switzerland
%8 2011-05
%G eng

%0 Journal Article
%J 18th EuroMPI
%D 2011
%T OMPIO: A Modular Software Architecture for MPI I/O
%A Mohamad Chaarawi
%A Edgar Gabriel
%A Rainer Keller
%A Richard L. Graham
%A George Bosilca
%A Jack Dongarra
%E Yiannis Cotronis
%E Anthony Danalis
%E Dimitrios S. Nikolopoulos
%E Jack Dongarra
%B 18th EuroMPI
%I Springer
%C Santorini, Greece
%P 81-89
%8 2011-09
%G eng

%0 Conference Proceedings
%B Parallel Tools Workshop
%D 2011
%T An open-source tool-chain for performance analysis
%A Kevin Coulomb
%A Augustin Degomme
%A Mathieu Faverge
%A Francois Trahay
%B Parallel Tools Workshop
%C Dresden, Germany
%8 2011-09
%G eng

%0 Conference Proceedings
%B ACM/IEEE Conference on Supercomputing (SC’11)
%D 2011
%T Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Tingxing Dong
%A Jack Dongarra
%K magma
%B ACM/IEEE Conference on Supercomputing (SC’11)
%C Seattle, WA
%8 2011-11
%G eng

%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T Overlapping Computation and Communication for Advection on a Hybrid Parallel Computer
%A James B White
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 2011-05
%G eng

%0 Journal Article
%J Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April
%D 2011
%T Parallel algebraic domain decomposition solver for the solution of augmented systems
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%B Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April
%8 2011-00
%G eng

%0 Conference Paper
%B International Conference on Parallel Processing (ICPP'11)
%D 2011
%T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
%A Allen D. Malony
%A Scott Biersdorff
%A Sameer Shende
%A Heike Jagode
%A Stanimire Tomov
%A Guido Juckeland
%A Robert Dietrich
%A Duncan Poole
%A Christopher Lamb
%K magma
%K mumi
%K papi
%X The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support.
%B International Conference on Parallel Processing (ICPP'11)
%I ACM
%C Taipei, Taiwan
%8 2011-09
%@ 978-0-7695-4510-3
%G eng
%R 10.1109/ICPP.2011.71

%0 Conference Proceedings
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%C Seattle, WA
%8 2011-11
%G eng

%0 Generic
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-11-677, (also Lawn 254)
%8 2011-08
%G eng

%0 Generic
%D 2011
%T A parallel tiled solver for dense symmetric indefinite systems on multicore architectures
%A Marc Baboulin
%A Dulceneia Becker
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report
%8 2011-10
%G eng

%0 Generic
%D 2011
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2011-00
%G eng

%0 Journal Article
%J IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC)
%D 2011
%T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemarinier
%A Narapat Ohm Saengpatsa
%A Stanimire Tomov
%A Jack Dongarra
%K dague
%K magma
%K parsec
%B IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC)
%8 2011-06
%G eng

%0 Generic
%D 2011
%T Power-aware Computing on GPGPUs
%A Kiran Kasichayanula
%A Haihang You
%A Shirley Moore
%A Stanimire Tomov
%A Heike Jagode
%A Matt Johnson
%I Fall Creek Falls Conference, Poster
%C Gatlinburg, TN
%8 2011-09
%G eng

%0 Conference Proceedings
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%D 2011
%T Power-Aware Prediction Models of Hybrid (MPI/OpenMP) Scientific Applications
%A Charles Lively
%A Xingfu Wu
%A Valerie Taylor
%A Shirley Moore
%A Hung-Ching Chang
%A Chun-Yi Su
%A Kirk Cameron
%K mumi
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%C Hamburg, Germany
%8 2011-09
%G eng

%0 Conference Proceedings
%B IEEE Int'l Conference on Cluster Computing (Cluster 2011)
%D 2011
%T Process Distance-aware Adaptive MPI Collective Communications
%A Teng Ma
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%B IEEE Int'l Conference on Cluster Computing (Cluster 2011)
%C Austin, Texas
%8 2011-00
%G eng

%0 Conference Proceedings
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%D 2011
%T Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K mumi
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%C Hamburg, Germany
%8 2011-09
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 2011
%T QCG-OMPI: MPI Applications on Grids
%A Emmanuel Agullo
%A Camille Coti
%A Thomas Herault
%A Julien Langou
%A Sylvain Peyronnet
%A A. Rezmerita
%A Franck Cappello
%A Jack Dongarra
%B Future Generation Computer Systems
%V 27
%P 357-369
%8 2011-01
%G eng

%0 Generic
%D 2011
%T QUARK Users' Guide: QUeueing And Runtime for Kernels
%A Asim YarKhan
%A Jakub Kurzak
%A Jack Dongarra
%K magma
%K plasma
%K quark
%B University of Tennessee Innovative Computing Laboratory Technical Report
%8 2011-00
%G eng

%0 Generic
%D 2011
%T Reducing the Amount of Pivoting in Symmetric Indefinite Systems
%A Dulceneia Becker
%A Marc Baboulin
%A Jack Dongarra
%B University of Tennessee Innovative Computing Laboratory Technical Report
%I Submitted to PPAM 2011
%C Knoxville, TN
%8 2011-05
%G eng

%0 Conference Proceedings
%B International Conference on Cluster Computing (CLUSTER)
%D 2011
%T On Scalability for MPI Runtime Systems
%A George Bosilca
%A Thomas Herault
%A A. Rezmerita
%A Jack Dongarra
%K harness
%B International Conference on Cluster Computing (CLUSTER)
%I IEEE
%C Austin, TX, USA
%P 187-195
%8 2011-09
%G eng

%0 Generic
%D 2011
%T On Scalability for MPI Runtime Systems
%A George Bosilca
%A Thomas Herault
%A A. Rezmerita
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%C Knoxville, TN
%8 2011-05
%G eng

%0 Conference Proceedings
%B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011
%D 2011
%T Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure
%A George Bosilca
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%A A. Rezmerita
%E Yiannis Cotronis
%E Anthony Danalis
%E Dimitrios S. Nikolopoulos
%E Jack Dongarra
%K ftmpi
%B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011
%I Springer
%C Santorini, Greece
%V 6960
%P 342-344
%8 2011-09
%G eng

%0 Generic
%D 2011
%T Soft Error Resilient QR Factorization for Hybrid System
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K ft-la
%B University of Tennessee Computer Science Technical Report
%C Knoxville, TN
%8 2011-07
%G eng

%0 Journal Article
%J UT-CS-11-675 (also LAPACK Working Note #252)
%D 2011
%T Soft Error Resilient QR Factorization for Hybrid System
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B UT-CS-11-675 (also LAPACK Working Note #252)
%8 2011-07
%G eng

%0 Journal Article
%J Journal of Computational Science
%D 2011
%T Soft Error Resilient QR Factorization for Hybrid System with GPGPU
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K ft-la
%B Journal of Computational Science
%I Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11
%C Seattle, WA
%8 2011-11
%G eng

%0 Journal Article
%J To appear in Geophysical Prospecting journal.
%D 2011
%T Three-dimensional parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver
%A Florent Sourbier
%A Azzam Haidar
%A Luc Giraud
%A Hafedh Ben-Hadj-Ali
%A Stéphane Operto
%A Jean Virieux
%B To appear in Geophysical Prospecting journal.
%8 2011-00
%G eng

%0 Journal Article
%J Submitted to SIAM Journal on Scientific Computing (SISC)
%D 2011
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%B Submitted to SIAM Journal on Scientific Computing (SISC)
%8 2011-00
%G eng

%0 Generic
%D 2011
%T Towards a Parallel Tile LDL Factorization for Multicore Architectures
%A Dulceneia Becker
%A Mathieu Faverge
%A Jack Dongarra
%K plasma
%K quark
%B ICL Technical Report
%C Seattle, WA
%8 2011-04
%G eng

%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures
%A Piotr Luszczek
%A Hatem Ltaief
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 2011-05
%G eng

%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemarinier
%A Narapat Ohm Saengpatsa
%A Stanimire Tomov
%A Jack Dongarra
%K dague
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 2011-05
%G eng

%0 Journal Article
%J Procedia Computer Science
%D 2011
%T User-Defined Events for Hardware Performance Monitoring
%A Shirley Moore
%A James Ralph
%K mumi
%K papi
%X PAPI is a widely used cross-platform interface to hardware performance counters. PAPI currently supports native events, which are those provided by a given platform, and preset events, which are pre-defined events thought to be common across platforms. Presets are currently mapped and defined at the time that PAPI is compiled and installed. The idea of user-defined events is to allow users to define their own metrics and to have those metrics mapped to events on a platform without the need to re-install PAPI. User-defined events can be defined in terms of native, preset, and previously defined user-defined events. The user can combine events and constants in an arbitrary expression to define a new metric and give a name to the new metric. This name can then be specified as a PAPI event in a PAPI library call the same way as native and preset events. End-user tools such as TAU and Scalasca that use PAPI can also use the user-defined metrics. Users can publish their metric definitions so that other users can use them as well. We present several examples of how user-defined events can be used for performance analysis and modeling.
%B Procedia Computer Science
%I Elsevier
%V 4
%P 2096-2104
%8 2011-05
%G eng
%R https://doi.org/10.1016/j.procs.2011.04.229

%0 Conference Proceedings
%B PPAM 2009 Proceedings
%D 2010
%T 8th International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (LNCS)
%E Roman Wyrzykowski
%E Jack Dongarra
%E Konrad Karczewski
%E Jerzy Wasniewski
%B PPAM 2009 Proceedings
%I Springer
%C Wroclaw, Poland
%V 6067
%8 2010-09
%G eng

%0 Journal Article
%J Proc. of VECPAR'10
%D 2010
%T Accelerating GPU Kernels for Dense Linear Algebra
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B Proc. of VECPAR'10
%C Berkeley, CA
%8 2010-06
%G eng

%0 Generic
%D 2010
%T Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
%A Stanimire Tomov
%A George Bosilca
%A Cedric Augonnet
%I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Tutorial
%8 2010-07
%G eng

%0 Journal Article
%J Parallel Computing
%D 2010
%T Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%B Parallel Computing
%V 36
%P 645-654
%8 2010-00
%G eng

%0 Journal Article
%J Submitted to Concurrency and Computation: Practice and Experience
%D 2010
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaief
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B Submitted to Concurrency and Computation: Practice and Experience
%8 2010-11
%G eng

%0 Generic
%D 2010
%T Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess
%A Piotr Luszczek
%A Jack Dongarra
%K hpcc
%B Innovative Computing Laboratory (ICL) Technical Report
%8 2010-06
%G eng

%0 Generic
%D 2010
%T Autotuning Dense Linear Algebra Libraries on GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Emmanuel Agullo
%A Jack Dongarra
%I Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010)
%C Basel, Switzerland
%8 2010-06
%G eng

%0 Book Section
%B Scientific Computing with Multicore and Accelerators
%D 2010
%T BLAS for GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%B Scientific Computing with Multicore and Accelerators
%S Chapman & Hall/CRC Computational Science
%I CRC Press
%C Boca Raton, Florida
%@ 9781439825365
%G eng
%& 4

%0 Conference Proceedings
%B 3rd Workshop on Functionality of Hardware Performance Monitoring
%D 2010
%T Can Hardware Performance Counters Produce Expected, Deterministic Results?
%A Vincent M Weaver
%A Jack Dongarra
%K papi
%B 3rd Workshop on Functionality of Hardware Performance Monitoring
%C Atlanta, GA
%8 2010-12
%G eng

%0 Journal Article
%J Parallel Computing (to appear)
%D 2010
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%B Parallel Computing (to appear)
%8 2010-00
%G eng

%0 Journal Article
%J Tools for High Performance Computing 2009
%D 2010
%T Collecting Performance Data with PAPI-C
%A Dan Terpstra
%A Heike Jagode
%A Haihang You
%A Jack Dongarra
%K mumi
%K papi
%X Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface.
%B Tools for High Performance Computing 2009
%I Springer Berlin / Heidelberg
%C 3rd Parallel Tools Workshop, Dresden, Germany
%P 157-173
%8 2010-05
%G eng
%R https://doi.org/10.1007/978-3-642-11261-4_11

%0 Journal Article
%J Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale
%D 2010
%T Constructing Resilient Communication Infrastructure for Runtime Environments
%A George Bosilca
%A Camille Coti
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%E Barbara Chapman
%E Frederic Desprez
%E Gerhard R. Joubert
%E Alain Lichnewsky
%E Frans Peters
%E T. Priol
%B Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale
%V 19
%P 441-451
%G eng
%R 10.3233/978-1-60750-530-3-441

%0 Generic
%D 2010
%T DAGuE: A generic distributed DAG engine for high performance computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%K dague
%B Innovative Computing Laboratory Technical Report
%8 2010-04
%G eng

%0 Book Section
%B Scientific Computing with Multicore and Accelerators
%D 2010
%T Dense Linear Algebra for Hybrid GPU-based Systems
%A Stanimire Tomov
%A Jack Dongarra
%B Scientific Computing with Multicore and Accelerators
%S Chapman & Hall/CRC Computational Science
%I CRC Press
%C Boca Raton, Florida
%@ 9781439825365
%G eng
%& 3

%0 Conference Proceedings
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%A Rajib Nath
%A Hatem Ltaief
%A Jack Dongarra
%X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%C Atlanta, GA
%P 1-8
%G eng
%R 10.1109/IPDPSW.2010.5470941

%0 Generic
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%I International Parallel and Distributed Processing Symposium (IPDPS 2010)
%C Atlanta, GA
%8 2010-04
%G eng

%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 2010-09
%G eng

%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 2010-00
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing (submitted)
%D 2010
%T Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems
%A Christof Voemel
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B SIAM Journal on Scientific Computing (submitted)
%8 2010-08
%G eng

%0 Conference Proceedings
%B Proceedings of EuroMPI 2010
%D 2010
%T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%E Jack Dongarra
%E Michael Resch
%E Rainer Keller
%E Edgar Gabriel
%K ftmpi
%B Proceedings of EuroMPI 2010
%I Springer
%C Stuttgart, Germany
%8 2010-09
%G eng

%0 Journal Article
%J in Performance Tuning of Scientific Applications (to appear)
%D 2010
%T Empirical Performance Tuning of Dense Linear Algebra Software
%A Jack Dongarra
%A Shirley Moore
%E David Bailey
%E Robert Lucas
%E Sam Williams
%B in Performance Tuning of Scientific Applications (to appear)
%8 2010-00
%G eng

%0 Generic
%D 2010
%T EZTrace: a generic framework for performance analysis
%A Jack Dongarra
%A Mathieu Faverge
%A Yutaka Ishikawa
%A Raymond Namyst
%A François Rue
%A Francois Trahay
%B ICL Technical Report
%8 2010-12
%G eng

%0 Generic
%D 2010
%T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaief
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%B LAPACK Working Note
%8 2010-00
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (submitted)
%D 2010
%T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators
%A Hatem Ltaief
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%K plasma
%B IEEE Transactions on Parallel and Distributed Systems (submitted)
%8 2010-03
%G eng

%0 Generic
%D 2010
%T An Improved MAGMA GEMM for Fermi GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B University of Tennessee Computer Science Technical Report
%8 2010-07
%G eng

%0 Journal Article
%J International Journal of High Performance Computing
%D 2010
%T An Improved MAGMA GEMM for Fermi GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B International Journal of High Performance Computing
%V 24
%P 511-515
%8 2010-00
%G eng

%0 Conference Proceedings
%B Proceedings of International Conference on Computational Science, ICCS 2010 (to appear)
%D 2010
%T Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI
%A Volodymyr Turchenko
%A Lucio Grandinetti
%A George Bosilca
%A Jack Dongarra
%K hpcchallenge
%B Proceedings of International Conference on Computational Science, ICCS 2010 (to appear)
%I Elsevier
%C Amsterdam, The Netherlands
%8 2010-06
%G eng

%0 Journal Article
%J VECPAR 2010, 9th International Meeting on High Performance Computing for Computational Science
%D 2010
%T Intelligent Service Trading and Brokering for Distributed Network Services in GridSolve
%A Aurélie Hurault
%A Asim YarKhan
%K gridpac
%K netsolve
%B VECPAR 2010, 9th International Meeting on High Performance Computing for Computational Science
%C Berkeley, CA
%8 2010-06
%G eng

%0 Generic
%D 2010
%T International Exascale Software Project Roadmap v1.0
%A Jack Dongarra
%A Pete Beckman
%B University of Tennessee Computer Science Technical Report, UT-CS-10-654
%8 2010-05
%G eng

%0 Generic
%D 2010
%T An Introduction to the MAGMA project - Acceleration of Dense Linear Algebra
%A Jack Dongarra
%A Stanimire Tomov
%I NVIDIA Webinar
%8 2010-06
%G eng
%U http://developer.download.nvidia.com/CUDA/training/introtomagma.mp4

%0 Generic
%D 2010
%T Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs
%A Teng Ma
%A George Bosilca
%A Aurelien Bouteiller
%A Brice Goglin
%A J. Squyres
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-10-663
%8 2010-11
%G eng

%0 Journal Article
%J ACM TOMS (submitted), also LAPACK Working Note (LAWN) 211
%D 2010
%T Level-3 Cholesky Kernel Subroutine of a Fully Portable High Performance Minimal Storage Hybrid Format Cholesky Algorithm
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%B ACM TOMS (submitted), also LAPACK Working Note (LAWN) 211
%8 2010-00
%G eng

%0 Journal Article
%J PARA 2010
%D 2010
%T LINPACK on Future Manycore and GPU Based Systems
%A Jack Dongarra
%B PARA 2010
%C Reykjavik, Iceland
%8 2010-06
%G eng

%0 Conference Proceedings
%B Proceedings of the 17th EuroMPI conference
%D 2010
%T Locality and Topology aware Intra-node Communication Among Multicore CPUs
%A Teng Ma
%A Aurelien Bouteiller
%A George Bosilca
%A Jack Dongarra
%B Proceedings of the 17th EuroMPI conference
%I LNCS
%C Stuttgart, Germany
%8 2010-09
%G eng

%0 Journal Article
%J Sparse Days 2010 Meeting at CERFACS
%D 2010
%T MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%A Yohan Lee-Tin-Yien
%B Sparse Days 2010 Meeting at CERFACS
%C Toulouse, France
%8 2010-06
%G eng

%0 Conference Proceedings
%B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010)
%D 2010
%T Mixed-Tool Performance Analysis on Hybrid Multicore Architectures
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010)
%C San Diego, CA
%8 2010-09
%G eng

%0 Conference Proceedings
%B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10)
%D 2010
%T OpenCL Evaluation for Numerical Linear Algebra Library Development
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%K magma
%B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10)
%C Knoxville, TN
%8 2010-07
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2010
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaief
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%P 417-423
%8 2010-04
%G eng

%0 Conference Proceedings
%B Proceedings of the Cray Users' Group Meeting
%D 2010
%T Performance Evaluation for Petascale Quantum Simulation Tools
%A Stanimire Tomov
%A Wenchang Lu
%A Jerzy Bernholc
%A Shirley Moore
%A Jack Dongarra
%B Proceedings of the Cray Users' Group Meeting
%C Atlanta, GA
%8 2010-05
%G eng

%0 Generic
%D 2010
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-89-85
%8 2010-00
%G eng

%0 Journal Article
%J ICCS 2010
%D 2010
%T Proceedings of the International Conference on Computational Science
%E Peter M. Sloot
%E Geert Dick van Albada
%E Jack Dongarra
%B ICCS 2010
%I Elsevier
%C Amsterdam
%8 2010-05
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 2010
%T QCG-OMPI: MPI Applications on Grids
%A Emmanuel Agullo
%A Camille Coti
%A Thomas Herault
%A Julien Langou
%A Sylvain Peyronnet
%A A. Rezmerita
%A Franck Cappello
%A Jack Dongarra
%B Future Generation Computer Systems
%V 27
%P 357-369
%8 2010-03
%G eng

%0 Journal Article
%J Scientific Programming
%D 2010
%T QR Factorization for the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%B Scientific Programming
%V 17
%P 31-42
%8 2010-00
%G eng

%0 Conference Proceedings
%B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224)
%D 2010
%T QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
%A Emmanuel Agullo
%A Camille Coti
%A Jack Dongarra
%A Thomas Herault
%A Julien Langou
%B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224)
%C Atlanta, GA
%8 2010-04
%G eng

%0 Conference Proceedings
%B Proceedings of IPDPS 2011
%D 2010
%T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%K plasma
%B Proceedings of IPDPS 2011
%C Anchorage, AK
%8 2010-10
%G eng

%0 Conference Proceedings
%B EuroMPI 2010 Proceedings
%D 2010
%T Recent Advances in the Message Passing Interface, Lecture Notes in Computer Science (LNCS)
%E Rainer Keller
%E Edgar Gabriel
%E Michael Resch
%E Jack Dongarra
%B EuroMPI 2010 Proceedings
%I Springer
%C Stuttgart, Germany
%V 6305
%8 2010-09
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2010
%T Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution, and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A Julien Langou
%B ACM Transactions on Mathematical Software (TOMS)
%C Atlanta, GA
%V 37
%8 2010-04
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2010
%T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A Julien Langou
%B ACM Transactions on Mathematical Software (TOMS)
%V 37
%8 2010-04
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience (online version)
%D 2010
%T Redesigning the Message Logging Model for High Performance
%A Aurelien Bouteiller
%A George Bosilca
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience (online version)
%8 2010-06
%G eng

%0 Generic
%D 2010
%T Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling
%A Jack Dongarra
%A Piotr Luszczek
%K hpcc
%B University of Tennessee Computer Science Technical Report
%8 2010-10
%G eng

%0 Journal Article
%J PARA 2010
%D 2010
%T Scalability Study of a Quantum Simulation Code
%A Jerzy Bernholc
%A Miroslav Hodak
%A Wenchang Lu
%A Shirley Moore
%A Stanimire Tomov
%B PARA 2010
%C Reykjavik, Iceland
%8 2010-06
%G eng

%0 Journal Article
%J Proc. of VECPAR'10 (to appear)
%D 2010
%T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
%A Hatem Ltaief
%A Stanimire Tomov
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%K magma
%K plasma
%B Proc. of VECPAR'10 (to appear)
%C Berkeley, CA
%8 2010-06
%G eng

%0 Journal Article
%J SC'10
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaief
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B SC'10
%I ACM SIGARCH/IEEE Computer Society
%C New Orleans, LA
%8 2010-11
%G eng

%0 Generic
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaief
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%V UT-CS-10-653
%8 2010-04
%G eng

%0 Generic
%D 2010
%T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaief
%A Raymond Namyst
%A Rajib Nath
%A Jean Roman
%A Samuel Thibault
%A Stanimire Tomov
%I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster
%C Knoxville, TN
%8 2010-07
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2010
%T Scheduling Dense Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaief
%A Jack Dongarra
%A Rosa M. Badia
%K gridpac
%K plasma
%B Concurrency and Computation: Practice and Experience
%V 22
%P 15-44
%8 2010-01
%G eng

%0 Journal Article
%J Journal of Scientific Computing
%D 2010
%T Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures
%A Hatem Ltaief
%A Jakub Kurzak
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Journal of Scientific Computing
%V 18
%P 33-50
%8 2010-00
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 2010
%T Self-Healing Network for Scalable Fault-Tolerant Runtime Environments
%A Thara Angskun
%A Graham Fagg
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%A Jack Dongarra
%B Future Generation Computer Systems
%V 26
%P 479-485
%8 2010-03
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience (to appear)
%D 2010
%T SmartGridRPC: The new RPC model for high performance Grid Computing and Its Implementation in SmartGridSolve
%A Thomas Brady
%A Alexey Lastovetsky
%A Keith Seymour
%A Michele Guidolin
%A Jack Dongarra
%K netsolve
%B Concurrency and Computation: Practice and Experience (to appear)
%8 2010-01
%G eng

%0 Journal Article
%J Numerical Mathematics: Theory, Methods and Applications
%D 2010
%T Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D
%A Luc Giraud
%A Azzam Haidar
%A Yousef Saad
%E C. Zhiming
%B Numerical Mathematics: Theory, Methods and Applications
%I Global Science Press
%C Beijing
%V 3
%P 64-82
%8 2010-00
%G eng

%0 Conference Proceedings
%B 24th IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2010
%T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
%A Bilel Hadri
%A Emmanuel Agullo
%A Jack Dongarra
%B 24th IEEE International Parallel and Distributed Processing Symposium (submitted)
%8 2010-00
%G eng

%0 Journal Article
%J PARA 2010
%D 2010
%T Towards a Complexity Analysis of Sparse Hybrid Linear Solvers
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%B PARA 2010
%C Reykjavik, Iceland
%8 2010-06
%G eng

%0 Journal Article
%J Parallel Computing
%D 2010
%T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
%A Stanimire Tomov
%A Jack Dongarra
%A Marc Baboulin
%K magma
%B Parallel Computing
%V 36
%P 232-240
%8 2010-00
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2010
%T Trace-based Performance Analysis for the Petascale Simulation Code FLASH
%A Heike Jagode
%A Andreas Knuepfer
%A Jack Dongarra
%A Matthias Jurenz
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%B International Journal of High Performance Computing Applications (to appear)
%8 2010-00
%G eng

%0 Journal Article
%J FOSS4G 2010
%D 2010
%T Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures
%A Peng Du
%A Matthew Parsons
%A Erika Fuentes
%A Shih-Lung Shaw
%A Jack Dongarra
%K magma
%B FOSS4G 2010
%C Barcelona, Spain
%8 2010-09
%G eng

%0 Journal Article
%J PGI Insider
%D 2010
%T Using MAGMA with PGI Fortran
%A Stanimire Tomov
%A Mathieu Faverge
%A Piotr Luszczek
%A Jack Dongarra
%K magma
%B PGI Insider
%8 2010-11
%G eng

%0 Journal Article
%J Parallel Computing
%D 2010
%T Using multiple levels of parallelism to enhance the performance of domain decomposition solvers
%A Luc Giraud
%A Azzam Haidar
%A Stephane Pralet
%E Costas Bekas
%E Pasqua D'Ambra
%E Ananth Grama
%E Yousef Saad
%E Petko Yanev
%B Parallel Computing
%I Elsevier journals
%V 36
%P 285-296
%8 2010-00
%G eng

%0 Journal Article
%J Computer Physics Communications
%D 2009
%T Accelerating Scientific Computations with Mixed Precision Algorithms
%A Marc Baboulin
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%X On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
%B Computer Physics Communications
%V 180
%P 2526-2533
%8 2009-12
%G eng
%N 12
%R https://doi.org/10.1016/j.cpc.2008.11.005

%0 Generic
%D 2009
%T Accelerating the Reduction to Upper Hessenberg Form through Hybrid GPU-Based Computing
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-09-642 (also LAPACK Working Note 219)
%8 2009-05
%G eng

%0 Journal Article
%J SciDAC Review
%D 2009
%T Accelerating Time-To-Solution for Computational Science and Engineering
%A James Demmel
%A Jack Dongarra
%A Armando Fox
%A Sam Williams
%A Vasily Volkov
%A Katherine Yelick
%B SciDAC Review
%8 2009-00
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2009
%T Algorithmic Based Fault Tolerance Applied to High Performance Computing
%A Jack Dongarra
%A George Bosilca
%A Remi Delmas
%A Julien Langou
%B Journal of Parallel and Distributed Computing
%V 69
%P 410-416
%8 2009-00
%G eng

%0 Journal Article
%J IEEE Cluster 2009
%D 2009
%T Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%K gridpac
%K mumi
%B IEEE Cluster 2009
%C New Orleans
%8 2009-08
%G eng

%0 Journal Article
%J International Journal of Parallel Programming
%D 2009
%T Capturing and Analyzing the Execution Control Flow of OpenMP Applications
%A Karl Fürlinger
%A Shirley Moore
%B International Journal of Parallel Programming
%V 37
%P 266-276
%8 2009-00
%G eng

%0 Journal Article
%J Parallel Computing
%D 2009
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B Parallel Computing
%V 35
%P 38-53
%8 2009-00
%G eng

%0 Conference Proceedings
%B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear)
%D 2009
%T Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware
%A Emmanuel Agullo
%A Bilel Hadri
%A Hatem Ltaief
%A Jack Dongarra
%B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear)
%8 2009-00
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science: Theoretical Computer Science and General Issues
%D 2009
%T Computational Science – ICCS 2009, Proceedings of the 9th International Conference
%E Gabrielle Allen
%E Jarosław Nabrzyski
%E E. Seidel
%E Geert Dick van Albada
%E Jack Dongarra
%E Peter M. Sloot
%B Lecture Notes in Computer Science: Theoretical Computer Science and General Issues
%C Baton Rouge, LA
%8 2009-05
%G eng

%0 Journal Article
%J Numerical Linear Algebra with Applications
%D 2009
%T Computing the Conditioning of the Components of a Linear Least-squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B Numerical Linear Algebra with Applications
%V 16
%P 517-533
%8 2009-00
%G eng

%0 Generic
%D 2009
%T Constructing resilient communication infrastructure for runtime environments
%A George Bosilca
%A Camille Coti
%A Thomas Herault
%A Pierre Lemarinier
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%8 2009-07
%G eng

%0 Journal Article
%J ParCo 2009
%D 2009
%T Constructing Resilient Communication Infrastructure for Runtime Environments
%A Pierre Lemarinier
%A George Bosilca
%A Camille Coti
%A Thomas Herault
%A Jack Dongarra
%B ParCo 2009
%C Lyon, France
%8 2009-09
%G eng

%0 Journal Article
%J PPAM 2009
%D 2009
%T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems
%A Jakub Kurzak
%A Hatem Ltaief
%A Jack Dongarra
%A Rosa M. Badia
%B PPAM 2009
%C Poland
%8 2009-09
%G eng

%0 Conference Proceedings
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%D 2009
%T Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems
%A Fengguang Song
%A Asim YarKhan
%A Jack Dongarra
%K mumi
%K plasma
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%C Portland, OR
%8 2009-11
%G eng

%0 Journal Article
%J Submitted to Transactions on Parallel and Distributed Systems
%D 2009
%T Enhancing Parallelism of Tile QR Factorization for Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaief
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B Submitted to Transactions on Parallel and Distributed Systems
%8 2009-12
%G eng

%0 Generic
%D 2009
%T Fully Dynamic Scheduler for Numerical Computing on Multicore Processors
%A Jakub Kurzak
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report, UT-CS-09-643 (Also LAPACK Working Note 220)
%8 2009-00
%G eng

%0 Conference Proceedings
%B Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering
%D 2009
%T Grid Computing applied to the Boundary Element Method
%A Manoel Cunha
%A Jose Telles
%A Asim YarKhan
%A Jack Dongarra
%E B. H. V. Topping
%E Peter Iványi
%K netsolve
%B Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering
%I Civil-Comp Press
%C Stirlingshire, UK
%V 27
%8 2009-00
%G eng

%0 Journal Article
%J IEEE Transactions on Computers
%D 2009
%T Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
%A Zizhong Chen
%A Jack Dongarra
%X As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2\lceil \log p \rceil \cdot k((\beta + 2\gamma)m + \alpha)$ to $(1 + O(\sqrt{p}/\sqrt{m})) \cdot 2k(\beta + 2\gamma)m$, where $\alpha$ is the communication latency, $1/\beta$ is the network bandwidth between processes, $1/\gamma$ is the rate to perform calculations, and $m$ is the size of local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to $(1 + O(1/\sqrt{m})) \cdot k(\beta + 2\gamma)m$, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example.
%B IEEE Transactions on Computers
%V 58
%P 1512-1524
%8 2009-11
%G eng
%N 11
%R https://doi.org/10.1109/TC.2009.42

%0 Conference Proceedings
%B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing
%D 2009
%T A Holistic Approach for Performance Measurement and Analysis for Petascale Applications
%A Heike Jagode
%A Jack Dongarra
%A Sadaf Alam
%A Jeffrey Vetter
%A W. Spear
%A Allen D. Malony
%E Gabrielle Allen
%K point
%K test
%B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing
%I Springer-Verlag Berlin Heidelberg
%C Baton Rouge, Louisiana
%V 2009
%P 686-695
%8 2009-05
%G eng

%0 Journal Article
%J Euro-Par 2009, Lecture Notes in Computer Science
%D 2009
%T Impact of Quad-core Cray XT4 System and Software Stack on Scientific Computation
%A Sadaf Alam
%A Richard F. Barrett
%A Heike Jagode
%A J. A. Kuehn
%A Steve W. Poole
%A R. Sankaran
%K test
%B Euro-Par 2009, Lecture Notes in Computer Science
%I Springer Berlin / Heidelberg
%C Delft, The Netherlands
%V 5704
%P 334-344
%8 2009-08
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2009
%T The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community
%A Jack Dongarra
%A Pete Beckman
%A Patrick Aerts
%A Franck Cappello
%A Thomas Lippert
%A Satoshi Matsuoka
%A Paul Messina
%A Terry Moore
%A Rick Stevens
%A Anne Trefethen
%A Mateo Valero
%B International Journal of High Performance Computing Applications (to appear)
%8 2009-07
%G eng

%0 Journal Article
%J ISC'09
%D 2009
%T I/O Performance Analysis for the Petascale Simulation Code FLASH
%A Heike Jagode
%A Shirley Moore
%A Dan Terpstra
%A Jack Dongarra
%A Andreas Knuepfer
%A Matthias Jurenz
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%K test
%B ISC'09
%C Hamburg, Germany
%8 2009-06
%G eng

%0 Conference Proceedings
%B Proceedings of DoD HPCMP UGC 2009
%D 2009
%T Making Performance Analysis and Tuning Part of the Software Development Cycle
%A Ricardo Portillo
%A Patricia J. Teller
%A David Cronk
%A Shirley Moore
%B Proceedings of DoD HPCMP UGC 2009
%I IEEE
%C San Diego, CA
%8 2009-06
%G eng

%0 Conference Proceedings
%B SciDAC 2009, Journal of Physics: Conference Series
%D 2009
%T Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team
%A Bronis R. de Supinski
%A Sadaf Alam
%A David Bailey
%A Laura Carrington
%A Chris Daley
%A Anshu Dubey
%A Todd Gamblin
%A Dan Gunter
%A Paul D. Hovland
%A Heike Jagode
%A Karen Karavanic
%A Gabriel Marin
%A John Mellor-Crummey
%A Shirley Moore
%A Boyana Norris
%A Leonid Oliker
%A Catherine Olschanowsky
%A Philip C. Roth
%A Martin Schulz
%A Sameer Shende
%A Allan Snavely
%K test
%B SciDAC 2009, Journal of Physics: Conference Series
%I IOP Publishing
%C San Diego, California
%V 180
%P 012039
%8 2009-07
%G eng

%0 Conference Proceedings
%B Proceedings of the 23rd annual International Conference on Supercomputing (ICS '09)
%D 2009
%T MPI-aware Compiler Optimizations for Improving Communication-Computation Overlap
%A Anthony Danalis
%A Lori Pollock
%A Martin Swany
%A John Cavazos
%B Proceedings of the 23rd annual International Conference on Supercomputing (ICS '09)
%I ACM
%C Yorktown Heights, NY, USA
%P 316-325
%8 2009-06
%G eng

%0 Conference Proceedings
%B 9th International Conference on Computational Science (ICCS 2009)
%D 2009
%T A Note on Auto-tuning GEMM for GPUs
%A Yinan Li
%A Jack Dongarra
%A Stanimire Tomov
%E Gabrielle Allen
%E Jarosław Nabrzyski
%E E. Seidel
%E Geert Dick van Albada
%E Jack Dongarra
%E Peter M. Sloot
%B 9th International Conference on Computational Science (ICCS 2009)
%C Baton Rouge, LA
%P 884-892
%8 2009-05
%G eng
%R 10.1007/978-3-642-01970-8_89

%0 Conference Proceedings
%B Journal of Physics: Conference Series
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaief
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K plasma
%B Journal of Physics: Conference Series
%V 180
%8 2009-00
%G eng

%0 Generic
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaief
%A Piotr Luszczek
%A Rajib Nath
%A Stanimire Tomov
%A Asim YarKhan
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, OR
%8 2009-11
%G eng

%0 Generic
%D 2009
%T Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project
%A Rajib Nath
%A Jack Dongarra
%A Stanimire Tomov
%A Hatem Ltaief
%A Peng Du
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, Oregon
%8 2009-11
%G eng

%0 Journal Article
%J Parallel Computing
%D 2009
%T Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor
%A Wesley Alvaro
%A Jakub Kurzak
%A Jack Dongarra
%B Parallel Computing
%V 35
%P 138-150
%8 2009-00
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (to appear)
%D 2009
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaief
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems (to appear)
%8 2009-05
%G eng

%0 Journal Article
%J in Cyberinfrastructure Technologies and Applications
%D 2009
%T Parallel Dense Linear Algebra Software in the Multicore Era
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%E Junwei Cao
%K plasma
%B in Cyberinfrastructure Technologies and Applications
%I Nova Science Publishers, Inc.
%P 9-24
%8 2009-00
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2009
%T Parallel Programming in MATLAB
%A Piotr Luszczek
%K lfc
%K plasma
%B The International Journal of High Performance Computing Applications
%V 23
%P 277-283
%8 2009-07
%G eng

%0 Journal Article
%J Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%D 2009
%T Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Dmitrii Zagorodnov
%A Jack Dongarra
%A Rich Wolski
%B Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%I Springer Netherlands
%V 12
%P 101-122
%8 2009-00
%G eng

%0 Conference Proceedings
%B Proceedings of CUG09
%D 2009
%T Performance evaluation for petascale quantum simulation tools
%A Stanimire Tomov
%A Wenchang Lu
%A Jerzy Bernholc
%A Shirley Moore
%A Jack Dongarra
%K doe-nano
%B Proceedings of CUG09
%C Atlanta, GA
%8 2009-05
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2009
%T The Problem with the Linpack Benchmark Matrix Generator
%A Julien Langou
%A Jack Dongarra
%K hpl
%B International Journal of High Performance Computing Applications
%V 23
%P 5-14
%8 2009-00
%G eng

%0 Journal Article
%J Scientific Programming (to appear)
%D 2009
%T QR Factorization for the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B Scientific Programming (to appear)
%8 2009-00
%G eng

%0 Conference Paper
%B CLUSTER '09
%D 2009
%T Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery
%A Aurelien Bouteiller
%A Thomas Ropars
%A George Bosilca
%A Christine Morin
%A Jack Dongarra
%K fault tolerant computing
%K libraries
%K message passing
%K parallel machines
%K protocols
%X With the growing scale of high performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proven to tolerate higher failure rates. However, this advantage comes at the expense of a higher overhead on communications, due to latency-intrusive logging of events to stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, both using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable.
%B CLUSTER '09
%I IEEE
%C New Orleans
%8 2009-08
%G eng
%R 10.1109/CLUSTR.2009.5289157

%0 Journal Article
%J in Birth of Numerical Analysis (to appear)
%D 2009
%T Recent Trends in High Performance Computing
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%B in Birth of Numerical Analysis (to appear)
%8 2009-00
%G eng

%0 Journal Article
%J Future Generation Computing Systems
%D 2009
%T Recording the Control Flow of Parallel Applications to Determine Iterative and Phase-Based Behavior
%A Karl Fürlinger
%A Shirley Moore
%B Future Generation Computing Systems
%V 26
%P 162-166
%8 2009-00
%G eng

%0 Journal Article
%J ACM TOMS (to appear)
%D 2009
%T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A Julien Langou
%B ACM TOMS (to appear)
%8 2009-00
%G eng

%0 Journal Article
%J in Handbook of Research on Scalable Computing Technologies (to appear)
%D 2009
%T Reliability and Performance Modeling and Analysis for Grid Computing
%A Yuan-Shun Dai
%A Jack Dongarra
%E Kuan-Ching Li
%E Ching-Hsien Hsu
%E Laurence Yang
%E Jack Dongarra
%E Hans Zima
%B in Handbook of Research on Scalable Computing Technologies (to appear)
%I IGI Global
%P 219-245
%8 2009-00
%G eng

%0 Conference Proceedings
%B The International Conference on Computational Science 2009 (ICCS 2009)
%D 2009
%T A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%K plasma
%B The International Conference on Computational Science 2009 (ICCS 2009)
%C Baton Rouge, LA
%V 5544
%P 195-204
%8 2009-05
%G eng

%0 Generic
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaief
%A Jack Dongarra
%A Rosa M. Badia
%B University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213)
%8 2009-00
%G eng

%0 Journal Article
%J Concurrency Practice and Experience (to appear)
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaief
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Concurrency Practice and Experience (to appear)
%8 2009-00
%G eng

%0 Generic
%D 2009
%T Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaief
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B Innovative Computing Laboratory Technical Report (also LAPACK Working Note 222 and CS Tech Report UT-CS-09-645)
%8 2009-09
%G eng

%0 Conference Proceedings
%B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
%D 2009
%T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaief
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
%C Atlanta, GA
%8 2009-12
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting
%D 2009
%T Towards Efficient MapReduce Using MPI
%A Torsten Hoefler
%A Yuan-Shun Dai
%A Jack Dongarra
%E M. Ropo
%E J. Westerholm
%E Jack Dongarra
%B Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting
%I Springer Berlin / Heidelberg
%C Espoo, Finland
%V 5759
%P 240-249
%8 2009-00
%G eng

%0 Generic
%D 2009
%T Trace-based Performance Analysis for the Petascale Simulation Code FLASH
%A Heike Jagode
%A Andreas Knuepfer
%A Jack Dongarra
%A Matthias Jurenz
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%K test
%B Innovative Computing Laboratory Technical Report
%8 2009-04
%G eng

%0 Journal Article
%J in Cloud Computing and Software Services: Theory and Techniques (to appear)
%D 2009
%T Transparent Cross-Platform Access to Software Services using GridSolve and GridRPC
%A Keith Seymour
%A Asim YarKhan
%A Jack Dongarra
%E Syed Ahson
%E Mohammad Ilyas
%K netsolve
%B in Cloud Computing and Software Services: Theory and Techniques (to appear)
%I CRC Press
%8 2009-00
%G eng

%0 Conference Proceedings
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%D 2009
%T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
%A Lavanya Ramakrishnan
%A Daniel Nurmi
%A Anirban Mandal
%A Charles Koelbel
%A Dennis Gannon
%A Mark Huang
%A Yang-Suk Kee
%A Graziano Obertelli
%A Kiran Thyagaraja
%A Rich Wolski
%A Asim YarKhan
%A Dmitrii Zagorodnov
%K grads
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%C Portland, OR
%8 2009-00
%G eng

%0 Journal Article
%J 15th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science
%D 2008
%E Alexey Lastovetsky
%E Tahar Kechadi
%E Jack Dongarra
%B 15th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science
%I Springer Berlin
%C Dublin, Ireland
%V 5205
%8 2008-01
%G eng

%0 Conference Proceedings
%B 7th International Parallel Processing and Applied Mathematics Conference, Lecture Notes in Computer Science
%D 2008
%E Roman Wyrzykowski
%E Jack Dongarra
%E Konrad Karczewski
%E Jerzy Wasniewski
%B 7th International Parallel Processing and Applied Mathematics Conference, Lecture Notes in Computer Science
%I Springer Berlin
%C Gdansk, Poland
%V 4967
%8 2008-01
%G eng

%0 Conference Proceedings
%B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science
%D 2008
%E Marian Bubak
%E Geert Dick van Albada
%E Jack Dongarra
%E Peter M. Sloot
%B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science
%I Springer Berlin
%C Krakow, Poland
%V 5101
%8 2008-01
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2008
%T Algorithm-Based Fault Tolerance for Fail-Stop Failures
%A Zizhong Chen
%A Jack Dongarra
%K FT-MPI
%K lapack
%K scalapack
%B IEEE Transactions on Parallel and Distributed Systems
%V 19
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Algorithmic Based Fault Tolerance Applied to High Performance Computing
%A George Bosilca
%A Remi Delmas
%A Jack Dongarra
%A Julien Langou
%B University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205)
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Analytical Modeling for Affinity-Based Thread Scheduling on Multicore Platforms
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-08-626
%8 2008-01
%G eng

%0 Conference Proceedings
%B The 3rd International Workshop on Automatic Performance Tuning
%D 2008
%T A Comparison of Search Heuristics for Empirical Code Optimization
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%K gco
%B The 3rd International Workshop on Automatic Performance Tuning
%C Tsukuba, Japan
%8 2008-10
%G eng

%0 Journal Article
%J VECPAR '08, High Performance Computing for Computational Science
%D 2008
%T Computing the Conditioning of the Components of a Linear Least Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B VECPAR '08, High Performance Computing for Computational Science
%C Toulouse, France
%8 2008-01
%G eng

%0 Conference Proceedings
%B Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-08)
%D 2008
%T Custom assignment of MPI ranks for parallel multi-dimensional FFTs: Evaluation of BG/P versus BG/L
%A Heike Jagode
%A Joachim Hein
%B Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-08)
%I IEEE Computer Society
%C Sydney, Australia
%P 271-283
%8 2008-01
%G eng

%0 Journal Article
%J in Advances in Computers
%D 2008
%T DARPA's HPCS Program: History, Models, Tools, Languages
%A Jack Dongarra
%A Robert Graybill
%A William Harrod
%A Robert Lucas
%A Ewing Lusk
%A Piotr Luszczek
%A Janice McMahon
%A Allan Snavely
%A Jeffrey Vetter
%A Katherine Yelick
%A Sadaf Alam
%A Roy Campbell
%A Laura Carrington
%A Tzu-Yi Chen
%A Omid Khalili
%A Jeremy Meredith
%A Mustafa Tikir
%E M. Zelkowitz
%B in Advances in Computers
%I Elsevier
%V 72
%8 2008-01
%G eng

%0 Conference Proceedings
%B Proceedings of the 2008 International Conference on Computational Science (ICCS 2008)
%D 2008
%T Detection and Analysis of Iterative Behavior in Parallel Applications
%A Karl Fürlinger
%A Shirley Moore
%K point
%B Proceedings of the 2008 International Conference on Computational Science (ICCS 2008)
%C Krakow, Poland
%V 5103
%P 261-267
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project)
%A Marc Baboulin
%A James Demmel
%A Jack Dongarra
%A Stanimire Tomov
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08)
%C Austin, TX
%8 2008-11
%G eng

%0 Journal Article
%J in High Performance Computing and Grids in Action
%D 2008
%T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%E Lucio Grandinetti
%B in High Performance Computing and Grids in Action
%I IOS Press
%C Amsterdam
%8 2008-01
%G eng

%0 Conference Proceedings
%B Proceedings of the DoD HPCMP User Group Conference
%D 2008
%T Exploring New Architectures in Accelerating CFD for Air Force Applications
%A Jack Dongarra
%A Shirley Moore
%A Gregory D. Peterson
%A Stanimire Tomov
%A Jeff Allred
%A Vincent Natoli
%A David Richie
%K magma
%B Proceedings of the DoD HPCMP User Group Conference
%C Seattle, Washington
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the CELL Processor
%A Wesley Alvaro
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%8 2008-01
%G eng

%0 Conference Proceedings
%B 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008)
%D 2008
%T Fault Tolerance Management for a Hierarchical GridRPC Middleware
%A Aurelien Bouteiller
%A Frederic Desprez
%B 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008)
%C Lyon, France
%8 2008-01
%G eng

%0 Journal Article
%J Recent developments in Grid Technology and Applications
%D 2008
%T High Performance GridRPC Middleware
%A Yves Caniou
%A Eddy Caron
%A Frederic Desprez
%A Hidemoto Nakada
%A Yoshio Tanaka
%A Keith Seymour
%E George A. Gravvanis
%E John P. Morrison
%E Hamid R. Arabnia
%E D. A. Power
%K netsolve
%B Recent developments in Grid Technology and Applications
%I Nova Science Publishers
%8 2008-00
%G eng

%0 Journal Article
%J in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14)
%D 2008
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E G. Wilson
%B in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14)
%P 243-282
%8 2008-01
%G eng

%0 Generic
%D 2008
%T HPCS Library Study Effort
%A Jack Dongarra
%A James Demmel
%A Parry Husbands
%A Piotr Luszczek
%B University of Tennessee Computer Science Technical Report, UT-CS-08-617
%8 2008-01
%G eng

%0 Conference Proceedings
%B ACM/IEEE International Symposium on High Performance Distributed Computing
%D 2008
%T The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%A Rich Wolski
%K gco
%K netsolve
%B ACM/IEEE International Symposium on High Performance Distributed Computing
%C Boston, MA
%8 2008-06
%G eng

%0 Journal Article
%J Computing and Informatics
%D 2008
%T Interactive Grid-Access Using Gridsolve and Giggle
%A Marcus Hardt
%A Keith Seymour
%A Jack Dongarra
%A Michael Zapf
%A Nicole Ruiter
%K netsolve
%B Computing and Informatics
%V 27
%P 233-248
%@ 1335-9150
%8 2008-00
%G eng

%0 Conference Proceedings
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%D 2008
%T Interior State Computation of Nano Structures
%A Andrew Canning
%A Jack Dongarra
%A Julien Langou
%A Osni Marques
%A Stanimire Tomov
%A Christof Voemel
%A Lin-Wang Wang
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%C Trondheim, Norway
%8 2008-05
%G eng

%0 Journal Article
%J Concurrency: Practice and Experience
%D 2008
%T The LINPACK Benchmark: Past, Present, and Future
%A Jack Dongarra
%A Piotr Luszczek
%A Antoine Petitet
%K hpl
%B Concurrency: Practice and Experience
%V 15
%P 803-820
%8 2008-00
%G eng

%0 Conference Proceedings
%B 2008 PPoPP Conference
%D 2008
%T Matrix Product on Heterogeneous Master Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Frederic Vivien
%B 2008 PPoPP Conference
%C Salt Lake City, Utah
%8 2008-01
%G eng

%0 Journal Article
%J IEEE Annals of the History of Computing
%D 2008
%T Netlib and NA-Net: Building a Scientific Computing Community
%A Jack Dongarra
%A Gene H. Golub
%A Eric Grosse
%A Cleve Moler
%A Keith Moore
%B IEEE Annals of the History of Computing
%V 30
%P 30-41
%8 2008-01
%G eng

%0 Conference Proceedings
%B Proc. 2008 IEEE International Conference on Cluster Computing (CLUSTER 2008)
%D 2008
%T OpenMP-centric Performance Analysis of Hybrid Applications
%A Karl Fürlinger
%A Shirley Moore
%B Proc. 2008 IEEE International Conference on Cluster Computing (CLUSTER 2008)
%C Tsukuba, Japan
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited
%A Hatem Ltaief
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208)
%8 2008-08
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2008
%T Parallel Tiled QR Factorization for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 20
%P 1573-1590
%8 2008-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming
%D 2008
%T Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications
%A Oscar Hernandez
%A Fengguang Song
%A Barbara Chapman
%A Jack Dongarra
%A Bernd Mohr
%A Shirley Moore
%A Felix Wolf
%B Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming
%I Springer Berlin / Heidelberg
%V 4315
%8 2008-00
%G eng

%0 Generic
%D 2008
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, CS-89-85
%8 2008-01
%G eng

%0 Journal Article
%J Proc. SciDAC 2008
%D 2008
%T PERI Auto-tuning
%A David Bailey
%A Jacqueline Chame
%A Chun Chen
%A Jack Dongarra
%A Mary Hall
%A Jeffrey K. Hollingsworth
%A Paul D. Hovland
%A Shirley Moore
%A Keith Seymour
%A Jaewook Shin
%A Ananta Tiwari
%A Sam Williams
%A Haihang You
%K gco
%B Proc. SciDAC 2008
%I Journal of Physics
%C Seattle, Washington
%V 125
%8 2008-01
%G eng

%0 Journal Article
%J Computing in Science and Engineering
%D 2008
%T The PlayStation 3 for High Performance Scientific Computing
%A Jakub Kurzak
%A Alfredo Buttari
%A Piotr Luszczek
%A Jack Dongarra
%B Computing in Science and Engineering
%P 80-83
%8 2008-01
%G eng

%0 Generic
%D 2008
%T The PlayStation 3 for High Performance Scientific Computing
%A Jakub Kurzak
%A Alfredo Buttari
%A Piotr Luszczek
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2008-01
%G eng

%0 Generic
%D 2008
%T The Problem with the Linpack Benchmark Matrix Generator
%A Jack Dongarra
%A Julien Langou
%B University of Tennessee Computer Science Technical Report, UT-CS-08-621 (also LAPACK Working Note 206)
%8 2008-06
%G eng

%0 Generic
%D 2008
%T QR Factorization for the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-616 (also LAPACK Working Note 201)
%8 2008-05
%G eng

%0 Generic
%D 2008
%T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-08-614 (also LAPACK Working Note 199)
%8 2008-04
%G eng

%0 Conference Proceedings
%B International Supercomputer Conference (ISC 2008)
%D 2008
%T Redesigning the Message Logging Model for High Performance
%A Aurelien Bouteiller
%A George Bosilca
%A Jack Dongarra
%B International Supercomputer Conference (ISC 2008)
%C Dresden, Germany
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Request Sequencing: Enabling Workflow for Efficient Parallel Problem Solving in GridSolve
%A Yinan Li
%A Jack Dongarra
%K netsolve
%B ICL Technical Report
%8 2008-04
%G eng

%0 Conference Proceedings
%B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted)
%D 2008
%T Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve
%A Yinan Li
%A Jack Dongarra
%A Keith Seymour
%A Asim YarKhan
%B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted)
%C Shenzhen, China
%8 2008-10
%G eng

%0 Journal Article
%J International Journal of Foundations of Computer Science (IJFCS)
%D 2008
%T Revisiting Matrix Product on Master-Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Zhiao Shi
%A Frederic Vivien
%B International Journal of Foundations of Computer Science (IJFCS)
%V 19
%P 1317-1336
%8 2008-12
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2008
%T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%V 19
%P 1-11
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Stanimire Tomov
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200)
%8 2008-01
%G eng

%0 Conference Proceedings
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%D 2008
%T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures
%A Marc Baboulin
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%C Trondheim, Norway
%8 2008-05
%G eng

%0 Journal Article
%J Journal of Computational Physics
%D 2008
%T State-of-the-Art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-Systems
%A Christof Voemel
%A Stanimire Tomov
%A Osni Marques
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%B Journal of Computational Physics
%V 227
%P 7113-7124
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Task placement of parallel multi-dimensional FFTs on a mesh communication network
%A Heike Jagode
%A Joachim Hein
%A Arthur Trew
%B University of Tennessee Computer Science Technical Report
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
%A Stanimire Tomov
%A Jack Dongarra
%A Marc Baboulin
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210)
%8 2008-01
%G eng

%0 Journal Article
%J Computing in Science and Engineering
%D 2008
%T A Tribute to Gene Golub
%A Jack Dongarra
%B Computing in Science and Engineering
%I IEEE
%P 5
%8 2008-01
%G eng

%0 Conference Proceedings
%B Proceedings of the 2nd International Workshop on Tools for High Performance Computing
%D 2008
%T Usage of the Scalasca Toolset for Scalable Performance Analysis of Large-scale Parallel Applications
%A Felix Wolf
%A Brian Wylie
%A Erika Abraham
%A Wolfgang Frings
%A Karl Fürlinger
%A Markus Geimer
%A Marc-Andre Hermanns
%A Bernd Mohr
%A Shirley Moore
%A Matthias Pfeifer
%E Michael Resch
%E Rainer Keller
%E Valentin Himmler
%E Bettina Krammer
%E A. Schulz
%K point
%B Proceedings of the 2nd International Workshop on Tools for High Performance Computing
%I Springer
%C Stuttgart, Germany
%P 157-167
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Using dual techniques to derive componentwise and mixed condition numbers for a linear functional of a linear least squares solution
%A Marc Baboulin
%A Serge Gratton
%B University of Tennessee Computer Science Technical Report, UT-CS-08-622 (also LAPACK Working Note 207)
%8 2008-01
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2008
%T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%K plasma
%B ACM Transactions on Mathematical Software
%V 34
%P 17-22
%8 2008-00
%G eng

%0 Conference Proceedings
%B Proc. 4th International Workshop on OpenMP (IWOMP 2008)
%D 2008
%T Visualizing the Program Execution Control Flow of OpenMP Applications
%A Karl Fürlinger
%A Shirley Moore
%B Proc. 4th International Workshop on OpenMP (IWOMP 2008)
%I Lecture Notes in Computer Science 5004
%C West Lafayette, Indiana
%P 181-190
%8 2008-01
%G eng

%0 Generic
%D 2007
%T Automated Empirical Tuning of a Multiresolution Analysis Kernel
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%A Shirley Moore
%K gco
%B ICL Technical Report
%P 10
%8 2007-01
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2007
%T Automatic Analysis of Inefficiency Patterns in Parallel Applications
%A Felix Wolf
%A Bernd Mohr
%A Jack Dongarra
%A Shirley Moore
%B Concurrency and Computation: Practice and Experience
%V 19
%P 1481-1496
%8 2007-08
%G eng

%0 Conference Proceedings
%B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07)
%D 2007
%T Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07)
%I Springer
%C Niagara Falls, Canada
%8 2007-08
%G eng

%0 Conference Proceedings
%B 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted)
%D 2007
%T Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems
%A Jack Dongarra
%A Emmanuel Jeannot
%A Erik Saule
%A Zhiao Shi
%B 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted)
%C San Diego, CA
%8 2007-06
%G eng

%0 Generic
%D 2007
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%8 2007-01
%G eng

%0 Journal Article
%J Cray User Group, CUG 2007
%D 2007
%T A Comparison of Application Performance Using Open MPI and Cray MPI
%A Richard L. Graham
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%B Cray User Group, CUG 2007
%8 2007-05
%G eng

%0 Generic
%D 2007
%T Computing the Conditioning of the Components of a Linear Least Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B University of Tennessee Computer Science Technical Report
%8 2007-01
%G eng

%0 Conference Proceedings
%B Proceedings of the 2007 Conference on Parallel Computing (PARCO 2007)
%D 2007
%T Continuous Runtime Profiling of OpenMP Applications
%A Karl Fürlinger
%A Shirley Moore
%K kojak
%B Proceedings of the 2007 Conference on Parallel Computing (PARCO 2007)
%C Juelich and Aachen, Germany
%8 2007-01
%G eng

%0 Journal Article
%J DOE SciDAC Review (to appear)
%D 2007
%T Creating Software Technology to Harness the Power of Leadership-class Computing Systems
%A John Mellor-Crummey
%A Pete Beckman
%A Jack Dongarra
%A Barton Miller
%A Katherine Yelick
%B DOE SciDAC Review (to appear)
%8 2007-06
%G eng

%0 Journal Article
%J Euro-Par 2007
%D 2007
%T Decision Trees and MPI Collective Algorithm Selection Problem
%A Jelena Pjesivac-Grbovic
%A George Bosilca
%A Graham Fagg
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B Euro-Par 2007
%I Springer
%C Rennes, France
%P 105-115
%8 2007-08
%G eng

%0 Journal Article
%J in Petascale Computing: Algorithms and Applications (to appear)
%D 2007
%T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach
%A Jack Dongarra
%A Zizhong Chen
%A George Bosilca
%A Julien Langou
%B in Petascale Computing: Algorithms and Applications (to appear)
%I Chapman & Hall - CRC Press
%8 2007-00
%G eng

%0 Generic
%D 2007
%T Empirical Tuning of a Multiresolution Analysis Kernel using a Specialized Code Generator
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%A Shirley Moore
%K gco
%B ICL Technical Report
%8 2007-01
%G eng

%0 Journal Article
%J EuroPVM/MPI 2007
%D 2007
%T An Evaluation of Open MPI's Matching Transport Layer on the Cray XT
%A Richard L. Graham
%A Ron Brightwell
%A Brian Barrett
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%B EuroPVM/MPI 2007
%8 2007-09
%G eng

%0 Journal Article
%J In High Performance Computing and Grids in Action (to appear)
%D 2007
%T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Julie Langou
%A Piotr Luszczek
%A Stanimire Tomov
%E Lucio Grandinetti
%B In High Performance Computing and Grids in Action (to appear)
%I IOS Press
%C Amsterdam
%8 2007-00
%G eng

%0 Conference Proceedings
%B IEEE International Symposium on High Performance Distributed Computing
%D 2007
%T Feedback-Directed Thread Scheduling with Memory Considerations
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B IEEE International Symposium on High Performance Distributed Computing
%C Monterey Bay, CA
%8 2007-06
%G eng

%0 Conference Proceedings
%B Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006)
%D 2007
%T GridSolve: The Evolution of Network Enabled Solver
%A Asim YarKhan
%A Jack Dongarra
%A Keith Seymour
%E Patrick Gaffney
%K netsolve
%B Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006)
%I Springer
%P 215-226
%8 2007-00
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2007
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Jack Dongarra
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 21
%P 360-369
%8 2007-00
%G eng

%0 Journal Article
%J in Beautiful Code Leading Programmers Explain How They Think
%D 2007
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E Greg Wilson
%B in Beautiful Code Leading Programmers Explain How They Think
%I O'Reilly Media, Inc.
%8 2007-06
%G eng

%0 Journal Article
%J CTWatch Quarterly
%D 2007
%T The Impact of Multicore on Computational Science Software
%A Jack Dongarra
%A Dennis Gannon
%A Geoffrey Fox
%A Ken Kennedy
%B CTWatch Quarterly
%V 3
%8 2007-02
%G eng
%N 1

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2007
%T Implementation of Mixed Precision in Solving Systems of Linear Equations on the Cell Processor
%A Jakub Kurzak
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 19
%P 1371-1385
%8 2007-07
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 2007
%T Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware
%A Emmanuel Jeannot
%A Keith Seymour
%A Asim YarKhan
%A Jack Dongarra
%B Parallel Processing Letters
%V 17
%P 47-59
%8 2007-03
%G eng

%0 Conference Proceedings
%B Proceedings of the 2007 International Conference on Parallel Processing
%D 2007
%T L2 Cache Modeling for Scientific Applications on Chip Multi-Processors
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B Proceedings of the 2007 International Conference on Parallel Processing
%I IEEE Computer Society
%C Xi'an, China
%8 2007-01
%G eng

%0 Generic
%D 2007
%T Limitations of the Playstation 3 for High Performance Cluster Computing
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%B University of Tennessee Computer Science Technical Report, UT-CS-07-597 (Also LAPACK Working Note 185)
%8 2007-00
%G eng

%0 Conference Proceedings
%B Proc. DoD HPCMP Users Group Conference (HPCMP-UGC'07)
%D 2007
%T Memory Leak Detection in Fortran Applications using TAU
%A Sameer Shende
%A Allen D. Malony
%A Shirley Moore
%A David Cronk
%B Proc. DoD HPCMP Users Group Conference (HPCMP-UGC'07)
%I IEEE Computer Society
%C Pittsburgh, PA
%8 2007-01
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2007
%T Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
%A Alfredo Buttari
%A Jack Dongarra
%A Julien Langou
%A Julie Langou
%A Piotr Luszczek
%A Jakub Kurzak
%B International Journal of High Performance Computing Applications (to appear)
%8 2007-08
%G eng

%0 Journal Article
%J Parallel Computing (Special Edition: EuroPVM/MPI 2006)
%D 2007
%T MPI Collective Algorithm Selection and Quadtree Encoding
%A Jelena Pjesivac-Grbovic
%A George Bosilca
%A Graham Fagg
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B Parallel Computing (Special Edition: EuroPVM/MPI 2006)
%I Elsevier
%8 2007-00
%G eng

%0 Conference Proceedings
%B Journal of Physics: Conference Series, SciDAC 2007
%D 2007
%T Multithreading for synchronization tolerance in matrix factorization
%A Alfredo Buttari
%A Jack Dongarra
%A Parry Husbands
%A Jakub Kurzak
%A Katherine Yelick
%B Journal of Physics: Conference Series, SciDAC 2007
%V 78
%8 2007-01
%G eng

%0 Journal Article
%J In IEEE Annals of the History of Computing (to appear)
%D 2007
%T Netlib and NA-Net: building a scientific computing community
%A Jack Dongarra
%A Gene H. Golub
%A Cleve Moler
%A Keith Moore
%B In IEEE Annals of the History of Computing (to appear)
%8 2007-08
%G eng

%0 Book Section
%B Distributed and Parallel Systems
%D 2007
%T A New Approach to MPI Collective Communication Implementations
%A Torsten Hoefler
%A Jeffrey M. Squyres
%A Graham Fagg
%A George Bosilca
%A Wolfgang Rehm
%A Andrew Lumsdaine
%K Automatic Selection
%K Collective Operation
%K Framework
%K Message Passing (MPI)
%K Open MPI
%X Recent research into the optimization of collective MPI operations has resulted in a wide variety of algorithms and corresponding implementations, each typically only applicable in a relatively narrow scope: on a specific architecture, on a specific network, with a specific number of processes, with a specific data size and/or data-type – or any combination of these (or other) factors. This situation presents an enormous challenge to portable MPI implementations which are expected to provide optimized collective operation performance on all platforms. Many portable implementations have attempted to provide a token number of algorithms that are intended to realize good performance on most systems. However, many platform configurations are still left without well-tuned collective operations. This paper presents a proposal for a framework that will allow a wide variety of collective algorithm implementations and a flexible, multi-tiered selection process for choosing which implementation to use when an application invokes an MPI collective function.
%B Distributed and Parallel Systems
%I Springer US
%P 45-54
%@ 978-0-387-69857-1
%G eng
%R 10.1007/978-0-387-69858-8_5

%0 Generic
%D 2007
%T Numerical Metadata API Reference
%A Victor Eijkhout
%K salsa
%B Innovative Computing Laboratory Technical Report
%8 2007-02
%G eng

%0 Conference Proceedings
%B The International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
%D 2007
%T Optimal Routing in Binomial Graph Networks
%A Thara Angskun
%A George Bosilca
%A Brad Vander Zanden
%A Jack Dongarra
%K ftmpi
%B The International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
%I IEEE Computer Society
%C Adelaide, Australia
%8 2007-12
%G eng

%0 Generic
%D 2007
%T Parallel Tiled QR Factorization for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-598 (also LAPACK Working Note 190)
%8 2007-00
%G eng

%0 Journal Article
%J Cluster computing
%D 2007
%T Performance Analysis of MPI Collective Operations
%A Jelena Pjesivac-Grbovic
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Edgar Gabriel
%A Jack Dongarra
%K ftmpi
%B Cluster computing
%I Springer Netherlands
%V 10
%P 127-143
%8 2007-06
%G eng

%0 Generic
%D 2007
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Dept. Technical Report CS-89-85
%8 2007-00
%G eng

%0 Journal Article
%J SIAM SISC (to appear)
%D 2007
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A Julien Langou
%A Zizhong Chen
%A George Bosilca
%A Jack Dongarra
%B SIAM SISC (to appear)
%8 2007-05
%G eng

%0 Conference Proceedings
%B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
%D 2007
%T Reliability Analysis of Self-Healing Network using Discrete-Event Simulation
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Jelena Pjesivac-Grbovic
%A Jack Dongarra
%K ftmpi
%B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
%I IEEE Computer Society
%P 437-444
%8 2007-05
%G eng

%0 Journal Article
%J SciDAC Review
%D 2007
%T Remembering Ken Kennedy
%A Jack Dongarra
%A et al.
%B SciDAC Review
%V 5
%8 2007-00
%G eng

%0 Conference Proceedings
%B Journal of Physics: Conference Series, SciDAC 2007
%D 2007
%T Results of the PERI survey of SciDAC applications
%A Bronis R. de Supinski
%A Jeffrey K. Hollingsworth
%A Shirley Moore
%A Patrick H. Worley
%B Journal of Physics: Conference Series, SciDAC 2007
%V 78
%8 2007-01
%G eng

%0 Journal Article
%J Accepted for Euro PVM/MPI 2007
%D 2007
%T Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging
%A Aurelien Bouteiller
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B Accepted for Euro PVM/MPI 2007
%I Springer
%8 2007-09
%G eng

%0 Journal Article
%J International Journal of Foundations of Computer Science (IJFCS) (accepted)
%D 2007
%T Revisiting Matrix Product on Master-Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Zhiao Shi
%A Frederic Vivien
%B International Journal of Foundations of Computer Science (IJFCS) (accepted)
%8 2007-00
%G eng

%0 Conference Proceedings
%B Proceedings of the 2007 International Conference on Computational Science (ICCS 2007)
%D 2007
%T Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors
%A Karl Fürlinger
%A Michael Gerndt
%A Jack Dongarra
%E Yong Shi
%E Jack Dongarra
%E Geert Dick van Albada
%E Peter M. Sloot
%K kojak
%B Proceedings of the 2007 International Conference on Computational Science (ICCS 2007)
%I Springer LNCS
%C Beijing, China
%V 4487-4490
%P 815-822
%G eng
%R 10.1007/978-3-540-72586-2_115

%0 Generic
%D 2007
%T SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3
%A Alfredo Buttari
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%A George Bosilca
%K multi-core
%B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595
%8 2007-00
%G eng

%0 Conference Proceedings
%B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS
%D 2007
%T Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing
%A Zizhong Chen
%A Ming Yang
%A Guillermo Francia III
%A Jack Dongarra
%B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS
%P 1-8
%8 2007-03
%G eng

%0 Conference Proceedings
%B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007)
%D 2007
%T Self-Healing in Binomial Graph Networks
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007)
%C Vilamoura, Algarve, Portugal
%8 2007-11
%G eng

%0 Generic
%D 2007
%T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%K lapack
%B UT Computer Science Technical Report (Also LAPACK Working Note 184)
%8 2007-01
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2007
%T Specification and detection of performance problems with ASL
%A Michael Gerndt
%A Karl Fürlinger
%B Concurrency and Computation: Practice and Experience
%I John Wiley and Sons Ltd.
%V 19
%P 1451-1464
%8 2007-01
%G eng

%0 Journal Article
%J Journal of Computational Physics
%D 2007
%T The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot
%A Christof Voemel
%A Stanimire Tomov
%A Lin-Wang Wang
%A Osni Marques
%A Jack Dongarra
%B Journal of Computational Physics
%V 223
%P 774-782
%8 2007-00
%G eng

%0 Conference Proceedings
%B Proceedings of the 13th International Euro-Par Conference on Parallel Processing (Euro-Par '07)
%D 2007
%T On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications
%A Karl Fürlinger
%A Jack Dongarra
%A Michael Gerndt
%K kojak
%B Proceedings of the 13th International Euro-Par Conference on Parallel Processing (Euro-Par '07)
%I Springer LNCS
%C Rennes, France
%8 2007-01
%G eng

%0 Conference Proceedings
%B IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium
%D 2006
%T Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
%A Zizhong Chen
%A Jack Dongarra
%B IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium
%C Rhodes Island, Greece
%8 2006-01
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (submitted)
%D 2006
%T Application of Machine Learning to the Selection of Sparse Linear Solvers
%A Sanjukta Bhowmick
%A Victor Eijkhout
%A Yoav Freund
%A Erika Fuentes
%A David Keyes
%K salsa
%K sans
%B International Journal of High Performance Computing Applications (submitted)
%8 2006-00
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 2006
%T An Asynchronous Algorithm on NetSolve Global Computing System
%A Jack Dongarra
%A Nahid Emad
%A Seyed Abolfazl Shahzadeh-Fazeli
%K Arnoldi method
%K Explicit restarting
%K Global computing
%K Large eigenproblem
%K netsolve
%X The explicitly restarted Arnoldi method (ERAM) allows one to find a few eigenpairs of a large sparse matrix. The multiple explicitly restarted Arnoldi method (MERAM) is a technique based upon a multiple projection of ERAM and accelerates its convergence [N. Emad, S. Petiton, G. Edjlali, Multiple explicitly restarted Arnoldi method for solving large eigenproblems, SIAM J. Sci. Comput. SJSC 27 (1) (2005) 253-277]. MERAM allows one to update the restarting vector of an ERAM by taking into account the interesting eigen-information obtained by its other ERAM processes. This method is particularly well suited to GRID-type environments. We present an adaptation of the asynchronous version of MERAM for the NetSolve global computing system. We point out some advantages and limitations of this kind of system for implementing asynchronous hybrid algorithms. We give some results of our experiments and show that we can obtain a good acceleration of the convergence compared to ERAM. These results also show the potential of MERAM-like hybrid methods for GRID computing environments.
%B Future Generation Computer Systems
%V 22
%P 279-290
%8 2006-02
%G eng
%N 3
%R 10.1016/j.future.2005.10.003

%0 Generic
%D 2006
%T ATLAS on the BlueGene/L – Preliminary Results
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%K gco
%B ICL Technical Report
%8 2006-01
%G eng

%0 Journal Article
%J International Journal of Computational Science and Engineering
%D 2006
%T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Jack Dongarra
%A Andrew Canning
%A Lin-Wang Wang
%B International Journal of Computational Science and Engineering
%V 2
%P 205-212
%8 2006-00
%G eng

%0 Conference Proceedings
%B 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted)
%D 2006
%T Experiments with Strassen's Algorithm: From Sequential to Parallel
%A Fengguang Song
%A Jack Dongarra
%A Shirley Moore
%B 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted)
%C Dallas, Texas
%8 2006-01
%G eng

%0 Journal Article
%J University of Tennessee Computer Science Tech Report
%D 2006
%T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%K iter-ref
%B University of Tennessee Computer Science Tech Report
%8 2006-04
%G eng

%0 Journal Article
%J 2006 Euro PVM/MPI (submitted)
%D 2006
%T Flexible collective communication tuning architecture applied to Open MPI
%A Graham Fagg
%A Jelena Pjesivac-Grbovic
%A George Bosilca
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B 2006 Euro PVM/MPI (submitted)
%C Bonn, Germany
%8 2006-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2006
%T FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study
%A David Dewolfs
%A Jan Broeckhove
%A Vaidy Sunderam
%A Graham Fagg
%K ftmpi
%B Lecture Notes in Computer Science
%I Springer Berlin / Heidelberg
%V 4192
%P 133-140
%8 2006-00
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2006
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Piotr Luszczek
%K hpcc
%K lfc
%B International Journal of High Performance Computing Applications (to appear)
%8 2006-00
%G eng

%0 Journal Article
%J Euro PVM/MPI 2006
%D 2006
%T High Performance RDMA Protocols in HPC
%A Galen M. Shipman
%A George Bosilca
%A Arthur B. Maccabe
%B Euro PVM/MPI 2006
%C Bonn, Germany
%8 2006-09
%G eng

%0 Journal Article
%J HeteroPar 2006
%D 2006
%T A High-Performance, Heterogeneous MPI
%A Richard L. Graham
%A Galen M. Shipman
%A Brian Barrett
%A Ralph Castain
%A George Bosilca
%A Andrew Lumsdaine
%B HeteroPar 2006
%C Barcelona, Spain
%8 2006-09
%G eng

%0 Conference Proceedings
%B SC06 Conference Tutorial
%D 2006
%T The HPC Challenge (HPCC) Benchmark Suite
%A Piotr Luszczek
%A David Bailey
%A Jack Dongarra
%A Jeremy Kepner
%A Robert Lucas
%A Rolf Rabenseifner
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%B SC06 Conference Tutorial
%I IEEE
%C Tampa, Florida
%8 2006-11
%G eng

%0 Journal Article
%J PARA 2006
%D 2006
%T The Impact of Multicore on Math Software
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%K plasma
%B PARA 2006
%C Umea, Sweden
%8 2006-06
%G eng

%0 Journal Article
%J Euro PVM/MPI 2006
%D 2006
%T Implementation and Usage of the PERUSE-Interface in Open MPI
%A Rainer Keller
%A George Bosilca
%A Graham Fagg
%A Michael Resch
%A Jack Dongarra
%B Euro PVM/MPI 2006
%C Bonn, Germany
%8 2006-09
%G eng

%0 Journal Article
%J University of Tennessee Computer Science Tech Report
%D 2006
%T Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%K iter-ref
%B University of Tennessee Computer Science Tech Report
%8 2006-09
%G eng

%0 Journal Article
%J University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178
%D 2006
%T Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead
%A Jakub Kurzak
%A Jack Dongarra
%B University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178
%8 2006-01
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 2006
%T Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Server
%A Emmanuel Jeannot
%A Keith Seymour
%A Asim YarKhan
%A Jack Dongarra
%K netsolve
%B Parallel Processing Letters
%V 17
%P 47-59
%8 2006-03
%G eng

%0 Conference Proceedings
%B 8th Workshop 'Parallel Systems and Algorithms' (PASA), Lecture Notes in Informatics
%D 2006
%T Large Event Traces in Parallel Performance Analysis
%A Felix Wolf
%A Felix Freitag
%A Bernd Mohr
%A Shirley Moore
%A Brian Wylie
%K kojak
%B 8th Workshop 'Parallel Systems and Algorithms' (PASA), Lecture Notes in Informatics
%I Gesellschaft für Informatik
%C Frankfurt/Main, Germany
%8 2006-03
%G eng

%0 Generic
%D 2006
%T Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2006-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2006
%T MPI Collective Algorithm Selection and Quadtree Encoding
%A Jelena Pjesivac-Grbovic
%A Graham Fagg
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B Lecture Notes in Computer Science
%I Springer Berlin / Heidelberg
%V 4192
%P 40-48
%8 2006-09
%G eng

%0 Generic
%D 2006
%T MPI Collective Algorithm Selection and Quadtree Encoding
%A Jelena Pjesivac-Grbovic
%A Graham Fagg
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B ICL Technical Report
%8 2006-00
%G eng

%0 Conference Proceedings
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%D 2006
%T Performance evaluation of eigensolvers in nano-structure computations
%A Andrew Canning
%A Jack Dongarra
%A Julien Langou
%A Osni Marques
%A Stanimire Tomov
%A Christof Voemel
%A Lin-Wang Wang
%K doe-nano
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%8 2006-01
%G eng

%0 Conference Proceedings
%B Second International Workshop on OpenMP
%D 2006
%T Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications
%A Oscar Hernandez
%A Fengguang Song
%A Barbara Chapman
%A Jack Dongarra
%A Bernd Mohr
%A Shirley Moore
%A Felix Wolf
%K kojak
%B Second International Workshop on OpenMP
%C Reims, France
%8 2006-01
%G eng

%0 Generic
%D 2006
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report, UT-CS-04-526
%V –89-95
%8 2006-01
%G eng

%0 Journal Article
%J J. Phys.: Conf. Ser. 46
%D 2006
%T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures
%A Alex Zunger
%A Alberto Franceschetti
%A Gabriel Bester
%A Wesley B. Jones
%A Kwiseon Kim
%A Peter A. Graf
%A Lin-Wang Wang
%A Andrew Canning
%A Osni Marques
%A Christof Voemel
%A Jack Dongarra
%A Julien Langou
%A Stanimire Tomov
%K DOE_NANO
%B J. Phys.: Conf. Ser. 46
%V 46
%R 10.1088/1742-6596/46/1/040
%P 292-298
%8 2006-01
%G eng

%0 Conference Proceedings
%B Proceedings of IEEE CCGrid 2006
%D 2006
%T Proposal of MPI operation level Checkpoint/Rollback and one implementation
%A Yuan Tang
%A Graham Fagg
%A Jack Dongarra
%K HARNESS/FT-MPI
%B Proceedings of IEEE CCGrid 2006
%I IEEE Computer Society
%8 2006-01
%G eng

%0 Journal Article
%J PARA 2006
%D 2006
%T Prospectus for the Next LAPACK and ScaLAPACK Libraries
%A James Demmel
%A Jack Dongarra
%A B. Parlett
%A William Kahan
%A Ming Gu
%A David Bindel
%A Yozo Hida
%A Xiaoye Li
%A Osni Marques
%A Jason E. Riedy
%A Christof Voemel
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Julie Langou
%A Stanimire Tomov
%B PARA 2006
%C Umea, Sweden
%8 2006-06
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms)
%D 2006
%T Recent Developments in GridSolve
%A Asim YarKhan
%A Keith Seymour
%A Kiran Sagi
%A Zhiao Shi
%A Jack Dongarra
%E Yves Robert
%K netsolve
%B International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms)
%I Sage Science Press
%V 20
%8 2006-00
%G eng

%0 Journal Article
%J 2006 Euro PVM/MPI
%D 2006
%T Scalable Fault Tolerant Protocol for Parallel Runtime Environments
%A Thara Angskun
%A Graham Fagg
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%A Jack Dongarra
%K ftmpi
%B 2006 Euro PVM/MPI
%C Bonn, Germany
%8 2006-00
%G eng

%0 Journal Article
%J IBM Journal of Research and Development
%D 2006
%T Self Adapting Numerical Software SANS Effort
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Victor Eijkhout
%A Graham Fagg
%A Erika Fuentes
%A Julien Langou
%A Piotr Luszczek
%A Jelena Pjesivac-Grbovic
%A Keith Seymour
%A Haihang You
%A Sathish Vadhiyar
%K gco
%B IBM Journal of Research and Development
%V 50
%P 223-238
%8 2006-01
%G eng

%0 Conference Proceedings
%B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems
%D 2006
%T Self-Healing Network for Scalable Fault Tolerant Runtime Environments
%A Thara Angskun
%A Graham Fagg
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%A Jack Dongarra
%B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems
%C Innsbruck, Austria
%8 2006-01
%G eng

%0 Conference Proceedings
%B Proc. of the 5th International Workshop on Performance Modeling, Evaluation, and Organization of Parallel and Distributed Systems (PMEO-PDS 2006)
%D 2006
%T A Systematic Multi-step Methodology for Performance Analysis of Communication Traces of Distributed Applications based on Hierarchical Clustering
%A Gabriela Aguilera
%A Patricia J. Teller
%A Michela Taufer
%A Felix Wolf
%K kojak
%B Proc. of the 5th International Workshop on Performance Modeling, Evaluation, and Organization of Parallel and Distributed Systems (PMEO-PDS 2006)
%I IEEE Computer Society
%C Rhodes Island, Greece
%8 2006-04
%G eng

%0 Generic
%D 2006
%T Technical Comparison between several representative checkpoint/rollback solutions for MPI programs
%A Yuan Tang
%B ICL Technical Report
%8 2006-01
%G eng

%0 Conference Proceedings
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%D 2006
%T Towards bulk based preconditioning for quantum dot computations
%A Andrew Canning
%A Jack Dongarra
%A Julien Langou
%A Osni Marques
%A Stanimire Tomov
%A Christof Voemel
%A Lin-Wang Wang
%K doe-nano
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%8 2006-01
%G eng

%0 Generic
%D 2006
%T Twenty-Plus Years of Netlib and NA-Net
%A Jack Dongarra
%A Gene H. Golub
%A Eric Grosse
%A Cleve Moler
%A Keith Moore
%B University of Tennessee Computer Science Department Technical Report, UT-CS-04-526
%8 2006-00
%G eng

%0 Journal Article
%J Journal of Computational Physics (submitted)
%D 2006
%T The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot
%A Christof Voemel
%A Stanimire Tomov
%A Lin-Wang Wang
%A Osni Marques
%A Jack Dongarra
%K doe-nano
%B Journal of Computational Physics (submitted)
%8 2006-01
%G eng

%0 Generic
%D 2005
%T Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
%A Zizhong Chen
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report
%V –05-561
%8 2005-11
%G eng

%0 Conference Proceedings
%B Proceedings of Parallel Computing 2005 (ParCo)
%D 2005
%T Analysis and Optimization of Yee_Bench using Hardware Performance Counters
%A Ulf Andersson
%A Phil Mucci
%K papi
%X In this paper, we report on our analysis and optimization of a serial Fortran 90 benchmark called Yee bench. This benchmark has been run on a variety of architectures and its performance is reasonably well understood. However, on AMD Opteron based machines, we found unexpected dips in the delivered MFLOPS of the code for a seemingly random set of problem sizes. Through the use of the Opteron’s on-chip hardware performance counters and PapiEx, a PAPI-based tool, we discovered that these drops were directly related to high L1 cache miss rates for these problem sizes. The high miss rates could be attributed to the fact that in the two core regions of the code we have references to three dynamically allocated arrays which compete for the same set in the Opteron’s 2-way set associative cache. We validated this conclusion by accurately predicting those problem sizes that exhibit this problem. We were able to alleviate these performance anomalies using variable intra-array padding to effectively accomplish inter-array padding. We conclude with some comments on the general applicability of this method as well as how one might improve the implementation of the Fortran 90 ALLOCATE intrinsic to handle this case.
%B Proceedings of Parallel Computing 2005 (ParCo)
%C Malaga, Spain
%8 2005-01
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience, Special issue "Automatic Performance Analysis" (submitted)
%D 2005
%T Automatic analysis of inefficiency patterns in parallel applications
%A Felix Wolf
%A Bernd Mohr
%A Jack Dongarra
%A Shirley Moore
%K kojak
%B Concurrency and Computation: Practice and Experience, Special issue "Automatic Performance Analysis" (submitted)
%8 2005-00
%G eng

%0 Conference Proceedings
%B In Proceedings of the International Conference on Parallel Processing
%D 2005
%T Automatic Experimental Analysis of Communication Patterns in Virtual Topologies
%A Nikhil Bhatia
%A Fengguang Song
%A Felix Wolf
%A Jack Dongarra
%A Bernd Mohr
%A Shirley Moore
%K kojak
%B In Proceedings of the International Conference on Parallel Processing
%I IEEE Computer Society
%C Oslo, Norway
%8 2005-06
%G eng

%0 Journal Article
%J Future Generation Computing Systems
%D 2005
%T Biological Sequence Alignment on the Computational Grid Using the GrADS Framework
%A Asim YarKhan
%A Jack Dongarra
%K grads
%B Future Generation Computing Systems
%I Elsevier
%V 21
%P 980-986
%8 2005-06
%G eng

%0 Conference Proceedings
%B Proceedings of 5th International Conference on Computational Science (ICCS)
%D 2005
%T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%E V. S. Sunderam
%E Geert Dick van Albada
%E Peter M. Sloot
%E Jack Dongarra
%K doe-nano
%B Proceedings of 5th International Conference on Computational Science (ICCS)
%I Springer's Lecture Notes in Computer Science
%C Atlanta, GA, USA
%P 317-325
%8 2005-01
%G eng

%0 Journal Article
%J International Journal of Parallel Programming
%D 2005
%T The Component Structure of a Self-Adapting Numerical Software System
%A Victor Eijkhout
%A Erika Fuentes
%A Thomas Eidson
%A Jack Dongarra
%K salsa
%K sans
%B International Journal of Parallel Programming
%V 33
%8 2005-06
%G eng

%0 Journal Article
%J SIAM Journal on Matrix Analysis and Applications (to appear)
%D 2005
%T Condition Numbers of Gaussian Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B SIAM Journal on Matrix Analysis and Applications (to appear)
%8 2005-01
%G eng

%0 Generic
%D 2005
%T Condition Numbers of Gaussian Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ft-la
%B University of Tennessee Computer Science Department Technical Report
%V –04-539
%8 2005-00
%G eng

%0 Journal Article
%J International Journal of Computational Science and Engineering (to appear)
%D 2005
%T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%B International Journal of Computational Science and Engineering (to appear)
%8 2005-01
%G eng

%0 Conference Proceedings
%B Proceedings of DoD HPCMP UGC 2005 (to appear)
%D 2005
%T Dynamic Process Management for Pipelined Applications
%A David Cronk
%A Graham Fagg
%A Susan Emeny
%A Scott Tucker
%B Proceedings of DoD HPCMP UGC 2005 (to appear)
%I IEEE
%C Nashville, TN
%8 2005-01
%G eng

%0 Generic
%D 2005
%T An Effective Empirical Search Method for Automatic Software Tuning
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%K gco
%B ICL Technical Report
%8 2005-01
%G eng

%0 Conference Proceedings
%B In Proceedings of the European Conference on Parallel Computing (Euro-Par)
%D 2005
%T Event-based Measurement and Analysis of One-sided Communication
%A Marc-Andre Hermanns
%A Bernd Mohr
%A Felix Wolf
%K kojak
%B In Proceedings of the European Conference on Parallel Computing (Euro-Par)
%I Springer
%C Lisbon, Portugal
%8 2005-08
%G eng

%0 Conference Proceedings
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%D 2005
%T Fault Tolerant High Performance Computing by a Coding Approach
%A Zizhong Chen
%A Graham Fagg
%A Edgar Gabriel
%A Julien Langou
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%C Chicago, Illinois
%8 2005-01
%G eng

%0 Conference Proceedings
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%D 2005
%T Hash Functions for Datatype Signatures in MPI
%A George Bosilca
%A Jack Dongarra
%A Graham Fagg
%A Julien Langou
%E Beniamino Di Martino
%K ftmpi
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%I Springer-Verlag Berlin
%C Sorrento (Naples), Italy
%V 3666
%P 76-83
%8 2005-09
%G eng

%0 Journal Article
%J SC|05 Tutorial - S13
%D 2005
%T HPC Challenge v1.x Benchmark Suite
%A Piotr Luszczek
%A David Koester
%K hpcc
%B SC|05 Tutorial - S13
%C Seattle, Washington
%8 2005-01
%G eng

%0 Conference Proceedings
%B Second Workshop on Productivity and Performance in High-End Computing (P-PHEC) at 11th International Symposium on High Performance Computer Architecture (HPCA-2005)
%D 2005
%T Improving Time to Solution with Automated Performance Analysis
%A Shirley Moore
%A Felix Wolf
%A Jack Dongarra
%A Bernd Mohr
%K kojak
%B Second Workshop on Productivity and Performance in High-End Computing (P-PHEC) at 11th International Symposium on High Performance Computer Architecture (HPCA-2005)
%C San Francisco
%8 2005-02
%G eng

%0 Journal Article
%D 2005
%T Introduction to the HPC Challenge Benchmark Suite
%A Piotr Luszczek
%A Jack Dongarra
%A David Koester
%A Rolf Rabenseifner
%A Bob Lucas
%A Jeremy Kepner
%A John McCalpin
%A David Bailey
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%8 2005-03
%G eng

%0 Generic
%D 2005
%T Introduction to the HPCChallenge Benchmark Suite
%A Jack Dongarra
%A Piotr Luszczek
%K hpcc
%K hpcchallenge
%B ICL Technical Report
%8 2005-01
%G eng

%0 Journal Article
%D 2005
%T LAPACK 2005 Prospectus: Reliable and Scalable Software for Linear Algebra Computations on High End Computers
%A James Demmel
%A Jack Dongarra
%I LAPACK Working Note 164
%8 2005-01
%G eng

%0 Journal Article
%J Journal of Physics: Conference Series
%D 2005
%T NanoPSE: A Nanoscience Problem Solving Environment for Atomistic Electronic Structure of Semiconductor Nanostructures
%A Wesley B. Jones
%A Gabriel Bester
%A Andrew Canning
%A Alberto Franceschetti
%A Peter A. Graf
%A Kwiseon Kim
%A Julien Langou
%A Lin-Wang Wang
%A Jack Dongarra
%A Alex Zunger
%X Researchers at the National Renewable Energy Laboratory and their collaborators have developed over the past ~10 years a set of algorithms for an atomistic description of the electronic structure of nanostructures, based on plane-wave pseudopotentials and configuration interaction. The present contribution describes the first step in assembling these various codes into a single, portable, integrated set of software packages. This package is part of an ongoing research project in the development stage. Components of NanoPSE include codes for atomistic nanostructure generation and passivation, valence force field model for atomic relaxation, code for potential field generation, empirical pseudopotential method solver, strained linear combination of bulk bands method solver, configuration interaction solver for excited states, selection of linear algebra methods, and several inverse band structure solvers. Although not available for general distribution at this time as it is being developed and tested, the design goal of the NanoPSE software is to provide a software context for collaboration. The software package is enabled by fcdev, an integrated collection of best practice GNU software for open source development and distribution augmented to better support FORTRAN.
%B Journal of Physics: Conference Series
%P 277-282
%8 2005-06
%G eng
%U https://iopscience.iop.org/article/10.1088/1742-6596/16/1/038/meta
%N 16
%R https://doi.org/10.1088/1742-6596/16/1/038

%0 Journal Article
%J Grid Computing and New Frontiers of High Performance Processing
%D 2005
%T NetSolve: Grid Enabling Scientific Computing Environments
%A Keith Seymour
%A Asim YarKhan
%A Sudesh Agrawal
%A Jack Dongarra
%E Lucio Grandinetti
%K netsolve
%B Grid Computing and New Frontiers of High Performance Processing
%I Elsevier
%8 2005-00
%G eng

%0 Journal Article
%J International Journal of Parallel Programming
%D 2005
%T New Grid Scheduling and Rescheduling Methods in the GrADS Project
%A Francine Berman
%A Henri Casanova
%A Andrew Chien
%A Keith Cooper
%A Holly Dail
%A Anshuman Dasgupta
%A Wei Deng
%A Jack Dongarra
%A Lennart Johnsson
%A Ken Kennedy
%A Charles Koelbel
%A Bo Liu
%A Xu Liu
%A Anirban Mandal
%A Gabriel Marin
%A Mark Mazina
%A John Mellor-Crummey
%A Celso Mendes
%A A. Olugbile
%A Jignesh M. Patel
%A Dan Reed
%A Zhiao Shi
%A Otto Sievert
%A H. Xia
%A Asim YarKhan
%K grads
%B International Journal of Parallel Programming
%I Springer
%V 33
%P 209-229
%8 2005-06
%G eng

%0 Journal Article
%J NCSA Access Online
%D 2005
%T A Not So Simple Matter of Software
%A Jack Dongarra
%B NCSA Access Online
%I NCSA
%8 2005-00
%G eng

%0 Conference Proceedings
%B The International Conference on Computational Science
%D 2005
%T Numerically Stable Real Number Codes Based on Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B The International Conference on Computational Science
%I LNCS 3514, Springer-Verlag
%C Atlanta, GA
%8 2005-01
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (submitted)
%D 2005
%T Optimization Problem Solving System Using GridRPC
%A Hisashi Shimosaka
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems (submitted)
%8 2005-01
%G eng

%0 Journal Article
%J Journal of Computational Acoustics (to appear)
%D 2005
%T On the Parallel Solution of Large Industrial Wave Propagation Problems
%A Luc Giraud
%A Julien Langou
%A G. Sylvand
%B Journal of Computational Acoustics (to appear)
%8 2005-01
%G eng

%0 Conference Proceedings
%B Workshop on Patterns in High Performance Computing
%D 2005
%T A Pattern-Based Approach to Automated Application Performance Analysis
%A Nikhil Bhatia
%A Shirley Moore
%A Felix Wolf
%A Jack Dongarra
%A Bernd Mohr
%K kojak
%B Workshop on Patterns in High Performance Computing
%C University of Illinois at Urbana-Champaign
%8 2005-05
%G eng

%0 Conference Paper
%B European Conference on Parallel Processing (Euro-Par 2005)
%D 2005
%T PerfMiner: Cluster-Wide Collection, Storage and Presentation of Application Level Hardware Performance Data
%A Phil Mucci
%A Daniel Ahlin
%A Johan Danielsson
%A Per Ekman
%A Lars Malinowski
%K papi
%X We present PerfMiner, a system for the transparent collection, storage and presentation of thread-level hardware performance data across an entire cluster. Every sub-process/thread spawned by the user through the batch system is measured with near zero overhead and no dilation of run-time. Performance metrics are collected at the thread level using a tool built on top of the Performance Application Programming Interface (PAPI). As the hardware counters are virtualized by the OS, the resulting counts are largely unaffected by other kernel or user processes. PerfMiner correlates this performance data with metadata from the batch system and places it in a database. Through a command line and web interface, the user can make queries to the database to report information on everything from overall workload characterization and system utilization to the performance of a single thread in a specific application. This is in contrast to other monitoring systems that report aggregate system-wide metrics sampled over a period of time. In this paper, we describe our implementation of PerfMiner as well as present some results from the test deployment of PerfMiner across three different clusters at the Center for Parallel Computers at The Royal Institute of Technology in Stockholm, Sweden.
%B European Conference on Parallel Processing (Euro-Par 2005)
%I Springer
%C Monte de Caparica, Portugal
%8 2005-09
%G eng
%R https://doi.org/10.1007/11549468_1

%0 Conference Proceedings
%B In Proceedings of the 2005 SciDAC Conference
%D 2005
%T Performance Analysis of GYRO: A Tool Evaluation
%A Patrick H. Worley
%A Jeff Candy
%A Laura Carrington
%A Kevin Huck
%A Timothy Kaiser
%A Kumar Mahinthakumar
%A Allen D. Malony
%A Shirley Moore
%A Dan Reed
%A Philip C. Roth
%A H. Shan
%A Sameer Shende
%A Allan Snavely
%A S. Sreepathi
%A Felix Wolf
%A Y. Zhang
%K kojak
%B In Proceedings of the 2005 SciDAC Conference
%C San Francisco, CA
%8 2005-06
%G eng

%0 Conference Proceedings
%B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05)
%D 2005
%T Performance Analysis of MPI Collective Operations
%A Jelena Pjesivac-Grbovic
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Edgar Gabriel
%A Jack Dongarra
%K ftmpi
%B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05)
%C Denver, Colorado
%8 2005-04
%G eng

%0 Journal Article
%J Cluster Computing Journal (to appear)
%D 2005
%T Performance Analysis of MPI Collective Operations
%A Jelena Pjesivac-Grbovic
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Edgar Gabriel
%A Jack Dongarra
%K ftmpi
%B Cluster Computing Journal (to appear)
%8 2005-01
%G eng

%0 Conference Proceedings
%B Mini-Symposium "Tools Support for Parallel Programming", Proceedings of Parallel Computing (ParCo)
%D 2005
%T Performance Analysis of One-sided Communication Mechanisms
%A Bernd Mohr
%A Andrej Kühnal
%A Marc-Andre Hermanns
%A Felix Wolf
%K kojak
%B Mini-Symposium "Tools Support for Parallel Programming", Proceedings of Parallel Computing (ParCo)
%C Malaga, Spain
%8 2005-09
%G eng

%0 Conference Paper
%B Proceedings of DoD HPCMP UGC 2005
%D 2005
%T Performance Profiling and Analysis of DoD Applications using PAPI and TAU
%A Shirley Moore
%A David Cronk
%A Felix Wolf
%A Avi Purkayastha
%A Patricia J. Teller
%A Robert Araiza
%A Gabriela Aguilera
%A Jamie Nava
%K papi
%B Proceedings of DoD HPCMP UGC 2005
%I IEEE
%C Nashville, TN
%8 2005-06
%G eng

%0 Conference Proceedings
%B In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference
%D 2005
%T Performance Profiling Overhead Compensation for MPI Programs
%A Sameer Shende
%A Allen D. Malony
%A Alan Morris
%A Felix Wolf
%K kojak
%B In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference
%I Springer LNCS
%8 2005-09
%G eng

%0 Generic
%D 2005
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Julien Langou
%K ft-la
%B University of Tennessee Computer Science Department Technical Report, UT-CS-04-538
%8 2005-00
%G eng

%0 Generic
%D 2005
%T Remote Software Toolkit Installer
%A Eric Meek
%A Jeff Larkin
%A Jack Dongarra
%K rest
%B ICL Technical Report
%8 2005-06
%G eng

%0 Journal Article
%J Numerische Mathematik
%D 2005
%T Rounding Error Analysis of the Classical Gram-Schmidt Orthogonalization Process
%A Luc Giraud
%A Julien Langou
%A Miroslav Rozložník
%A Jasper van den Eshof
%B Numerische Mathematik
%V 101
%P 87-100
%8 2005-01
%G eng

%0 Conference Proceedings
%B In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference
%D 2005
%T A Scalable Approach to MPI Application Performance Analysis
%A Shirley Moore
%A Felix Wolf
%A Jack Dongarra
%A Sameer Shende
%A Allen D. Malony
%A Bernd Mohr
%K kojak
%B In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference
%I Springer LNCS
%8 2005-09
%G eng

%0 Conference Proceedings
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%D 2005
%T Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
%A Graham Fagg
%A Thara Angskun
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%A Jack Dongarra
%E Beniamino Di Martino
%K ftmpi
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%I Springer-Verlag Berlin
%C Sorrento (Naples), Italy
%V 3666
%P 67
%8 2005-09
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance
%D 2005
%T Self Adaptivity in Grid Computing
%A Sathish Vadhiyar
%A Jack Dongarra
%E John Gurd
%E Anthony Hey
%E Juri Papay
%E Graham Riley
%K netsolve
%K sans
%B Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance
%V 17
%P 235-257
%8 2005-00
%G eng

%0 Generic
%D 2005
%T Towards an Accurate Model for Collective Communications
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%B ICL Technical Report
%8 2005-01
%G eng

%0 Conference Proceedings
%B In Proc. of the International Conference on High Performance Computing and Communications (HPCC)
%D 2005
%T Trace-Based Parallel Performance Overhead Compensation
%A Felix Wolf
%A Allen D. Malony
%A Sameer Shende
%A Alan Morris
%K kojak
%B In Proc. of the International Conference on High Performance Computing and Communications (HPCC)
%C Sorrento (Naples), Italy
%8 2005-09
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2004)
%D 2004
%T Accurate Cache and TLB Characterization Using Hardware Counters
%A Jack Dongarra
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A Haihang You
%K gco
%K lacsi
%K papi
%X We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity. Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions.
%B International Conference on Computational Science (ICCS 2004)
%I Springer
%C Krakow, Poland
%8 2004-06
%G eng
%R https://doi.org/10.1007/978-3-540-24688-6_57

%0 Conference Proceedings
%B 4th International Symposium on Cluster Computing and the Grid (CCGrid 2004) (submitted)
%D 2004
%T Active Logistical State Management in the GridSolve/L
%A Micah Beck
%A Jack Dongarra
%A Jian Huang
%A Terry Moore
%A James Plank
%K netsolve
%B 4th International Symposium on Cluster Computing and the Grid (CCGrid 2004) (submitted)
%C Chicago, Illinois
%8 2004-01
%G eng

%0 Conference Proceedings
%B 2004 International Conference on Parallel Processing (ICPP-04)
%D 2004
%T An Algebra for Cross-Experiment Performance Analysis
%A Fengguang Song
%A Felix Wolf
%A Nikhil Bhatia
%A Jack Dongarra
%A Shirley Moore
%K kojak
%B 2004 International Conference on Parallel Processing (ICPP-04)
%C Montreal, Quebec, Canada
%8 2004-08
%G eng

%0 Generic
%D 2004
%T An Asynchronous Algorithm on NetSolve Global Computing System
%A Nahid Emad
%A S. A. Shahzadeh Fazeli
%A Jack Dongarra
%K netsolve
%B PRiSM - Laboratoire de recherche en informatique, Université de Versailles St-Quentin Technical Report
%8 2004-03
%G eng

%0 Conference Paper
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004)
%D 2004
%T Automatic Blocking of QR and LU Factorizations for Locality
%A Qing Yi
%A Ken Kennedy
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%K gco
%K papi
%K sans
%X QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004)
%I ACM
%C Washington, DC
%8 2004-06
%G eng
%R 10.1145/1065895.1065898

%0 Conference Paper
%B 5th LCI International Conference on Linux Clusters: The HPC Revolution
%D 2004
%T Automating the Large-Scale Collection and Analysis of Performance
%A Phil Mucci
%A Jack Dongarra
%A Rick Kufrin
%A Shirley Moore
%A Fengguang Song
%A Felix Wolf
%K kojak
%K papi
%B 5th LCI International Conference on Linux Clusters: The HPC Revolution
%C Austin, Texas
%8 2004-05
%G eng

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing (to appear)
%D 2004
%T Building and using a Fault Tolerant MPI implementation
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%K lacsi
%K sans
%B International Journal of High Performance Applications and Supercomputing (to appear)
%8 2004-00
%G eng

%0 Journal Article
%J Oak Ridge National Laboratory Report
%D 2004
%T Cray X1 Evaluation Status Report
%A Pratul Agarwal
%A R. A. Alexander
%A E. Apra
%A Satish Balay
%A Arthur S. Bland
%A James Colgan
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Tom Dunigan
%A Mark Fahey
%A Al Geist
%A M. Gordon
%A Robert Harrison
%A Dinesh Kaushik
%A M. Krishnakumar
%A Piotr Luszczek
%A Tony Mezzacappa
%A Jeff Nichols
%A Jarek Nieplocha
%A Leonid Oliker
%A T. Packwood
%A M. Pindzola
%A Thomas C. Schulthess
%A Jeffrey Vetter
%A James B White
%A T. Windus
%A Patrick H. Worley
%A Thomas Zacharia
%B Oak Ridge National Laboratory Report
%V /-2004/13
%8 2004-01
%G eng

%0 Generic
%D 2004
%T CUBE User Manual
%A Fengguang Song
%A Felix Wolf
%K kojak
%B ICL Technical Report
%8 2004-02
%G eng

%0 Conference Proceedings
%B International Conference on Computational Science
%D 2004
%T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations
%A Piotr Luszczek
%A Jack Dongarra
%E Marian Bubak
%E Geert Dick van Albada
%E Peter M. Sloot
%E Jack Dongarra
%K lacsi
%K lfc
%B International Conference on Computational Science
%I Springer Verlag
%C Poland
%8 2004-06
%G eng
%R 10.1007/978-3-540-25944-2_35

%0 Generic
%D 2004
%T EARL - API Documentation
%A Felix Wolf
%K kojak
%B ICL Technical Report
%8 2004-10
%G eng

%0 Conference Proceedings
%B Proceedings of Euro-Par 2004
%D 2004
%T Efficient Pattern Search in Large Traces through Successive Refinement
%A Felix Wolf
%A Bernd Mohr
%A Jack Dongarra
%A Shirley Moore
%K kojak
%B Proceedings of Euro-Par 2004
%I Springer-Verlag
%C Pisa, Italy
%8 2004-08
%G eng

%0 Conference Proceedings
%B Proceedings of ISC2004 (to appear)
%D 2004
%T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems
%A Graham Fagg
%A Edgar Gabriel
%A George Bosilca
%A Thara Angskun
%A Zizhong Chen
%A Jelena Pjesivac-Grbovic
%A Kevin London
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Proceedings of ISC2004 (to appear)
%C Heidelberg, Germany
%8 2004-06
%G eng

%0 Conference Proceedings
%B IPDPS 2004, NGS Workshop (to appear)
%D 2004
%T Improvements in the Efficient Composition of Applications
%A Thomas Eidson
%A Victor Eijkhout
%A Jack Dongarra
%K salsa
%K sans
%B IPDPS 2004, NGS Workshop (to appear)
%C Santa Fe
%8 2004-00
%G eng

%0 Conference Proceedings
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04)
%D 2004
%T LAPACK for Clusters Project: An Example of Self Adapting Numerical Software
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04)
%C Big Island, Hawaii
%V 9
%P 90282
%8 2004-01
%G eng

%0 Conference Proceedings
%B 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (submitted)
%D 2004
%T Memory Bandwidth and the Performance of Scientific Applications: A Study of the AMD Opteron Processor
%A Phil Mucci
%B 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (submitted)
%8 2004-01
%G eng

%0 Generic
%D 2004
%T NetBuild: Automated Installation and Use of Network-Accessible Software Libraries
%A Keith Moore
%A Jack Dongarra
%A Shirley Moore
%A Eric Grosse
%K netbuild
%B ICL Technical Report
%8 2004-01
%G eng

%0 Generic
%D 2004
%T Numerically Stable Real-Number Codes Based on Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ftmpi
%B University of Tennessee Computer Science Department Technical Report
%V –04-526
%8 2004-10
%G eng

%0 Journal Article
%J Engineering the Grid (to appear)
%D 2004
%T An Overview of Heterogeneous High Performance and Grid Computing
%A Jack Dongarra
%A Alexey Lastovetsky
%E Beniamino Di Martino
%E Jack Dongarra
%E Adolfy Hoisie
%E Laurence Yang
%E Hans Zima
%B Engineering the Grid (to appear)
%I Nova Science Publishers, Inc.
%8 2004-00
%G eng

%0 Generic
%D 2004
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report, CS-89-85
%8 2004-01
%G eng

%0 Generic
%D 2004
%T Performance Optimization and Modeling of Blocked Sparse Kernels
%A Alfredo Buttari
%A Victor Eijkhout
%A Julien Langou
%A Salvatore Filippone
%K sans
%B ICL Technical Report
%8 2004-00
%G eng

%0 Journal Article
%J International Journal for High Performance Applications and Supercomputing (to appear)
%D 2004
%T Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing
%A Graham Fagg
%A Edgar Gabriel
%A Zizhong Chen
%A Thara Angskun
%A George Bosilca
%A Jelena Pjesivac-Grbovic
%A Jack Dongarra
%K ftmpi
%K lacsi
%B International Journal for High Performance Applications and Supercomputing (to appear)
%8 2004-04
%G eng

%0 Journal Article
%J RFC 3834
%D 2004
%T Recommendations for Automatic Responses to Electronic Mail
%A Keith Moore
%B RFC 3834
%I Internet Engineering Task Force (IETF)
%8 2004-01
%G eng

%0 Generic
%D 2004
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Julien Langou
%B ICL Technical Report
%8 2004-01
%G eng

%0 Conference Proceedings
%B IEEE Proceedings (to appear)
%D 2004
%T Self Adapting Linear Algebra Algorithms and Software
%A James Demmel
%A Jack Dongarra
%A Victor Eijkhout
%A Erika Fuentes
%A Antoine Petitet
%A Rich Vuduc
%A Clint Whaley
%A Katherine Yelick
%K salsa
%K sans
%B IEEE Proceedings (to appear)
%8 2004-00
%G eng

%0 Journal Article
%J International Journal of High Performance Applications, Special Issue: Automatic Performance Tuning
%D 2004
%T Towards an Accurate Model for Collective Communications
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%K lacsi
%B International Journal of High Performance Applications, Special Issue: Automatic Performance Tuning
%V 18
%P 159-167
%8 2004-01
%G eng

%0 Journal Article
%J The Computer Journal
%D 2004
%T Trends in High Performance Computing
%A Jack Dongarra
%B The Computer Journal
%I The British Computer Society
%V 47
%P 399-403
%8 2004-00
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2004
%T The Virtual Instrument: Support for Grid-enabled Scientific Simulations
%A Henri Casanova
%A Thomas Bartol
%A Francine Berman
%A Adam Birnbaum
%A Jack Dongarra
%A Mark Ellisman
%A Marcio Faerman
%A Erhan Gockay
%A Michelle Miller
%A Graziano Obertelli
%A Stuart Pomerantz
%A Terry Sejnowski
%A Joel Stiles
%A Rich Wolski
%B International Journal of High Performance Computing Applications
%V 18
%P 3-17
%8 2004-01
%G eng

%0 Conference Proceedings
%B IPDPS 2003, Workshop on NSF-Next Generation Software
%D 2003
%T Applying Aspect-Oriented Programming Concepts to a Component-based Programming Model
%A Thomas Eidson
%A Jack Dongarra
%A Victor Eijkhout
%K salsa
%K sans
%B IPDPS 2003, Workshop on NSF-Next Generation Software
%C Nice, France
%8 2003-03
%G eng

%0 Journal Article
%J Journal of Systems Architecture, Special Issue 'Evolutions in parallel distributed and network-based processing'
%D 2003
%T Automatic performance analysis of hybrid MPI/OpenMP applications
%A Felix Wolf
%A Bernd Mohr
%E Andrea Clematis
%E Daniele D'Agostino
%K kojak
%B Journal of Systems Architecture, Special Issue 'Evolutions in parallel distributed and network-based processing'
%I Elsevier
%V 49(10-11)
%P 421-439
%8 2003-11
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2003
%T Automatic Translation of Fortran to JVM Bytecode
%A Keith Seymour
%A Jack Dongarra
%K f2j
%B Concurrency and Computation: Practice and Experience
%V 15
%P 202-207
%8 2003-00
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Computational Science — ICCS 2003
%A Peter M. Sloot
%A David Abramson
%A Alexander V. Bogdanov
%A Jack Dongarra
%A Albert Zomaya
%A Yuriy Gorbachev
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%C ICCS 2003, International Conference. Melbourne, Australia
%V 2657-2660
%8 2003-06
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Distributed Probabilistic Model-Building Genetic Algorithm
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Masaki Sano
%A Hisashi Shimosaka
%A Shigeyoshi Tsutsui
%A Jack Dongarra
%B Lecture Notes in Computer Science
%I Springer-Verlag, Heidelberg
%V 2723
%P 1015-1028
%8 2003-01
%G eng

%0 Journal Article
%J ICL Tech Report
%D 2003
%T Distributed Storage in RIB
%A Thomas B. Boehmann
%K rib
%B ICL Tech Report
%8 2003-03
%G eng

%0 Journal Article
%J Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted)
%D 2003
%T Energy Minimization of Protein Tertiary Structure by Parallel Simulated Annealing using Genetic Crossover
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Shinya Ogura
%A Keiko Aoi
%A Takeshi Yoshida
%A Yuko Okamoto
%A Jack Dongarra
%B Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted)
%8 2003-03
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting
%D 2003
%T Evaluating The Performance Of MPI-2 Dynamic Communicators And One-Sided Communication
%A Edgar Gabriel
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting
%I Springer-Verlag, Berlin
%C Venice, Italy
%V 2840
%P 88-97
%8 2003-09
%G eng

%0 Conference Paper
%B PADTAD Workshop, IPDPS 2003
%D 2003
%T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%A Haihang You
%A Min Zhou
%K lacsi
%K papi
%X The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
%B PADTAD Workshop, IPDPS 2003
%I IEEE
%C Nice, France
%8 2003-04
%@ 0-7695-1926-1
%G eng

%0 Conference Proceedings
%B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented)
%D 2003
%T Fault Tolerant Communication Library and Applications for High Performance Computing
%A Graham Fagg
%A Edgar Gabriel
%A Zizhong Chen
%A Thara Angskun
%A George Bosilca
%A Antonin Bukovsky
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented)
%C Santa Fe, NM
%8 2003-10
%G eng

%0 Conference Proceedings
%B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science
%D 2003
%T A Fault-Tolerant Communication Library for Grid Environments
%A Edgar Gabriel
%A Graham Fagg
%A Antonin Bukovsky
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%K lacsi
%B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science
%C San Francisco
%8 2003-06
%G eng

%0 Generic
%D 2003
%T Finite-choice Algorithm Optimization in Conjugate Gradients (LAPACK Working Note 159)
%A Jack Dongarra
%A Victor Eijkhout
%B University of Tennessee Computer Science Technical Report, UT-CS-03-502
%8 2003-01
%G eng

%0 Journal Article
%J National Research Council
%D 2003
%T The Future of Supercomputing: An Interim Report
%B National Research Council
%I The National Academies Press
%C Washington, D.C.
%8 2003-01
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing (submitted)
%D 2003
%T GrADSolve - A Grid-based RPC System for Remote Invocation of Parallel Software
%A Sathish Vadhiyar
%A Jack Dongarra
%K grads
%B Journal of Parallel and Distributed Computing (submitted)
%8 2003-03
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference
%D 2003
%T GrADSolve - RPC for High Performance Computing on the Grid
%A Sathish Vadhiyar
%A Jack Dongarra
%A Asim YarKhan
%E Harald Kosch
%E Laszlo Boszormenyi
%E Hermann Hellwagner
%K netsolve
%B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference
%I Springer-Verlag, Berlin
%C Klagenfurt, Austria
%V 2790
%P 394-403
%8 2003-01
%G eng
%R 10.1007/978-3-540-45209-6_58

%0 Journal Article
%J Advances in Parallel Computing
%D 2003
%T Hardware-Counter Based Automatic Performance Analysis of Parallel Programs
%A Felix Wolf
%A Bernd Mohr
%K kojak
%K papi
%X The KOJAK performance-analysis environment identifies a large number of performance problems on parallel computers with SMP nodes. The current version concentrates on parallelism-related performance problems that arise from an inefficient usage of the parallel programming interfaces MPI and OpenMP, while ignoring individual CPU performance. This chapter describes an extended design of KOJAK capable of diagnosing low individual-CPU performance based on hardware-counter information and of integrating the results with those of the parallelism-centered analysis. The performance of parallel applications is determined by a variety of different factors. Performance of single components frequently influences the overall behavior in unexpected ways. Application programmers on current parallel machines have to deal with numerous performance-critical aspects: different modes of parallel execution, such as message passing, multi-threading or even a combination of the two, and performance on individual CPU that is determined by the interaction of different functional units. The KOJAK analysis process is composed of two parts: a semi-automatic instrumentation of the user application followed by an automatic analysis of the generated performance data. KOJAK's instrumentation software runs on most major UNIX platforms and works on multiple levels, including source-code, compiler, and linker.
%B Advances in Parallel Computing
%I Elsevier
%C Dresden, Germany
%V 13
%P 753-760
%8 2004-01
%G eng
%R https://doi.org/10.1016/S0927-5452(04)80092-3

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T High Performance Computing for Computational Science
%A Jose Palma
%A Jack Dongarra
%A Vicente Hernández
%E Antonio Augusto Sousa
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%C VECPAR 2002, 5th International Conference June 26-28, 2002
%V 2565
%8 2003-01
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC
%D 2003
%T High Performance Computing Trends and Self Adapting Numerical Software
%A Jack Dongarra
%B Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC
%I Springer-Verlag, Heidelberg
%C Tokyo-Odaiba, Japan
%V 2858
%P 1-9
%8 2003-01
%G eng

%0 Conference Proceedings
%B Information Processing Society of Japan Symposium Series
%D 2003
%T High Performance Computing Trends, Supercomputers, Clusters, and Grids
%A Jack Dongarra
%B Information Processing Society of Japan Symposium Series
%V 2003
%P 55-58
%8 2003-01
%G eng

%0 Conference Proceedings
%B Proc. of the European Conference on Parallel Computing (EuroPar)
%D 2003
%T KOJAK - A Tool Set for Automatic Performance Analysis of Parallel Applications
%A Bernd Mohr
%A Felix Wolf
%K kojak
%B Proc. of the European Conference on Parallel Computing (EuroPar)
%I Springer-Verlag
%C Klagenfurt, Austria
%V 2790
%P 1301-1304
%8 2003-08
%G eng

%0 Journal Article
%J Making the Global Infrastructure a Reality
%D 2003
%T NetSolve: Past, Present, and Future - A Look at a Grid Enabled Server
%A Sudesh Agrawal
%A Jack Dongarra
%A Keith Seymour
%A Sathish Vadhiyar
%E Francine Berman
%E Geoffrey Fox
%E Anthony Hey
%K netsolve
%B Making the Global Infrastructure a Reality
%I Wiley Publishing
%8 2003-00
%G eng

%0 Conference Proceedings
%B Information Processing Society of Japan Symposium Series
%D 2003
%T Optimization of Injection Schedule of Diesel Engine Using GridRPC
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Junji Sawada
%A Jack Dongarra
%B Information Processing Society of Japan Symposium Series
%V 2003
%P 189-197
%8 2003-01
%G eng

%0 Conference Proceedings
%B 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid
%D 2003
%T Optimization Problem Solving System using Grid RPC
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Hisashi Shimosaka
%A Jack Dongarra
%B 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid
%C Tokyo, Japan
%8 2003-03
%G eng

%0 Conference Proceedings
%B Proceedings of the IPDPS 2003, NGS Workshop
%D 2003
%T Optimizing Performance and Reliability in Distributed Computing Systems Through Wide Spectrum Storage
%A James Plank
%A Micah Beck
%A Jack Dongarra
%A Rich Wolski
%A Henri Casanova
%B Proceedings of the IPDPS 2003, NGS Workshop
%C Nice, France
%P 209
%8 2003-01
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2003
%T A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures
%A Greg Henry
%A David Watkins
%A Jack Dongarra
%B SIAM Journal on Scientific Computing
%V 24
%P 284-311
%8 2003-01
%G eng

%0 Conference Paper
%B ICCS 2003 Terascale Workshop
%D 2003
%T Performance Instrumentation and Measurement for Terascale Systems
%A Jack Dongarra
%A Allen D. Malony
%A Shirley Moore
%A Phil Mucci
%A Sameer Shende
%K papi
%X As computer systems grow in size and complexity, tool support is needed to facilitate the efficient mapping of large-scale applications onto these systems. To help achieve this mapping, performance analysis tools must provide robust performance observation capabilities at all levels of the system, as well as map low-level behavior to high-level program constructs. Instrumentation and measurement strategies, developed over the last several years, must evolve together with performance analysis infrastructure to address the challenges of new scalable parallel systems.
%B ICCS 2003 Terascale Workshop
%I Springer, Berlin, Heidelberg
%C Melbourne, Australia
%8 2003-06
%G eng
%R https://doi.org/10.1007/3-540-44864-0_6

%0 Conference Proceedings
%B Proceedings of the 3rd International Symposium on Cluster Computing and the Grid
%D 2003
%T A Performance Oriented Migration Framework for the Grid
%A Sathish Vadhiyar
%K grads
%B Proceedings of the 3rd International Symposium on Cluster Computing and the Grid
%C Tokyo, Japan
%P 130-137
%8 2003-05
%G eng

%0 Generic
%D 2003
%T A Proposed Standard for Matrix Metadata
%A Victor Eijkhout
%A Erika Fuentes
%K salsa
%K sans
%B Innovative Computing Laboratory Technical Report
%C Submitted to ACM TOMS
%8 2003-11
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Recent Advances in Parallel Virtual Machine and Message Passing Interface
%A Jack Dongarra
%A Domenico Laforenza
%A S. Orlando
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%V 2840
%8 2003-01
%G eng

%0 Conference Proceedings
%B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles
%D 2003
%T Scalable, Trustworthy Network Computing Using Untrusted Intermediaries: A Position Paper
%A Micah Beck
%A Jack Dongarra
%A Victor Eijkhout
%A Mike Langston
%A Terry Moore
%A James Plank
%K netsolve
%B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles
%C National Conference Center - Lansdowne, Virginia
%8 2003-03
%G eng

%0 Journal Article
%J Resource Management in the Grid
%D 2003
%T Scheduling in the Grid Application Development Software Project
%A Holly Dail
%A Otto Sievert
%A Francine Berman
%A Henri Casanova
%A Asim YarKhan
%A Sathish Vadhiyar
%A Jack Dongarra
%A Chuang Liu
%A Lingyun Yang
%A Dave Angulo
%A Ian Foster
%K grads
%B Resource Management in the Grid
%I Kluwer Publishers
%8 2003-03
%G eng

%0 Journal Article
%J Concurrency: Practice and Experience (submitted)
%D 2003
%T Self Adaptability in Grid Computing
%A Sathish Vadhiyar
%A Jack Dongarra
%K sans
%B Concurrency: Practice and Experience (submitted)
%8 2003-03
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2003
%T Self Adapting Numerical Algorithm for Next Generation Applications
%A Jack Dongarra
%A Victor Eijkhout
%K lacsi
%K sans
%B International Journal of High Performance Computing Applications
%V 17
%P 125-132
%8 2003-01
%G eng

%0 Generic
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160)
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%B University of Tennessee Computer Science Technical Report, UT-CS-03-499
%8 2003-01
%G eng

%0 Journal Article
%J Parallel Computing
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%K sans
%B Parallel Computing
%V 29
%P 1723-1743
%8 2003-11
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Self-Adapting Numerical Software and Automatic Tuning of Heuristics
%A Jack Dongarra
%A Victor Eijkhout
%K salsa
%K sans
%B Lecture Notes in Computer Science
%I Springer Verlag
%C Melbourne, Australia
%V 2660
%P 759-770
%8 2003-06
%G eng

%0 Journal Article
%J Statistical Data Mining and Knowledge Discovery
%D 2003
%T The Semantic Conference Organizer
%A Kevin Heinrich
%A Michael Berry
%A Jack Dongarra
%A Sathish Vadhiyar
%E Hamparsum Bozdogan
%K netsolve
%B Statistical Data Mining and Knowledge Discovery
%I CRC Press
%8 2003-00
%G eng

%0 Conference Proceedings
%B ClusterWorld Conference and Expo
%D 2003
%T A Simple Installation and Administration Tool for Large-scaled PC Cluster System
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Kenzo Kodama
%A Junichi Uekawa
%A Jack Dongarra
%B ClusterWorld Conference and Expo
%C San Jose, CA
%8 2003-03
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 2003
%T SRS - A Framework for Developing Malleable and Migratable Parallel Software
%A Sathish Vadhiyar
%A Jack Dongarra
%K grads
%B Parallel Processing Letters
%V 13
%P 291-312
%8 2003-06
%G eng

%0 Conference Proceedings
%B Information Processing Society of Japan Symposium Series
%D 2003
%T Static Scheduling for ScaLAPACK on the Grid Using Genetic Algorithm
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Hiroki Saito
%A Yusuke Tanimura
%A Jack Dongarra
%B Information Processing Society of Japan Symposium Series
%V 2003
%P 3-10
%8 2003-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T VisPerf: Monitoring Tool for Grid Computing
%A DongWoo Lee
%A Jack Dongarra
%E R. S. Ramakrishna
%K netsolve
%B Lecture Notes in Computer Science
%I Springer Verlag, Heidelberg
%V 2659
%P 233-243
%8 2003-00
%G eng

%0 Journal Article
%J Journal of Digital Information special issue on Interactivity in Digital Libraries
%D 2002
%T Active Netlib: An Active Mathematical Software Collection for Inquiry-based Computational Science and Engineering Education
%A Shirley Moore
%A A.J. Baker
%A Jack Dongarra
%A Christian Halloy
%A Chung Ng
%K activenetlib
%K rib
%B Journal of Digital Information special issue on Interactivity in Digital Libraries
%V 2
%8 2002-00
%G eng

%0 Journal Article
%J International Journal of Supercomputer Applications and High-Performance Computing
%D 2002
%T Adaptive Scheduling for Task Farming with Grid Middleware
%A Henri Casanova
%A Myung Ho Kim
%A James Plank
%A Jack Dongarra
%B International Journal of Supercomputer Applications and High-Performance Computing
%V 13
%P 231-240
%8 2002-10
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Computing
%D 2002
%T Algorithmic Redistribution Methods for Block Cyclic Decompositions
%A Antoine Petitet
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Computing
%V 10
%P 201-220
%8 2002-10
%G eng

%0 Journal Article
%J EuroPar 2002
%D 2002
%T Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load
%A Javier Cuenca
%A Domingo Giménez
%A José González
%A Jack Dongarra
%A Kenneth Roche
%B EuroPar 2002
%C Paderborn, Germany
%8 2002-08
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications: Special Issue - Part I & II
%D 2002
%T Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard
%B International Journal of High Performance Computing Applications: Special Issue - Part I & II
%V 16
%P 1-199
%8 2002-01
%G eng

%0 Journal Article
%J SIAM News
%D 2002
%T Biannual Top-500 Computer Lists Track Changing Environments for Scientific Computing
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%K top500
%B SIAM News
%V 34
%8 2002-10
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2002)
%D 2002
%T A Comparison of Counting and Sampling Modes of Using Performance Monitoring Hardware
%A Shirley Moore
%K papi
%X Performance monitoring hardware is available on most modern microprocessors in the form of hardware counters and other registers that record data about processor events. This hardware may be used in counting mode, in which aggregate events counts are accumulated, and/or in sampling mode, in which time-based or event-based sampling is used to collect profiling data. This paper discusses uses of these two modes and considers the issues of efficiency and accuracy raised by each. Implications for the PAPI cross-platform hardware counter interface are also discussed.
%B International Conference on Computational Science (ICCS 2002)
%I Springer
%C Amsterdam, Netherlands
%8 2002-04
%G eng
%R https://doi.org/10.1007/3-540-46080-2_95

%0 Journal Article
%J Parallel and Distributed Computing Practices
%D 2002
%T A Comparison of Parallel Solvers for General Narrow Banded Linear Systems
%A Peter Arbenz
%A Andrew Cleary
%A Jack Dongarra
%A Markus Hegland
%B Parallel and Distributed Computing Practices
%V 2
%P 385-400
%8 2002-10
%G eng

%0 Conference Proceedings
%B Parallel Computing: Advances and Current Issues: Proceedings of the International Conference ParCo2001
%D 2002
%T Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion
%A Kenneth Roche
%A Jack Dongarra
%E Gerhard R. Joubert
%E Almerico Murli
%E Frans Peters
%E Marco Vanneschi
%K lfc
%K sans
%B Parallel Computing: Advances and Current Issues: Proceedings of the International Conference ParCo2001
%I Imperial College Press
%C London, England
%8 2002-01
%G eng

%0 Generic
%D 2002
%T Development of the PICMSS NetSolve Service
%A Matthew Kelleher Jr.
%K netsolve
%B ICL Technical Report
%8 2002-04
%G eng

%0 Conference Proceedings
%B Grid Computing - GRID 2002, Third International Workshop
%D 2002
%T Experiments with Scheduling Using Simulated Annealing in a Grid Environment
%A Asim YarKhan
%A Jack Dongarra
%E Manish Parashar
%K grads
%B Grid Computing - GRID 2002, Third International Workshop
%I Springer
%C Baltimore, MD
%V 2536
%P 232-242
%8 2002-11
%G eng

%0 Generic
%D 2002
%T GridRPC: A Remote Procedure Call API for Grid Computing
%A Keith Seymour
%A Hidemoto Nakada
%A Satoshi Matsuoka
%A Jack Dongarra
%A Craig Lee
%A Henri Casanova
%B ICL Technical Report
%8 2002-11
%G eng

%0 Generic
%D 2002
%T Hardware Software Server in NetSolve
%A Sudesh Agrawal
%K netsolve
%B ICL Technical Report
%8 2002-01
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 2002
%T HARNESS Fault Tolerant MPI Design, Usage and Performance Issues
%A Graham Fagg
%A Jack Dongarra
%B Future Generation Computer Systems
%V 18
%P 1127-1142
%8 2002-01
%G eng

%0 Journal Article
%J Concurrency: Practice and Experience
%D 2002
%T Innovations of the NetSolve Grid Computing System
%A Dorian Arnold
%A Henri Casanova
%A Jack Dongarra
%K netsolve
%B Concurrency: Practice and Experience
%V 14
%P 1457-1479
%8 2002-01
%G eng

%0 Conference Proceedings
%B Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002)
%D 2002
%T The Internet BackPlane Protocol: A Study in Resource Sharing
%A Alessandro Bassi
%A Micah Beck
%A Graham Fagg
%A Terry Moore
%A James Plank
%A Martin Swany
%A Rich Wolski
%K ftmpi
%B Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002)
%C Berlin, Germany
%8 2002-10
%G eng

%0 Journal Article
%J Scientific Programming (to appear)
%D 2002
%T An Iterative Solver Benchmark
%A Jack Dongarra
%A Victor Eijkhout
%A Henk van der Vorst
%B Scientific Programming (to appear)
%8 2002-00
%G eng

%0 Journal Article
%J Scientific Programming
%D 2002
%T JLAPACK - Compiling LAPACK Fortran to Java
%A David Doolin
%A Jack Dongarra
%A Keith Seymour
%K f2j
%B Scientific Programming
%V 7
%P 111-138
%8 2002-10
%G eng

%0 Journal Article
%J Parallel Computing
%D 2002
%T The Marketplace for High-Performance Computers
%A Erich Strohmaier
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%B Parallel Computing
%V 25
%P 1517-1545
%8 2002-10
%G eng

%0 Conference Proceedings
%B Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002)
%D 2002
%T A Metascheduler For The Grid
%A Sathish Vadhiyar
%A Jack Dongarra
%K grads
%B Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002)
%I IEEE Computer Society
%C Edinburgh, Scotland
%P 343-351
%8 2002-07
%G eng

%0 Journal Article
%J Parallel Computing
%D 2002
%T Middleware for the Use of Storage in Communication
%A Micah Beck
%A Dorian Arnold
%A Alessandro Bassi
%A Francine Berman
%A Henri Casanova
%A Jack Dongarra
%A Terry Moore
%A Graziano Obertelli
%A James Plank
%A Martin Swany
%A Sathish Vadhiyar
%A Rich Wolski
%K netsolve
%B Parallel Computing
%V 28
%P 1773-1788
%8 2002-08
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience, Special Issue: Grid Computing Environments
%D 2002
%T NetBuild: Transparent Cross-Platform Access to Computational Software Libraries
%A Keith Moore
%A Jack Dongarra
%K netbuild
%B Concurrency and Computation: Practice and Experience, Special Issue: Grid Computing Environments
%V 14
%P 1445-1456
%8 2002-11
%G eng

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing
%D 2002
%T Numerical Libraries and Tools for Scalable Parallel Cluster Computing
%A Shirley Browne
%A Jack Dongarra
%A Anne Trefethen
%B International Journal of High Performance Applications and Supercomputing
%V 15
%P 175-180
%8 2002-10
%G eng

%0 Journal Article
%J Meeting of the Japan Society of Mechanical Engineers
%D 2002
%T Optimization System Using Grid RPC
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Hisashi Shimosaka
%A Yusuke Tanimura
%A Jack Dongarra
%B Meeting of the Japan Society of Mechanical Engineers
%C Kyoto University, Kyoto, Japan
%8 2002-10
%G eng

%0 Conference Proceedings
%B Proceedings of the Third International Workshop on Grid Computing
%D 2002
%T Overview of GridRPC: A Remote Procedure Call API for Grid Computing
%A Keith Seymour
%A Hidemoto Nakada
%A Satoshi Matsuoka
%A Jack Dongarra
%A Craig Lee
%A Henri Casanova
%E Manish Parashar
%B Proceedings of the Third International Workshop on Grid Computing
%P 274-278
%8 2002-01
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2002
%T A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures
%A Greg Henry
%A David Watkins
%A Jack Dongarra
%B SIAM Journal on Scientific Computing
%V 16
%P 284-311
%8 2002-10
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2002
%T Parallelizing the Divide and Conquer Algorithm for the Symmetric Tridiagonal Eigenvalue Problem on Distributed Memory Architectures
%A Francoise Tisseur
%A Jack Dongarra
%B SIAM Journal on Scientific Computing
%V 6
%P 2223-2236
%8 2002-10
%G eng

%0 Generic
%D 2002
%T Polynomial Acceleration of Optimised Multi-grid Smoothers; Basic Theory
%A Victor Eijkhout
%B ICL Technical Report
%V 156
%8 2002-01
%G eng

%0 Generic
%D 2002
%T Self-adapting Numerical Software for Next Generation Applications (LAPACK Working Note 157)
%A Jack Dongarra
%A Victor Eijkhout
%K salsa
%K sans
%B ICL Technical Report
%8 2002-00
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2002
%T Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments
%A Henri Casanova
%A Michael G. Thomason
%A Jack Dongarra
%B Journal of Parallel and Distributed Computing
%V 98
%P 68-91
%8 2002-10
%G eng

%0 Conference Proceedings
%B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops
%D 2002
%T Toward a Framework for Preparing and Executing Adaptive Grid Programs
%A Ken Kennedy
%A John Mellor-Crummey
%A Keith Cooper
%A Linda Torczon
%A Francine Berman
%A Andrew Chien
%A Dave Angulo
%A Ian Foster
%A Dennis Gannon
%A Lennart Johnsson
%A Carl Kesselman
%A Jack Dongarra
%A Sathish Vadhiyar
%K grads
%B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops
%C Fort Lauderdale, FL
%P 0171
%8 2002-04
%G eng

%0 Journal Article
%J Meeting of the Japan Society of Mechanical Engineers
%D 2002
%T Truss Structural Optimization Using NetSolve System
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Hisashi Shimosaka
%A Masaki Sano
%A Yusuke Tanimura
%A Yasunari Mimura
%A Shinobu Yoshimura
%A Jack Dongarra
%K netsolve
%B Meeting of the Japan Society of Mechanical Engineers
%C Kyoto University, Kyoto, Japan
%8 2002-10
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2002
%T An Updated Set of Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B ACM Transactions on Mathematical Software
%V 28
%P 135-151
%8 2002-12
%G eng
%R 10.1145/567806.567807

%0 Generic
%D 2002
%T Users' Guide to NetSolve v1.4.1
%A Sudesh Agrawal
%A Dorian Arnold
%A Susan Blackford
%A Jack Dongarra
%A Michelle Miller
%A Kiran Sagi
%A Zhiao Shi
%A Keith Seymour
%A Sathish Vadhiyar
%K netsolve
%B ICL Technical Report
%8 2002-06
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing (submitted)
%D 2002
%T The Virtual Instrument: Support for Grid-enabled Scientific Simulations
%A Henri Casanova
%A Thomas Bartol
%A Francine Berman
%A Adam Birnbaum
%A Jack Dongarra
%A Mark Ellisman
%A Marcio Faerman
%A Erhan Gockay
%A Michelle Miller
%A Graziano Obertelli
%A Stuart Pomerantz
%A Terry Sejnowski
%A Joel Stiles
%A Rich Wolski
%B Journal of Parallel and Distributed Computing (submitted)
%8 2002-10
%G eng

%0 Journal Article
%J Parallel Computing
%D 2001
%T Automated Empirical Optimization of Software and the ATLAS Project
%A Clint Whaley
%A Antoine Petitet
%A Jack Dongarra
%K atlas
%B Parallel Computing
%V 27
%P 3-25
%8 2001-01
%G eng

%0 Generic
%D 2001
%T Automatic Determination of Matrix-Blocks
%A Victor Eijkhout
%B Lapack Working Note 151, University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Conference Proceedings
%B Joint ACM Java Grande - ISCOPE 2001 Conference (submitted)
%D 2001
%T Automatic Translation of Fortran to JVM Bytecode
%A Keith Seymour
%A Jack Dongarra
%K f2j
%B Joint ACM Java Grande - ISCOPE 2001 Conference (submitted)
%C Stanford University, California
%8 2001-06
%G eng

%0 Journal Article
%J (an update), submitted to ACM TOMS
%D 2001
%T Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B (an update), submitted to ACM TOMS
%8 2001-02
%G eng

%0 Journal Article
%D 2001
%T Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard
%8 2001-01
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 2001
%T On the Convergence of Computational and Data Grids
%A Dorian Arnold
%A Sathish Vadhiyar
%A Jack Dongarra
%K netsolve
%B Parallel Processing Letters
%V 11
%P 187-202
%8 2001-01
%G eng

%0 Conference Proceedings
%B Tenth International World Wide Web Conference Proceedings (to appear)
%D 2001
%T Enabling Full Service Surrogates Using the Portable Channel Representation
%A Micah Beck
%A Terry Moore
%A Leif Abrahamsson
%A Christophe Achouiantz
%A Patrik Johansson
%B Tenth International World Wide Web Conference Proceedings (to appear)
%C Hong Kong
%8 2001-05
%G eng

%0 Conference Paper
%B International Conference on Parallel and Distributed Computing Systems
%D 2001
%T End-user Tools for Application Performance Analysis, Using Hardware Counters
%A Kevin London
%A Jack Dongarra
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A T. Spencer
%K papi
%X One purpose of the end-user tools described in this paper is to give users a graphical representation of performance information that has been gathered by instrumenting an application with the PAPI library. PAPI is a project that specifies a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events", which are occurrences of specific signals and states related to a processor’s function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. The perfometer tool developed by the PAPI project provides a graphical view of this information, allowing users to quickly see where performance bottlenecks are in their application. Only one function call has to be added by the user to their program to take advantage of perfometer. This makes it quick and simple to add and remove instrumentation from a program. Also, perfometer allows users to change the "event" they are monitoring. Add the ability to monitor parallel applications, set alarms and a Java front-end that can run anywhere, and this gives the user a powerful tool for quickly discovering where and why a bottleneck exists. A number of third-party tools for analyzing performance of message-passing and/or threaded programs have also incorporated support for PAPI so as to be able to display and analyze hardware counter data from their interfaces.
%B International Conference on Parallel and Distributed Computing Systems
%C Dallas, TX
%8 2001-08
%G eng

%0 Conference Proceedings
%B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science
%D 2001
%T Fault Tolerant MPI for the HARNESS Meta-Computing System
%A Graham Fagg
%A Antonin Bukovsky
%A Jack Dongarra
%E Benjoe A. Juliano
%E R. Renner
%E K. Tan
%K ftmpi
%K harness
%B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science
%I Springer Verlag
%C Berlin
%V 2073
%P 355-366
%8 2001-00
%G eng
%R https://doi.org/10.1007/3-540-45545-0_44

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing
%D 2001
%T The GrADS Project: Software Support for High-Level Grid Application Development
%A Francine Berman
%A Andrew Chien
%A Keith Cooper
%A Jack Dongarra
%A Ian Foster
%A Dennis Gannon
%A Lennart Johnsson
%A Ken Kennedy
%A Carl Kesselman
%A John Mellor-Crummey
%A Dan Reed
%A Linda Torczon
%A Rich Wolski
%K grads
%B International Journal of High Performance Applications and Supercomputing
%V 15
%P 327-344
%8 2001-01
%G eng

%0 Conference Proceedings
%B Proceedings of the High Performance Computing Symposium (HPC 2001) in 2001 Advanced Simulation Technologies Conference
%D 2001
%T Grid-Enabling Problem Solving Environments: A Case Study of SCIRUN and NetSolve
%A Michelle Miller
%A Christopher Moulding
%A Jack Dongarra
%A Christopher Johnson
%K netsolve
%B Proceedings of the High Performance Computing Symposium (HPC 2001) in 2001 Advanced Simulation Technologies Conference
%I Society for Modeling and Simulation International
%C Seattle, Washington
%8 2001-04
%G eng

%0 Journal Article
%J Parallel Computing
%D 2001
%T HARNESS and Fault Tolerant MPI
%A Graham Fagg
%A Antonin Bukovsky
%A Jack Dongarra
%B Parallel Computing
%V 27
%P 1479-1496
%8 2001-01
%G eng

%0 Journal Article
%J HERMIS
%D 2001
%T High Performance Computing Trends
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%B HERMIS
%V 2
%P 155-163
%8 2001-11
%G eng

%0 Generic
%D 2001
%T Internet Backplane Protocol: API 1.0
%A Alessandro Bassi
%A Micah Beck
%A James Plank
%A Rich Wolski
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Generic
%D 2001
%T Internet Backplane Protocol - Test Language v. 1.0
%A Alessandro Bassi
%A Xiaoye Li
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Journal Article
%J Scientific Programming
%D 2001
%T Iterative Solver Benchmark (LAPACK Working Note 152)
%A Jack Dongarra
%A Victor Eijkhout
%A Henk van der Vorst
%B Scientific Programming
%V 9
%P 223-231
%8 2001-00
%G eng

%0 Journal Article
%J submitted to SC2001
%D 2001
%T Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication
%A Micah Beck
%A Dorian Arnold
%A Alessandro Bassi
%A Francine Berman
%A Henri Casanova
%A Jack Dongarra
%A Terry Moore
%A Graziano Obertelli
%A James Plank
%A Martin Swany
%A Sathish Vadhiyar
%A Rich Wolski
%K netsolve
%B submitted to SC2001
%C Denver, Colorado
%8 2001-11
%G eng

%0 Journal Article
%J SIAM Review (book review)
%D 2001
%T Measuring Computer Performance: A Practitioner's Guide
%A Jack Dongarra
%B SIAM Review (book review)
%V 43
%P 383-384
%8 2001-00
%G eng

%0 Conference Proceedings
%B Department of Defense Users' Group Conference (to appear)
%D 2001
%T Metacomputing Support for the SARA3D Structural Acoustics Application
%A Shirley Moore
%A Dorian Arnold
%A David Cronk
%K netsolve
%B Department of Defense Users' Group Conference (to appear)
%C Biloxi, Mississippi
%8 2001-06
%G eng

%0 Generic
%D 2001
%T NetBuild
%A Keith Moore
%A Jack Dongarra
%K netbuild
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Conference Proceedings
%B 2001 High Performance Computing Symposium (HPC'01), part of the Advanced Simulation Technologies Conference
%D 2001
%T Network-Enabled Server Systems: Deploying Scientific Simulations on the Grid
%A Henri Casanova
%A Satoshi Matsuoka
%A Jack Dongarra
%B 2001 High Performance Computing Symposium (HPC'01), part of the Advanced Simulation Technologies Conference
%C Seattle, Washington
%8 2001-04
%G eng

%0 Journal Article
%J SIAM News
%D 2001
%T Network-Enabled Solvers: A Step Toward Grid-Based Computing
%A Jack Dongarra
%B SIAM News
%V 34
%8 2001-12
%G eng

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing
%D 2001
%T Numerical Libraries and The Grid
%A Antoine Petitet
%A Susan Blackford
%A Jack Dongarra
%A Brett Ellis
%A Graham Fagg
%A Kenneth Roche
%A Sathish Vadhiyar
%K grads
%B International Journal of High Performance Applications and Supercomputing
%V 15
%P 359-374
%8 2001-01
%G eng

%0 Generic
%D 2001
%T Numerical Libraries and The Grid: The Grads Experiments with ScaLAPACK
%A Antoine Petitet
%A Susan Blackford
%A Jack Dongarra
%A Brett Ellis
%A Graham Fagg
%A Kenneth Roche
%A Sathish Vadhiyar
%K grads
%K scalapack
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing
%D 2001
%T Numerical Libraries and Tools for Scalable Parallel Cluster Computing
%A Jack Dongarra
%A Shirley Moore
%A Anne Trefethen
%B International Journal of High Performance Applications and Supercomputing
%V 15
%P 175-180
%8 2001-01
%G eng

%0 Journal Article
%J Handbook of Massive Data Sets
%D 2001
%T Overview of High Performance Computers
%A Aad J. van der Steen
%A Jack Dongarra
%E James Abello
%E Panos Pardalos
%E Mauricio Resende
%B Handbook of Massive Data Sets
%I Kluwer Academic Publishers
%P 791-852
%8 2001-01
%G eng

%0 Conference Paper
%B Department of Defense Users' Group Conference Proceedings
%D 2001
%T The PAPI Cross-Platform Interface to Hardware Performance Counters
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A Richard Luczak
%K papi
%X The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events," which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis and tuning. The PAPI project has developed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware. The PAPI library has been implemented for a number of Shared Resource Center platforms. The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data. PAPI support is also being incorporated into a number of third-party tools.
%B Department of Defense Users' Group Conference Proceedings
%C Biloxi, Mississippi
%8 2001-06
%G eng

%0 Conference Proceedings
%B Department of Defense Users' Group Conference Proceedings (to appear)
%D 2001
%T Parallel I/O for EQM Applications
%A David Cronk
%A Graham Fagg
%A Shirley Moore
%K ftmpi
%B Department of Defense Users' Group Conference Proceedings (to appear)
%C Biloxi, Mississippi
%8 2001-06
%G eng

%0 Journal Article
%J 8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science
%D 2001
%T Parallel IO Support for Meta-Computing Applications: MPI_Connect IO Applied to PACX-MPI
%A Graham Fagg
%A Edgar Gabriel
%A Michael Resch
%K ftmpi
%B 8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science
%I Springer Verlag, Berlin
%C Greece
%V 2131
%8 2001-09
%G eng

%0 Conference Proceedings
%B LACSI Symposium 2001
%D 2001
%T Performance Modeling for Self Adapting Collective Communications for MPI
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B LACSI Symposium 2001
%C Santa Fe, NM
%8 2001-10
%G eng

%0 Generic
%D 2001
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Journal Article
%J Computing in Science and Engineering
%D 2001
%T The Quest for Petascale Computing
%A Jack Dongarra
%A David W. Walker
%B Computing in Science and Engineering
%V 3
%P 32-39
%8 2001-05
%G eng

%0 Journal Article
%J Scientific Programming
%D 2001
%T Recursive Approach in Sparse Matrix LU Factorization
%A Jack Dongarra
%A Victor Eijkhout
%A Piotr Luszczek
%B Scientific Programming
%V 9
%P 51-60
%8 2001-00
%G eng

%0 Generic
%D 2001
%T Repository in a Box Toolkit for Software and Resource Sharing
%A Shirley Browne
%A Paul McMahan
%A Scott Wells
%K rib
%B University of Tennessee Computer Science Department Technical Report
%8 2001-00
%G eng

%0 Journal Article
%J European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131
%D 2001
%T Review of Performance Analysis Tools for MPI Parallel Programs
%A Shirley Moore
%A David Cronk
%A Kevin London
%A Jack Dongarra
%K papi
%X In order to produce MPI applications that perform well on today’s parallel architectures, programmers need effective tools for collecting and analyzing performance data. A variety of such tools, both commercial and research, are becoming available. This paper reviews and evaluates the available cross-platform MPI performance analysis tools.
%B European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131
%I Springer Verlag, Berlin
%C Greece
%P 241-248
%8 2001-09
%G eng
%R https://doi.org/10.1007/3-540-45417-9_34

%0 Generic
%D 2001
%T RIBAPI - Repository in a Box Application Programmer's Interface
%A Jeremy Millar
%A Paul McMahan
%A Jack Dongarra
%K rib
%B University of Tennessee Computer Science Technical Report
%8 2001-00
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2001
%T Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries
%A Ken Kennedy
%A Bradley Broom
%A Keith Cooper
%A Jack Dongarra
%A Rob Fowler
%A Dennis Gannon
%A Lennart Johnsson
%A John Mellor-Crummey
%A Linda Torczon
%B Journal of Parallel and Distributed Computing
%V 61
%P 1803-1826
%8 2001-12
%G eng

%0 Conference Paper
%B Conference on Linux Clusters: The HPC Revolution
%D 2001
%T Using PAPI for Hardware Performance Monitoring on Linux Systems
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%K papi
%X PAPI is a specification of a cross-platform interface to hardware performance counters on modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to a processor's function. Monitoring these events has a variety of uses in application performance analysis and tuning. The PAPI specification consists of both a standard set of events deemed most relevant for application performance tuning, as well as both high-level and low-level sets of routines for accessing the counters. The high-level interface simply provides the ability to start, stop, and read sets of events, and is intended for the acquisition of simple but accurate measurement by application engineers. The fully programmable low-level interface provides sophisticated options for controlling the counters, such as setting thresholds for interrupt on overflow, as well as access to all native counting modes and events, and is intended for third-party tool writers or users with more sophisticated needs. PAPI has been implemented on a number of platforms, including Linux/x86 and Linux/IA-64. The Linux/x86 implementation requires a kernel patch that provides a driver for the hardware counters. The driver memory maps the counter registers into user space and allows virtualizing the counters on a per-process or per-thread basis. The kernel patch is being proposed for inclusion in the main Linux tree. The PAPI library provides access on Linux platforms not only to the standard set of events mentioned above but also to all the Linux/x86 and Linux/IA-64 native events. PAPI has been installed and is in use, either directly or through incorporation into third-party end-user performance analysis tools, on a number of Linux clusters, including the New Mexico LosLobos cluster and Linux clusters at NCSA and the University of Tennessee being used for the GrADS (Grid Application Development Software) project.
%B Conference on Linux Clusters: The HPC Revolution
%I Linux Clusters Institute
%C Urbana, Illinois
%8 2001-06
%G eng

%0 Generic
%D 2000
%T Automated Empirical Optimizations of Software and the ATLAS Project (LAPACK Working Note 147)
%A Clint Whaley
%A Antoine Petitet
%A Jack Dongarra
%K atlas
%B University of Tennessee Computer Science Department Technical Report
%8 2000-09
%G eng

%0 Conference Proceedings
%B Proceedings of SuperComputing 2000 (SC'2000)
%D 2000
%T Automatically Tuned Collective Communications
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B Proceedings of SuperComputing 2000 (SC'2000)
%C Dallas, TX
%8 2000-11
%G eng

%0 Generic
%D 2000
%T Design and Implementation of NetSolve using DCOM as the Remoting Layer
%A Ganapathy Raman
%A Jack Dongarra
%K netsolve
%B University of Tennessee Computer Science Department Technical Report
%8 2000-05
%G eng

%0 Journal Article
%J Concurrency: Practice and Experience
%D 2000
%T The Design and Implementation of the Parallel Out of Core ScaLAPACK LU, QR, and Cholesky Factorization Routines
%A Eduardo D'Azevedo
%A Jack Dongarra
%B Concurrency: Practice and Experience
%V 12
%P 1481-1493
%8 2000-01
%G eng

%0 Conference Proceedings
%B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications
%D 2000
%T Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications
%A Dorian Arnold
%A Jack Dongarra
%K netsolve
%B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications
%C Ottawa, Canada
%8 2000-10
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000
%D 2000
%T FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000
%I Springer Verlag
%C Hungary
%V 1908
%P 346-353
%8 2000-01
%G eng

%0 Generic
%D 2000
%T The GrADS Project: Software Support for High-Level Grid Application Development
%A Francine Berman
%A Andrew Chien
%A Keith Cooper
%A Jack Dongarra
%A Ian Foster
%A Dennis Gannon
%A Lennart Johnsson
%A Ken Kennedy
%A Carl Kesselman
%A Dan Reed
%A Linda Torczon
%A Rich Wolski
%K grads
%B Technical Report
%8 2000-02
%G eng

%0 Conference Proceedings
%B FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear)
%D 2000
%T High Performance Computing Today
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%B FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear)
%8 2000-01
%G eng

%0 Journal Article
%J In Active Middleware Services, Ed. Salim Hariri, Craig A. Lee, Cauligi S. Raghavendra (2000), Kluwer Academic
%D 2000
%T Logistical Networking: Sharing More Than the Wires
%A Micah Beck
%A Terry Moore
%A James Plank
%A Martin Swany
%B In Active Middleware Services, Ed. Salim Hariri, Craig A. Lee, Cauligi S. Raghavendra (2000), Kluwer Academic
%C Norwell, MA
%8 2000-01
%G eng

%0 Journal Article
%J Encyclopedia of Electrical and Electronics Engineering, Supplement 1
%D 2000
%T Message Passing Software Systems
%A Jack Dongarra
%A Graham Fagg
%A Rolf Hempel
%A David W. Walker
%E J. Webster
%K ftmpi
%B Encyclopedia of Electrical and Electronics Engineering, Supplement 1
%I John Wiley & Sons, Inc.
%8 2000-00
%G eng

%0 Generic
%D 2000
%T Metacomputing: An Evaluation of Emerging Systems
%A David Cronk
%A Brett Ellis
%A Graham Fagg
%B University of Tennessee Computer Science Department Technical Report
%8 2000-07
%G eng

%0 Conference Proceedings
%B 2000 International Conference on Parallel Processing (ICPP-2000)
%D 2000
%T The NetSolve Environment: Progressing Towards the Seamless Grid
%A Dorian Arnold
%A Jack Dongarra
%K netsolve
%B 2000 International Conference on Parallel Processing (ICPP-2000)
%C Toronto, Canada
%8 2000-08
%G eng

%0 Conference Proceedings
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation
%D 2000
%T A New Recursive Implementation of Sparse Cholesky Factorization
%A Jack Dongarra
%A Padma Raghavan
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation
%C Lausanne, Switzerland
%8 2000-08
%G eng

%0 Generic
%D 2000
%T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report
%8 2000-01
%G eng

%0 Generic
%D 2000
%T A Portable Programming Interface for Performance Evaluation on Modern Processors
%A Shirley Browne
%A Jack Dongarra
%A Nathan Garner
%A Kevin London
%A Phil Mucci
%B University of Tennessee Computer Science Technical Report, UT-CS-00-444
%8 2000-07
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2000
%T A Portable Programming Interface for Performance Evaluation on Modern Processors
%A Shirley Browne
%A Jack Dongarra
%A Nathan Garner
%A George Ho
%A Phil Mucci
%K papi
%B The International Journal of High Performance Computing Applications
%V 14
%P 189-204
%8 2000-09
%G eng
%R https://doi.org/10.1177/109434200001400303

%0 Journal Article
%J ASTC-HPC 2000
%D 2000
%T Providing Infrastructure and Interface to High Performance Applications in a Distributed Setting
%A Dorian Arnold
%A Wonsuck Lee
%A Jack Dongarra
%A Mary Wheeler
%B ASTC-HPC 2000
%C Washington, DC
%8 2000-04
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science: Proceedings of 7th European PVM/MPI Users' Group Meeting 2000
%D 2000
%T Recent Advances in Parallel Virtual Machine and Message Passing Interface
%A Jack Dongarra
%A Peter Kacsuk
%A N. Podhorszki
%K ftmpi
%B Lecture Notes in Computer Science: Proceedings of 7th European PVM/MPI Users' Group Meeting 2000
%I Springer Verlag
%C Hungary
%V 1908
%8 2000-01
%G eng

%0 Conference Proceedings
%B Proceedings of 1st SGI Users Conference
%D 2000
%T Recursive approach in sparse matrix LU factorization
%A Jack Dongarra
%A Victor Eijkhout
%A Piotr Luszczek
%B Proceedings of 1st SGI Users Conference
%I ACC Cyfronet UMM
%C Cracow, Poland
%P 409-418
%8 2000-01
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing
%D 2000
%T Request Sequencing: Optimizing Communication for the Grid
%A Dorian Arnold
%A Dieter Bachmann
%A Jack Dongarra
%K netsolve
%B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing
%I Springer Verlag
%C Germany
%V 1900
%P 1213-1222
%8 2000-01
%G eng

%0 Conference Proceedings
%B Proceedings of SuperComputing 2000 (SC'00)
%D 2000
%T A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters
%A Shirley Browne
%A Jack Dongarra
%A Nathan Garner
%A Kevin London
%A Phil Mucci
%K papi
%B Proceedings of SuperComputing 2000 (SC'00)
%C Dallas, TX
%8 2000-11
%G eng

%0 Conference Proceedings
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation
%D 2000
%T Seamless Access to Adaptive Solver Algorithms
%A Dorian Arnold
%A Susan Blackford
%A Jack Dongarra
%A Victor Eijkhout
%A Tinghua Xu
%K netsolve
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation
%C Lausanne, Switzerland
%8 2000-08
%G eng

%0 Generic
%D 2000
%T Secure Remote Access to Numerical Software and Computation Hardware
%A Dorian Arnold
%A Shirley Browne
%A Jack Dongarra
%A Graham Fagg
%A Keith Moore
%B University of Tennessee Computer Science Technical Report, UT-CS-00-446
%8 2000-07
%G eng

%0 Conference Proceedings
%B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000
%D 2000
%T Secure Remote Access to Numerical Software and Computational Hardware
%A Dorian Arnold
%A Shirley Browne
%A Jack Dongarra
%A Graham Fagg
%A Keith Moore
%K netsolve
%B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000
%C Albuquerque, NM
%8 2000-06
%G eng

%0 Generic
%D 2000
%T Top500 Supercomputer Sites (15th edition)
%A Jack Dongarra
%A Hans Meuer
%A Erich Strohmaier
%K top500
%B University of Tennessee Computer Science Department Technical Report
%8 2000-06
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 1999
%T Algorithmic Issues on Heterogeneous Computing Platforms
%A Pierre Boulet
%A Jack Dongarra
%A Fabrice Rastello
%A Yves Robert
%A Frederic Vivien
%B Parallel Processing Letters
%V 9
%P 197-213
%8 1999-01
%G eng

%0 Journal Article
%J SIAM News
%D 1999
%T Atlanta Organizers Put Mathematics to Work For the Math Sciences Community
%A Michael Berry
%A Jack Dongarra
%B SIAM News
%V 32
%8 1999-01
%G eng

%0 Generic
%D 1999
%T A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow Banded Linear Systems II (LAPACK Working Note 143)
%A Peter Arbenz
%A Andrew Cleary
%A Jack Dongarra
%A Markus Hegland
%B University of Tennessee Computer Science Department Technical Report
%8 1999-01
%G eng

%0 Generic
%D 1999
%T A Comparison of Parallel Solvers for General Narrow Banded Linear Systems (LAPACK Working Note 142)
%A Peter Arbenz
%A Andrew Cleary
%A Jack Dongarra
%A Markus Hegland
%B University of Tennessee Computer Science Technical Report
%8 1999-01
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 1999
%T Deploying Fault-tolerance and Task Migration with NetSolve
%A Henri Casanova
%A James Plank
%A Micah Beck
%A Jack Dongarra
%K netsolve
%B Future Generation Computer Systems
%I Elsevier
%V 15
%P 745-755
%8 1999-10
%G eng

%0 Generic
%D 1999
%T On the Existence Problem of Incomplete Factorisation Methods
%A Victor Eijkhout
%B University of Tennessee Computer Science Department Technical Report
%8 1999-12
%G eng

%0 Journal Article
%J Parallel and Distributed Computing Practices, Special Issue: Cluster Computing
%D 1999
%T Experiences with Windows 95/NT as a Cluster Computing Platform for Parallel Computing
%A Markus Fischer
%A Jack Dongarra
%B Parallel and Distributed Computing Practices, Special Issue: Cluster Computing
%I Nova Science Publishers, USA
%V 2
%P 119-128
%8 1999-02
%G eng

%0 Journal Article
%J International Journal on Future Generation Computer Systems
%D 1999
%T HARNESS: A Next Generation Distributed Virtual Machine
%A Micah Beck
%A Jack Dongarra
%A Graham Fagg
%A Al Geist
%A Paul Gray
%A James Kohl
%A Mauro Migliardi
%A Keith Moore
%A Terry Moore
%A Philip Papadopoulos
%A Stephen L. Scott
%A Vaidy Sunderam
%K harness
%B International Journal on Future Generation Computer Systems
%V 15
%P 571-582
%8 1999-01
%G eng

%0 Generic
%D 1999
%T IBP - Internet Backplane Protocol: Infrastructure for Distributed Storage (V 0.2)
%A Wael Elwasif
%A Micah Beck
%A James Plank
%B University of Tennessee Computer Science Department Technical Report
%8 1999-02
%G eng

%0 Journal Article
%J Philadelphia: Society for Industrial and Applied Mathematics
%D 1999
%T LAPACK Users' Guide, 3rd ed.
%A Ed Anderson
%A Zhaojun Bai
%A Christian Bischof
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Jeremy Du Croz
%A Anne Greenbaum
%A Sven Hammarling
%A Alan McKenney
%A Danny Sorensen
%B Philadelphia: Society for Industrial and Applied Mathematics
%8 1999-01
%G eng

%0 Journal Article
%J Computer Communications
%D 1999
%T Logistical Quality of Service in NetSolve
%A Micah Beck
%A Henri Casanova
%A Jack Dongarra
%A Terry Moore
%A James Plank
%A Francine Berman
%A Rich Wolski
%K netsolve
%B Computer Communications
%V 22
%P 1034-1044
%8 1999-01
%G eng

%0 Journal Article
%J IEEE Cluster Computing BOF at SC99
%D 1999
%T Numerical Libraries and Tools for Scalable Parallel Cluster Computing
%A Shirley Browne
%A Jack Dongarra
%A Anne Trefethen
%B IEEE Cluster Computing BOF at SC99
%C Portland, Oregon
%8 1999-01
%G eng

%0 Journal Article
%J Encyclopedia of Computer Science and Technology, eds. Kent, A., Williams, J.
%D 1999
%T Numerical Linear Algebra
%A Jack Dongarra
%A Victor Eijkhout
%I Marcel Dekker
%B Encyclopedia of Computer Science and Technology, eds. Kent, A., Williams, J.
%V 41
%P 207-233
%8 1999-08
%G eng

%0 Journal Article
%J Journal of Computational and Applied Mathematics
%D 1999
%T Numerical Linear Algebra Algorithms and Software
%A Jack Dongarra
%A Victor Eijkhout
%B Journal of Computational and Applied Mathematics
%V 123
%P 489-514
%8 1999-10
%G eng

%0 Journal Article
%J SIAM Annual Meeting
%D 1999
%T A Numerical Linear Algebra Problem Solving Environment Designer's Perspective (LAPACK Working Note 139)
%A Antoine Petitet
%A Henri Casanova
%A Clint Whaley
%A Jack Dongarra
%A Yves Robert
%B SIAM Annual Meeting
%C Atlanta, GA
%8 1999-05
%G eng

%0 Conference Proceedings
%B Proceedings of Department of Defense HPCMP Users Group Conference
%D 1999
%T PAPI: A Portable Interface to Hardware Performance Counters
%A Shirley Browne
%A Christine Deane
%A George Ho
%A Phil Mucci
%K papi
%B Proceedings of Department of Defense HPCMP Users Group Conference
%8 1999-06
%G eng

%0 Journal Article
%J Handbook on Parallel and Distributed Processing
%D 1999
%T Parallel and Distributed Scientific Computing: A Numerical Linear Algebra Problem Solving Environment Designer's Perspective
%A Antoine Petitet
%A Henri Casanova
%A Jack Dongarra
%A Yves Robert
%A Clint Whaley
%B Handbook on Parallel and Distributed Processing
%8 1999-01
%G eng

%0 Conference Proceedings
%B 4th Intl. Web Caching Workshop
%D 1999
%T Portable Representation of Internet Content Channels in I2-DSI
%A Micah Beck
%A Rajeev Chawla
%A Bert Dempsey
%A Terry Moore
%B 4th Intl. Web Caching Workshop
%C San Diego, CA
%8 1999-03
%G eng

%0 Journal Article
%J Journal on Future Generation Computer Systems
%D 1999
%T Scalable Networked Information Processing Environment (SNIPE)
%A Graham Fagg
%A Keith Moore
%A Jack Dongarra
%K harness
%B Journal on Future Generation Computer Systems
%V 15
%P 595-605
%8 1999-01
%G eng

%0 Journal Article
%J Parallel Computing
%D 1999
%T Static Tiling for Heterogeneous Computing Platforms
%A Pierre Boulet
%A Jack Dongarra
%A Yves Robert
%A Frederic Vivien
%B Parallel Computing
%V 25
%P 547-568
%8 1999-01
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 1999
%T Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments
%A Henri Casanova
%A Myung Ho Kim
%A James Plank
%A Jack Dongarra
%B Journal of Parallel and Distributed Computing
%V 98
%P 68-91
%8 1999-01
%G eng

%0 Journal Article
%J Concurrency: Practice and Experience
%D 1999
%T Tiling on Systems with Communication/Computation Overlap
%A Pierre-Yves Calland
%A Jack Dongarra
%A Yves Robert
%B Concurrency: Practice and Experience
%V 11
%P 139-153
%8 1999-01
%G eng

%0 Generic
%D 1999
%T Top500 Supercomputer Sites (13th edition)
%A Jack Dongarra
%A Hans Meuer
%A Erich Strohmaier
%K top500
%B University of Tennessee Computer Science Department Technical Report
%8 1999-06
%G eng

%0 Generic
%D 1999
%T Top500 Supercomputer Sites (14th edition)
%A Jack Dongarra
%A Hans Meuer
%A Erich Strohmaier
%K top500
%B University of Tennessee Computer Science Department Technical Report
%8 1999-11
%G eng

%0 Generic
%D 1999
%T Towards An Efficient, Scalable Replication Mechanism for the I2-DSI Project
%A Bert Dempsey
%A Debra Weiss
%B University of North Carolina School of Library and Information Science Technical Report
%8 1999-01
%G eng

%0 Generic
%D 1999
%T The 'Weighted Modification' Incomplete Factorisation Method
%A Victor Eijkhout
%B University of Tennessee Computer Science Department Technical Report
%8 1999-12
%G eng

%0 Conference Paper
%B 1998 ACM/IEEE conference on Supercomputing (SC '98)
%D 1998
%T Automatically Tuned Linear Algebra Software
%A Clint Whaley
%A Jack Dongarra
%K BLAS
%K code generation
%K high performance
%K linear algebra
%K optimization
%K Tuning
%X This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However much of the technology and approach developed here can be applied to the other Level 3 BLAS and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations.
%B 1998 ACM/IEEE conference on Supercomputing (SC '98)
%I IEEE Computer Society
%C Orlando, FL
%8 1998-11
%@ 0-89791-984-X
%G eng

%0 Book
%D 1998
%T MPI - The Complete Reference, Volume 1: The MPI Core
%A Marc Snir
%A Steve Otto
%A Steven Huss-Lederman
%A David Walker
%A Jack Dongarra
%X Since its release in summer 1994, the Message Passing Interface (MPI) specification has become a standard for message-passing libraries for parallel computations. There exist more than a dozen implementations on a variety of computing platforms, from the IBM SP-2 supercomputer to PCs running Windows NT. The initial MPI Standard, known as MPI-1, has been modified over the last two years. This volume, the definitive reference manual for the latest version of MPI-1, contains a complete specification of the MPI Standard. It is annotated with comments that clarify complicated issues, including why certain design choices were made, how users are intended to use the interface, and how they should construct their version of MPI. The volume also provides many detailed, illustrative programming examples.
%7 Second
%I MIT Press
%C Cambridge, MA, USA
%P 426
%8 1998-08
%@ 978-0-262-69215-1
%G eng

%0 Journal Article
%J D-Lib Magazine
%D 1998
%T National HPCC Software Exchange (NHSE): Uniting the High Performance Computing and Communications Community
%A Shirley Browne
%A Jack Dongarra
%A Jeff Horner
%A Paul McMahan
%A Scott Wells
%K rib
%B D-Lib Magazine
%8 1998-01
%G eng

%0 Book
%B Software, Environments and Tools
%D 1998
%T Numerical Linear Algebra for High-Performance Computers
%A Jack Dongarra
%A Iain Duff
%A Danny Sorensen
%A Henk van der Vorst
%X This book presents a unified treatment of recently developed techniques and current understanding about solving systems of linear equations and large scale eigenvalue problems on high-performance computers. It provides a rapid introduction to the world of vector and parallel processing for these linear algebra applications. Topics include major elements of advanced-architecture computers and their performance, recent algorithmic development, and software for direct solution of dense matrix problems, direct solution of sparse systems of equations, iterative solution of sparse systems of equations, and solution of large sparse eigenvalue problems. This book supersedes the SIAM publication Solving Linear Systems on Vector and Shared Memory Computers, which appeared in 1990. The new book includes a considerable amount of new material in addition to incorporating a substantial revision of existing text.
%I SIAM
%G eng
%R https://doi.org/10.1137/1.9780898719611

%0 Journal Article
%J Computer Physics Communications
%D 1996
%T ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance
%A Jaeyoung Choi
%A Jim Demmel
%A Inderjit Dhillon
%A Jack Dongarra
%A Susan Ostrouchov
%A Antoine Petitet
%A Kendall Stanley
%A David Walker
%A Clint Whaley
%X This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK. This paper outlines the difficulties inherent in producing correct codes for networks of heterogeneous processors. We define a theoretical model of parallel computers dedicated to linear algebra applications: the Distributed Linear Algebra Machine (DLAM). This model provides a convenient framework for developing parallel algorithms and investigating their scalability, performance and programmability. Extensive performance results on various platforms are presented and analyzed with the help of the DLAM. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.
%B Computer Physics Communications
%V 97
%P 1-15
%8 1996-08
%G eng
%N 1-2
%R https://doi.org/10.1016/0010-4655(96)00017-3