%0 Book Section
%B Fog Computing: Theory and Practice
%D 2020
%T Harnessing the Computing Continuum for Programming Our World
%A Pete Beckman
%A Jack Dongarra
%A Nicola Ferrier
%A Geoffrey Fox
%A Terry Moore
%A Dan Reed
%A Micah Beck
%X This chapter outlines a vision for how best to harness the computing continuum of interconnected sensors, actuators, instruments, and computing systems, from small numbers of very large devices to large numbers of very small devices. The hypothesis is that only via a continuum perspective can one intentionally specify desired continuum actions and effectively manage outcomes and systemic properties—adaptability and homeostasis, temporal constraints and deadlines—and elevate the discourse from device programming to intellectual goals and outcomes. Development of a framework for harnessing the computing continuum would catalyze new consumer services, business processes, social services, and scientific discovery. Realizing and implementing a continuum programming model requires balancing conflicting constraints and translating the high‐level specification into a form suitable for execution on a unifying abstract machine model. In turn, the abstract machine must implement the mapping of specification demands to end‐to‐end resources.
%B Fog Computing: Theory and Practice
%I John Wiley & Sons, Inc.
%@ 9781119551713
%G eng
%& 7
%R https://doi.org/10.1002/9781119551713.ch7
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2020)
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%K exascale
%K FFT
%K gpu
%K scalable algorithm
%X Exascale computing aspires to meet the increasing demands of large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; hence, it is important not only to develop well-designed software, but also to make it aware of the hardware architecture so that it can efficiently exploit its power. Currently, many diverse applications, such as those that are part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of the heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from state-of-the-art CPU libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V100 GPUs.
%B International Conference on Computational Science (ICCS 2020)
%C Amsterdam, Netherlands
%8 2020-06
%G eng
%R https://doi.org/10.1007/978-3-030-50371-0_19
%0 Generic
%D 2020
%T hipMAGMA v1.0
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I Zenodo
%8 2020-03
%G eng
%U https://doi.org/10.5281/zenodo.3908549
%R 10.5281/zenodo.3908549
%0 Generic
%D 2020
%T hipMAGMA v2.0
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I Zenodo
%8 2020-07
%G eng
%U https://doi.org/10.5281/zenodo.3928667
%R 10.5281/zenodo.3928667
%0 Conference Paper
%B ISC High Performance
%D 2019
%T Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments
%A Kwai Wong
%A Stanimire Tomov
%A Jack Dongarra
%B ISC High Performance
%I Springer International Publishing
%C Frankfurt, Germany
%8 2019-06
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%D 2018
%T Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%X Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide as well as to the improved accuracy over the classical FP16 arithmetic that is obtained because the GEMM accumulation occurs in FP32 arithmetic.
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%I IEEE
%C Dallas, TX
%8 2018-11
%G eng
%R https://doi.org/10.1109/SC.2018.00050
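The FP16-FP64 iterative refinement described in the abstract above can be sketched in NumPy. This is a hedged illustration only (float32 stands in for FP16, and `mixed_precision_solve` is a hypothetical name), not the paper's Tensor Core implementation:

```python
# Sketch of mixed-precision iterative refinement: solve in low precision,
# then refine the residual in double precision until FP64-level accuracy.
import numpy as np

def mixed_precision_solve(A, b, iters=10, tol=1e-12):
    A32 = A.astype(np.float32)          # low-precision copy used for all solves
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                   # residual computed in full (FP64) precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # In a real implementation the LU factors of A32 are computed once
        # and reused here; np.linalg.solve refactors for simplicity.
        c = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += c                          # apply the low-precision correction
    return x
```

For well-conditioned systems, each cheap low-precision correction solve recovers more digits, while the FP64 residual keeps the iteration anchored to the double-precision answer.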
%0 Generic
%D 2018
%T Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC), Poster
%C San Jose, CA
%8 2018-03
%G eng
%0 Conference Paper
%B 8th Workshop on Irregular Applications: Architectures and Algorithms
%D 2018
%T High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation
%A Hartwig Anzt
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%A Florian Scheidegger
%B 8th Workshop on Irregular Applications: Architectures and Algorithms
%G eng
%0 Conference Paper
%B Proceedings of the General Purpose GPUs (GPGPU-10)
%D 2017
%T High-performance Cholesky Factorization for GPU-only Execution
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse-granularity tasks (that can be hierarchically split into fine-grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8× faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
%B Proceedings of the General Purpose GPUs (GPGPU-10)
%I ACM
%C Austin, TX
%8 2017-02
%G eng
%R https://doi.org/10.1145/3038228.3038237
%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2016
%T High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 30
%P 3 - 10
%8 2016-02
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342015593158
%N 1
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342015593158
%0 Generic
%D 2016
%T High Performance Realtime Convex Solver for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K KKT
%K Realtime embedded convex optimization solver
%X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique that reduces the run-time of the decomposition of the KKT matrix in a convex optimization solver for embedded systems by two orders of magnitude. We exploit the property that, although the KKT matrix changes, some of its block sub-matrices remain fixed across the solution iterations and the associated solving instances.
%B University of Tennessee Computer Science Technical Report
%8 2016-10
%G eng
%0 Conference Paper
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)
%D 2016
%T High-performance Matrix-matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Joël Falcou
%A Jack Dongarra
%X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of size less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This case often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases, obtaining performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)
%I Springer International Publishing
%C Grenoble, France
%8 2016-08
%G eng
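As a rough NumPy illustration of the batching idea above (not the MAGMA batched API), many small GEMMs are grouped into a single call over a leading batch dimension instead of a per-matrix loop; `batched_gemm` is a hypothetical name:

```python
# Many independent small GEMMs expressed as one batched operation:
# C[i] = alpha * A[i] @ B[i] + beta * C[i] for every i in the batch.
import numpy as np

def batched_gemm(alpha, A, B, beta, C):
    # A: (batch, m, k), B: (batch, k, n), C: (batch, m, n); C is updated in place.
    # The 3-D matmul performs all products in one vectorized call.
    C[...] = alpha * (A @ B) + beta * C
    return C
```

Grouping the products this way is what lets a library amortize launch and scheduling overheads that would otherwise dominate for matrices of size below 32.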
%0 Generic
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-01
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%K Applications
%K Batched linear algebra
%K FEM
%K gpu
%K Tensor contractions
%K Tensor HPC
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng
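The central idea of the abstract above, recasting a tensor contraction as an index reordering followed by a single GEMM, can be illustrated in plain NumPy (a hedged sketch; `contract_via_gemm` is a hypothetical name, not the MAGMA interface):

```python
# Contract T[a, b, i] with M[i, j] over index i to produce R[a, b, j]:
# flatten the free indices (a, b) into one matrix dimension, do one GEMM,
# then restore the tensor shape.
import numpy as np

def contract_via_gemm(T, M):
    a, b, i = T.shape
    R = T.reshape(a * b, i) @ M        # index reordering + one matrix multiply
    return R.reshape(a, b, M.shape[1])
```

Because the contraction becomes a single GEMM, the data loaded into fast memory can be reused across the whole (a, b) block, which is the source of the many-fold acceleration the abstract describes.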
%0 Generic
%D 2016
%T The HPL Benchmark: Past, Present & Future
%A Jack Dongarra
%C ISC High Performance, Frankfurt, Germany
%8 2016-07
%G eng
%9 Conference Presentation
%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T Hierarchical DAG scheduling for Hybrid Distributed Systems
%A Wei Wu
%A Aurelien Bouteiller
%A George Bosilca
%A Mathieu Faverge
%A Jack Dongarra
%K dense linear algebra
%K gpu
%K heterogeneous architecture
%K PaRSEC runtime
%X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e. Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng
%0 Book Section
%B The Princeton Companion to Applied Mathematics
%D 2015
%T High-Performance Computing
%A Jack Dongarra
%A Nicholas J. Higham
%A Mark R. Dennis
%A Paul Glendinning
%A Paul A. Martin
%A Fadil Santosa
%A Jared Tanner
%B The Princeton Companion to Applied Mathematics
%I Princeton University Press
%C Princeton, New Jersey
%P 839-842
%@ 9781400874477
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2015
%T High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B The International Journal of High Performance Computing Applications
%G eng
%R 10.1177/1094342015593158
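For reference, the conjugate-gradient iteration at the core of the HPCG benchmark can be sketched as follows. This hedged NumPy sketch is unpreconditioned; HPCG itself adds preconditioning (its keywords mention multigrid smoothing and additive Schwarz) and fixed sparse data-access patterns, which are omitted here:

```python
# Classic conjugate-gradient iteration for a symmetric positive-definite A.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                      # initial residual
    p = r.copy()                       # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol:
            break
        p = r + (rs_new / rs) * p      # new A-conjugate direction
        rs = rs_new
    return x
```

The benchmark measures how fast a system sustains the sparse matrix-vector products (`A @ p`) and vector updates above, which correlate with real application performance far better than dense LINPACK kernels.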
%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented and, in general, provides the DLA functionality of the popular LAPACK library to heterogeneous architectures of multicore CPUs with coprocessors. The LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B Scientific Programming
%V 23
%8 2015-01
%G eng
%N 1
%R 10.3233/SPR-140404
%0 Generic
%D 2015
%T HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2015-01
%G eng
%U http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf
%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K factorization
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given as the basis for solving linear systems' algorithms – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. The tasks' execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng
%0 Conference Paper
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%D 2014
%T Hybrid Multi-Elimination ILU Preconditioners on GPUs
%A Dimitar Lukarski
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems.
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Journal Article
%J Parallel Computing
%D 2013
%T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Mathias Jacquelin
%A Julien Langou
%A Yves Robert
%K Cluster
%K Distributed memory
%K Hierarchical architecture
%K multi-core
%K numerical linear algebra
%K QR factorization
%X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) a "TS level" for cache-friendliness, (1) a "low level" for decoupled highly parallel inter-node reductions, and (2) a "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to building up performance and (ii) provide insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
%B Parallel Computing
%V 39
%P 212-232
%8 2013-05
%G eng
%N 4-5
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K algorithms
%K bidiagonal reduction
%K bulge chasing
%K data translation layer
%K dynamic scheduling
%K high performance kernels
%K performance
%K tile algorithms
%K two-stage approach
%X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows most of the computation during the first stage (reduction to band form) to be cast into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12,000 × 12,000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL, as well as Intel MKL version 10.2.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%G eng
%N 3
%R 10.1145/2450153.2450154
%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%K exascale
%K hpc challenge
%K hpcc
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%I Taylor and Francis
%C Boca Raton, FL
%@ 978-1-4665-6834-1
%G eng
%& 2
%0 Generic
%D 2013
%T Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%X The explosion of parallelism and heterogeneity in today's computer architectures has created opportunities as well as challenges for redesigning legacy numerical software to harness the power of new hardware. In this paper we address the main challenges in redesigning BLAST, a numerical library that solves the equations of compressible hydrodynamics using high-order finite element methods (FEM) in a moving Lagrangian frame, to support CPU-GPU clusters. We use a hybrid MPI + OpenMP + CUDA programming model that includes two layers: domain-decomposed MPI parallelization and OpenMP + CUDA acceleration within a given domain. To optimize the code, we implemented custom linear algebra kernels and introduced an auto-tuning technique to deal with heterogeneity and load balancing at runtime. Our tests show that 12 Intel Xeon cores and two M2050 GPUs deliver a 24x speedup compared to a single core, and a 2.5x speedup compared to 12 MPI tasks in one node. Further, we achieve perfect weak scaling, demonstrated on a cluster with up to 64 GPUs in 32 nodes. Our choice of programming model and proposed solutions, as related to parallelism and load balancing, specifically targets high-order FEM discretizations, and can be used equally successfully for applications beyond hydrodynamics. A major accomplishment is that we further establish the appeal of high-order FEMs, which, despite their better approximation properties, are often avoided due to their high computational cost. GPUs, as we show, have the potential to make them the method of choice, as the increased computational cost is also localized, e.g., cast as Level 3 BLAS, and thus can be performed very efficiently (close to "free" relative to the usual overheads inherent in sparse computations).
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng
%0 Conference Proceedings
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%D 2012
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%I IEEE Computer Society Press
%C Shanghai, China
%8 2012-05
%G eng
%0 Journal Article
%J IPDPS 2012 (Best Paper)
%D 2012
%T HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters
%A Teng Ma
%A George Bosilca
%A Aurelien Bouteiller
%A Jack Dongarra
%B IPDPS 2012 (Best Paper)
%C Shanghai, China
%8 2012-05
%G eng
%0 Journal Article
%J Acta Numerica
%D 2012
%T High Performance Computing Systems: Status and Outlook
%A Jack Dongarra
%A Aad J. van der Steen
%B Acta Numerica
%I Cambridge University Press
%C Cambridge, UK
%V 21
%P 379-474
%8 2012-05
%G eng
%0 Journal Article
%J ICCS 2012
%D 2012
%T High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng
%0 Generic
%D 2012
%T How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE
%A Julien Langou
%A Bill Hoffman
%A Brad King
%B University of Tennessee Computer Science Technical Report (also LAWN 270)
%8 2012-07
%G eng
%0 Journal Article
%J On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%D 2012
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%E Jeffrey Vetter
%B On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%I Chapman & Hall/CRC Press
%8 2012-00
%G eng
%0 Generic
%D 2011
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report (also Lawn 257)
%8 2011-10
%G eng
%0 Generic
%D 2011
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247)
%8 2011-05
%G eng
%0 Journal Article
%J IEEE Cluster 2011
%D 2011
%T High Performance Dense Linear System Solver with Soft Error Resilience
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%K ft-la
%B IEEE Cluster 2011
%C Austin, TX
%8 2011-09
%G eng
%0 Conference Proceedings
%B Proceedings of MTAGS11
%D 2011
%T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%B Proceedings of MTAGS11
%C Seattle, WA
%8 2011-11
%G eng
%0 Journal Article
%J in GPU Computing Gems, Jade Edition
%D 2011
%T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaief
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%E Wen-mei W. Hwu
%K magma
%K morse
%B in GPU Computing Gems, Jade Edition
%I Elsevier
%V 2
%P 473-484
%8 2011-00
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (submitted)
%D 2010
%T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators
%A Hatem Ltaief
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%K plasma
%B IEEE Transactions on Parallel and Distributed Systems (submitted)
%8 2010-03
%G eng
%0 Conference Proceedings
%B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing
%D 2009
%T A Holistic Approach for Performance Measurement and Analysis for Petascale Applications
%A Heike Jagode
%A Jack Dongarra
%A Sadaf Alam
%A Jeffrey Vetter
%A W. Spear
%A Allen D. Malony
%E Gabrielle Allen
%K point
%K test
%B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing
%I Springer-Verlag Berlin Heidelberg 2009
%C Baton Rouge, Louisiana
%V 2009
%P 686-695
%8 2009-05
%G eng
%0 Journal Article
%J Recent developments in Grid Technology and Applications
%D 2008
%T High Performance GridRPC Middleware
%A Yves Caniou
%A Eddy Caron
%A Frederic Desprez
%A Hidemoto Nakada
%A Yoshio Tanaka
%A Keith Seymour
%E George A. Gravvanis
%E John P. Morrison
%E Hamid R. Arabnia
%E D. A. Power
%K netsolve
%B Recent developments in Grid Technology and Applications
%I Nova Science Publishers
%8 2008-00
%G eng
%0 Journal Article
%J in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14)
%D 2008
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E Greg Wilson
%B in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14)
%P 243-282
%8 2008-01
%G eng
%0 Generic
%D 2008
%T HPCS Library Study Effort
%A Jack Dongarra
%A James Demmel
%A Parry Husbands
%A Piotr Luszczek
%B University of Tennessee Computer Science Technical Report, UT-CS-08-617
%8 2008-01
%G eng
%0 Journal Article
%J International Journal for High Performance Computer Applications
%D 2007
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Jack Dongarra
%A Piotr Luszczek
%B International Journal for High Performance Computer Applications
%V 21
%P 360-369
%8 2007-00
%G eng
%0 Journal Article
%J in Beautiful Code: Leading Programmers Explain How They Think
%D 2007
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E Greg Wilson
%B in Beautiful Code: Leading Programmers Explain How They Think
%I O'Reilly Media, Inc.
%8 2007-06
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2006
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Piotr Luszczek
%K hpcc
%K lfc
%B International Journal of High Performance Computing Applications (to appear)
%8 2006-00
%G eng
%0 Journal Article
%J Euro PVM/MPI 2006
%D 2006
%T High Performance RDMA Protocols in HPC
%A Galen M. Shipman
%A George Bosilca
%A Arthur B. Maccabe
%B Euro PVM/MPI 2006
%C Bonn, Germany
%8 2006-09
%G eng
%0 Journal Article
%J HeteroPar 2006
%D 2006
%T A High-Performance, Heterogeneous MPI
%A Richard L. Graham
%A Galen M. Shipman
%A Brian Barrett
%A Ralph Castain
%A George Bosilca
%A Andrew Lumsdaine
%B HeteroPar 2006
%C Barcelona, Spain
%8 2006-09
%G eng
%0 Conference Proceedings
%B SC06 Conference Tutorial
%D 2006
%T The HPC Challenge (HPCC) Benchmark Suite
%A Piotr Luszczek
%A David Bailey
%A Jack Dongarra
%A Jeremy Kepner
%A Robert Lucas
%A Rolf Rabenseifner
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%B SC06 Conference Tutorial
%I IEEE
%C Tampa, Florida
%8 2006-11
%G eng
%0 Conference Proceedings
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%D 2005
%T Hash Functions for Datatype Signatures in MPI
%A George Bosilca
%A Jack Dongarra
%A Graham Fagg
%A Julien Langou
%E Beniamino Di Martino
%K ftmpi
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%I Springer-Verlag Berlin
%C Sorrento (Naples), Italy
%V 3666
%P 76-83
%8 2005-09
%G eng
%0 Journal Article
%J SC|05 Tutorial - S13
%D 2005
%T HPC Challenge v1.x Benchmark Suite
%A Piotr Luszczek
%A David Koester
%K hpcc
%B SC|05 Tutorial - S13
%C Seattle, Washington
%8 2005-01
%G eng
%0 Journal Article
%J Advances in Parallel Computing
%D 2003
%T Hardware-Counter Based Automatic Performance Analysis of Parallel Programs
%A Felix Wolf
%A Bernd Mohr
%K kojak
%K papi
%X The KOJAK performance-analysis environment identifies a large number of performance problems on parallel computers with SMP nodes. The current version concentrates on parallelism-related performance problems that arise from an inefficient usage of the parallel programming interfaces MPI and OpenMP, while ignoring individual CPU performance. This chapter describes an extended design of KOJAK capable of diagnosing low individual-CPU performance based on hardware-counter information and of integrating the results with those of the parallelism-centered analysis. The performance of parallel applications is determined by a variety of different factors. Performance of single components frequently influences the overall behavior in unexpected ways. Application programmers on current parallel machines have to deal with numerous performance-critical aspects: different modes of parallel execution, such as message passing, multi-threading, or even a combination of the two, and performance on an individual CPU that is determined by the interaction of different functional units. The KOJAK analysis process is composed of two parts: a semi-automatic instrumentation of the user application followed by an automatic analysis of the generated performance data. KOJAK's instrumentation software runs on most major UNIX platforms and works on multiple levels, including source-code, compiler, and linker.
%B Advances in Parallel Computing
%I Elsevier
%C Dresden, Germany
%V 13
%P 753-760
%8 2004-01
%G eng
%R https://doi.org/10.1016/S0927-5452(04)80092-3
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T High Performance Computing for Computational Science
%A Jose Palma
%A Jack Dongarra
%A Vicente Hernández
%E Antonio Augusto Sousa
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%C VECPAR 2002, 5th International Conference June 26-28, 2002
%V 2565
%8 2003-01
%G eng
%0 Conference Proceedings
%B Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC
%D 2003
%T High Performance Computing Trends and Self Adapting Numerical Software
%A Jack Dongarra
%B Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC
%I Springer-Verlag, Heidelberg
%C Tokyo-Odaiba, Japan
%V 2858
%P 1-9
%8 2003-01
%G eng
%0 Conference Proceedings
%B Information Processing Society of Japan Symposium Series
%D 2003
%T High Performance Computing Trends, Supercomputers, Clusters, and Grids
%A Jack Dongarra
%B Information Processing Society of Japan Symposium Series
%V 2003
%P 55-58
%8 2003-01
%G eng
%0 Generic
%D 2002
%T Hardware Software Server in NetSolve
%A Sudesh Agrawal
%K netsolve
%B ICL Technical Report
%8 2002-01
%G eng
%0 Journal Article
%J Future Generation Computer Systems
%D 2002
%T HARNESS Fault Tolerant MPI Design, Usage and Performance Issues
%A Graham Fagg
%A Jack Dongarra
%B Future Generation Computer Systems
%V 18
%P 1127-1142
%8 2002-01
%G eng
%0 Journal Article
%J Parallel Computing
%D 2001
%T HARNESS and Fault Tolerant MPI
%A Graham Fagg
%A Antonin Bukovsky
%A Jack Dongarra
%B Parallel Computing
%V 27
%P 1479-1496
%8 2001-01
%G eng
%0 Journal Article
%J HERMIS
%D 2001
%T High Performance Computing Trends
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%B HERMIS
%V 2
%P 155-163
%8 2001-11
%G eng
%0 Conference Proceedings
%B FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear)
%D 2000
%T High Performance Computing Today
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%B FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear)
%8 2000-01
%G eng
%0 Journal Article
%J International Journal on Future Generation Computer Systems
%D 1999
%T HARNESS: A Next Generation Distributed Virtual Machine
%A Micah Beck
%A Jack Dongarra
%A Graham Fagg
%A Al Geist
%A Paul Gray
%A James Kohl
%A Mauro Migliardi
%A Keith Moore
%A Terry Moore
%A Philip Papadopoulos
%A Stephen L. Scott
%A Vaidy Sunderam
%K harness
%B International Journal on Future Generation Computer Systems
%V 15
%P 571-582
%8 1999-01
%G eng