%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2020
%T Load-balancing Sparse Matrix Vector Product Kernels on GPUs
%A Hartwig Anzt
%A Yen-Chen Chen
%A Terry Cojean
%A Jack Dongarra
%A Goran Flegar
%A Ratik Nayak
%A Enrique S. Quintana-Orti
%A Yaohung Tsai
%A Weichung Wang
%X Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (SpMV) kernels designed for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in SIMD-friendly Ellpack format (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA's cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
%B ACM Transactions on Parallel Computing
%8 2020-03
%G eng
%N 2
%R https://doi.org/10.1145/3380930
%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Urike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T Adaptive Precision in Block-Jacobi Preconditioning for Iterative Sparse Linear System Solvers
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Nicholas J. Higham
%A Enrique S. Quintana-Orti
%K adaptive precision
%K block-Jacobi preconditioning
%K communication reduction
%K energy efficiency
%K Krylov subspace methods
%K sparse linear systems
%X Summary We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.
%B Concurrency and Computation: Practice and Experience
%V 31
%P e4460
%8 2019-03
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4460
%R https://doi.org/10.1002/cpe.4460
%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%D 2019
%T Approximate and Exact Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%X We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for - and exploiting the characteristics of - "pleasant" data distributions. At the same time, as the SampleSelect does not work on the actual values but the ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. Additionally to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R 10.1109/IPDPSW.2019.00088
%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2019
%T Are we Doing the Right Thing? – A Critical Analysis of the Academic HPC Community
%A Hartwig Anzt
%A Goran Flegar
%X Like in any other research field, academically surviving in the High Performance Computing (HPC) community generally requires to publish papers, in the bast case many of them and in high-ranked journals or at top-tier conferences. As a result, the number of scientific papers published each year in this relatively small community easily outnumbers what a single researcher can read. At the same time, many of the proposed and analyzed strategies, algorithms, and hardware-optimized implementations never make it beyond the prototype stage, as they are abandoned once they served the single purpose of yielding (another) publication. In a time and field where high-quality manpower is a scarce resource, this is extremely inefficient. In this position paper we promote a radical paradigm shift towards accepting high-quality software patches to community software packages as legitimate conference contributions. In consequence, the reputation and appointability of researchers is no longer based on the classical scientific metrics, but on the quality and documentation of open source software contributions - effectively improving and accelerating the collaborative development of community software.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R 10.1109/IPDPSW.2019.00122
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T A Customized Precision Format Based on Mantissa Segmentation for Accelerating Sparse Linear Algebra
%A Thomas Gruetzmacher
%A Terry Cojean
%A Goran Flegar
%A Fritz Göbel
%A Hartwig Anzt
%B Concurrency and Computation: Practice and Experience
%V 40319
%8 2019-01
%G eng
%N 262
%R https://doi.org/10.1002/cpe.5418
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T PAPI Software-Defined Events for in-Depth Performance Analysis
%A Heike Jagode
%A Anthony Danalis
%A Hartwig Anzt
%A Jack Dongarra
%X The methodology and standardization layer provided by the Performance Application Programming Interface (PAPI) has played a vital role in application profiling for almost two decades. It has enabled sophisticated performance analysis tool designers and performance-conscious scientists to gain insights into their applications by simply instrumenting their code using a handful of PAPI functions that “just work” across different hardware components. In the past, PAPI development had focused primarily on hardware-specific performance metrics. However, the rapidly increasing complexity of software infrastructure poses new measurement and analysis challenges for the developers of large-scale applications. In particular, acquiring information regarding the behavior of libraries and runtimes—used by scientific applications—requires low-level binary instrumentation, or APIs specific to each library and runtime. No uniform API for monitoring events that originate from inside the software stack has emerged. In this article, we present our efforts to extend PAPI’s role so that it becomes the de facto standard for exposing performance-critical events, which we refer to as software-defined events (SDEs), from different software layers. Upgrading PAPI with SDEs enables monitoring of both types of performance events—hardware- and software-related events—in a uniform way, through the same consistent PAPI. The goal of this article is threefold. First, we motivate the need for SDEs and describe our design decisions regarding the functionality we offer through PAPI’s new SDE interface. Second, we illustrate how SDEs can be utilized by different software packages, specifically, by showcasing their use in the numerical linear algebra library MAGMA-Sparse, the tensor algebra library TAMM that is part of the NWChem suite, and the compiler-based performance analysis tool Byfl. Third, we provide a performance analysis of the overhead that results from monitoring SDEs and discuss the trade-offs between overhead and functionality.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1113-1127
%8 2019-11
%G eng
%U https://doi.org/10.1177/1094342019846287
%N 6
%0 Journal Article
%J Parallel Computing
%D 2019
%T Parallel Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%K approximate selection
%K gpu
%K kth order statistics
%K multiselection
%K parallel selection algorithm
%B Parallel Computing
%8 2019-11
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819119301796
%! Parallel Computing
%R https://doi.org/10.1016/j.parco.2019.102588
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T ParILUT – A Parallel Threshold ILU for GPUs
%A Hartwig Anzt
%A Tobias Ribizel
%A Goran Flegar
%A Edmond Chow
%A Jack Dongarra
%X In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored towards GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations, and outperform existing multithreaded CPU implementations.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPS.2019.00033
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Toward a Modular Precision Ecosystem for High-Performance Computing
%A Hartwig Anzt
%A Goran Flegar
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%K conjugate gradient
%K GPUs
%K Jacobi method
%K Modular precision
%K multicore processors
%K PageRank
%K parallel numerical linear algebra
%X With the memory bandwidth of current computer architectures being significantly slower than the (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power in today’s high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a “modular precision ecosystem” that allows for more flexibility in terms of customized data access. For memory-bounded scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm—two applications that are popular in the communities and at the same characteristic representatives for the field of numerical linear algebra and data analytics, respectively.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1069-1078
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019846547
%0 Journal Article
%J Proceedings in Applied Mathematics and Mechanics
%D 2019
%T Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility
%A Hartwig Anzt
%A Terry Cojean
%A Eileen Kuhn
%X In this position paper we argue for implementing an alternative peer review process for scientific computing contributions that promotes high quality scientific software developments as fully‐recognized conference submission. The idea is based on leveraging the code reviewers' feedback on scientific software contributions to community software developments as a third‐party review involvement. Providing open access to this technical review would complement the scientific review of the contribution, efficiently reduce the workload of the undisclosed reviewers, improve the algorithm implementation quality and software sustainability, and ensure full reproducibility of the reported results. Using this process creates incentives to publish scientific algorithms in open source software – instead of designing prototype algorithms with the unique purpose of publishing a paper. In addition, the comments and suggestions of the community being archived in the versioning control systems ensure that also community reviewers are receiving credit for the review contributions – unlike reviewers in the traditional peer review process. Finally, it reflects the particularity of the scientific computing community using conferences rather than journals as the main publication venue.
%B Proceedings in Applied Mathematics and Mechanics
%V 19
%8 2019-11
%G eng
%N 1
%R https://doi.org/10.1002/pamm.201900490
%0 Conference Paper
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%D 2019
%T Towards Continuous Benchmarking
%A Hartwig Anzt
%A Yen Chen Chen
%A Terry Cojean
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Enrique S. Quintana-Orti
%A Yuhsiang M. Tsai
%A Weichung Wang
%X We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository, and visualizes it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics.
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%I ACM Press
%C Zurich, Switzerland
%8 2019-06
%@ 9781450367707
%G eng
%R https://doi.org/10.1145/3324989.3325719
%0 Journal Article
%J Parallel Computing
%D 2019
%T Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K Batched algorithms
%K Block-Jacobi
%K Gauss–Jordan elimination
%K Graphics processor
%K matrix inversion
%K sparse linear systems
%X In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
%B Parallel Computing
%V 81
%P 131-146
%8 2019-01
%G eng
%R https://doi.org/10.1016/j.parco.2017.12.006
%0 Conference Paper
%B 8th Workshop on Irregular Applications: Architectures and Algorithms
%D 2018
%T High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation
%A Anzt, Hartwig
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%A Scheidegger, Florian
%B 8th Workshop on Irregular Applications: Architectures and Algorithms
%G eng
%0 Journal Article
%J Parallel Computing
%D 2018
%T Incomplete Sparse Approximate Inverses for Parallel Preconditioning
%A Hartwig Anzt
%A Thomas Huckle
%A Jürgen Bräckle
%A Jack Dongarra
%X In this paper, we propose a new preconditioning method that can be seen as a generalization of block-Jacobi methods, or as a simplification of the sparse approximate inverse (SAI) preconditioners. The “Incomplete Sparse Approximate Inverses” (ISAI) is in particular efficient in the solution of sparse triangular linear systems of equations. Those arise, for example, in the context of incomplete factorization preconditioning. ISAI preconditioners can be generated via an algorithm providing fine-grained parallelism, which makes them attractive for hardware with a high concurrency level. In a study covering a large number of matrices, we identify the ISAI preconditioner as an attractive alternative to exact triangular solves in the context of incomplete factorization preconditioning.
%B Parallel Computing
%V 71
%P 1–22
%8 2018-01
%G eng
%U http://www.sciencedirect.com/science/article/pii/S016781911730176X
%! Parallel Computing
%R 10.1016/j.parco.2017.10.003
%0 Conference Paper
%B SBAC-PAD
%D 2018
%T A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs
%A Anzt, Hartwig
%A Jack Dongarra
%B SBAC-PAD
%I IEEE
%C Lyon, France
%G eng
%U https://ieeexplore.ieee.org/document/8645946
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Optimization and Performance Evaluation of the IDR Iterative Krylov Solver on GPUs
%A Hartwig Anzt
%A Moritz Kreutzer
%A Eduardo Ponce
%A Gregory D. Peterson
%A Gerhard Wellein
%A Jack Dongarra
%K co-design
%K gpu
%K Induced dimension reduction (IDR)
%K kernel fusion
%K kernel overlap
%K roofline performance model
%X In this paper, we present an optimized GPU implementation for the induced dimension reduction algorithm. We improve data locality, combine it with an efficient sparse matrix vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound.
%B The International Journal of High Performance Computing Applications
%V 32
%P 220–230
%8 2018-03
%G eng
%R https://doi.org/10.1177/1094342016646844
%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2018
%T ParILUT - A New Parallel Threshold ILU
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X We propose a parallel algorithm for computing a threshold incomplete LU (ILU) factorization. The main idea is to interleave a parallel fixed-point iteration that approximates an incomplete factorization for a given sparsity pattern with a procedure that adjusts the pattern. We describe and test a strategy for identifying nonzeros to be added and nonzeros to be removed from the sparsity pattern. The resulting pattern may be different and more effective than that of existing threshold ILU algorithms. Also in contrast to other parallel threshold ILU algorithms, much of the new algorithm has fine-grained parallelism.
%B SIAM Journal on Scientific Computing
%I SIAM
%V 40
%P C503–C519
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1137/16M1079506
%0 Generic
%D 2018
%T Software-Defined Events (SDEs) in MAGMA-Sparse
%A Heike Jagode
%A Anthony Danalis
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Stanimire Tomov
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng
%0 Generic
%D 2018
%T Solver Interface & Performance on Cori
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning
%A Edmond Chow
%A Hartwig Anzt
%A Jennifer Scott
%A Jack Dongarra
%X When using incomplete factorization preconditioners with an iterative method to solve large sparse linear systems, each application of the preconditioner involves solving two sparse triangular systems. These triangular systems are challenging to solve efficiently on computers with high levels of concurrency. On such computers, it has recently been proposed to use Jacobi iterations, which are highly parallel, to approximately solve the triangular systems from incomplete factorizations. The effectiveness of this approach, however, is problem-dependent: the Jacobi iterations may not always converge quickly enough for all problems. Thus, as a necessary and important step to evaluate this approach, we experimentally test the approach on a large number of realistic symmetric positive definite problems. We also show that by using block Jacobi iterations, we can extend the range of problems for which such an approach can be effective. For block Jacobi iterations, it is essential for the blocking to be cognizant of the matrix structure.
%B Journal of Parallel and Distributed Computing
%V 119
%P 219–230
%8 2018-11
%G eng
%R https://doi.org/10.1016/j.jpdc.2018.04.017
%0 Conference Paper
%B SBAC-PAD
%D 2018
%T Variable-Size Batched Condition Number Calculation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Thomas Gruetzmacher
%B SBAC-PAD
%C Lyon, France
%8 2018-09
%G eng
%U https://ieeexplore.ieee.org/document/8645907
%0 Conference Proceedings
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%D 2017
%T Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K block-Jacobi preconditioner
%K Gauss-Jordan elimination
%K graphics processing units (GPUs)
%K iterative methods
%K matrix inversion
%K sparse linear systems
%X In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%S PMAM'17
%I ACM
%C New York, NY, USA
%P 1–10
%8 2017-02
%@ 978-1-4503-4883-6
%G eng
%U http://doi.acm.org/10.1145/3026937.3026940
%R 10.1145/3026937.3026940
%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4
%0 Generic
%D 2017
%T Flexible Batched Sparse Matrix Vector Product on GPUs
%A Hartwig Anzt
%A Collins, Gary
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%I ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C Denver, Colorado
%8 2017-11
%G eng
%0 Conference Paper
%B 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%D 2017
%T Flexible Batched Sparse Matrix-Vector Product on GPUs
%A Hartwig Anzt
%A Gary Collins
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%X We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support three most commonly used sparse storage formats: CSR, COO and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.
%B 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%I ACM Press
%C Denver, CO
%8 2017-11
%G eng
%R http://dx.doi.org/10.1145/3148226.3148230
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report
%0 Journal Article
%J Parallel Computing
%D 2017
%T Preconditioned Krylov Solvers on GPUs
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K gpu
%K ILU
%K Jacobi
%K Krylov solvers
%K Preconditioning
%X In this paper, we study the effect of enhancing GPU-accelerated Krylov solvers with preconditioners. We consider the BiCGSTAB, CGS, QMR, and IDR(s) Krylov solvers. For a large set of test matrices, we assess the impact of Jacobi and incomplete factorization preconditioning on the solvers’ numerical stability and time-to-solution performance. We also analyze how the use of a preconditioner impacts the choice of the fastest solver.
%B Parallel Computing
%8 2017-06
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819117300777
%! Parallel Computing
%R 10.1016/j.parco.2017.05.006
%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%A Andres E. Thomas
%X In this work we present new kernels for the generation and application of block-Jacobi precon-ditioners that accelerate the iterative solution of sparse linear systems on graphics processing units (GPUs). Our approach departs from the conventional LU factorization and decomposes the diagonal blocks of the matrix using the Gauss-Huard method. When enhanced with column pivoting, this method is as stable as LU with partial/row pivoting. Due to extensive use of GPU registers and integration of implicit pivoting, our variable size batched Gauss-Huard implementation outperforms the batched version of LU factorization. In addition, the application kernel combines the conventional two-stage triangular solve procedure, consisting of a backward solve followed by a forward solve, into a single stage that performs both operations simultaneously.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%V 108
%P 1783-1792
%8 2017-06
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.186
%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 2017-08
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18
%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48
%0 Journal Article
%J VECPAR
%D 2016
%T Accelerating the Conjugate Gradient Algorithm with GPU in CFD Simulations
%A Hartwig Anzt
%A Marc Baboulin
%A Jack Dongarra
%A Yvan Fournier
%A Frank Hulsemann
%A Amal Khabou
%A Yushan Wang
%X This paper illustrates how GPU computing can be used to accelerate computational fluid dynamics (CFD) simulations. For sparse linear systems arising from finite volume discretization, we evaluate and optimize the performance of Conjugate Gradient (CG) routines designed for manycore accelerators and compare against an industrial CPU-based implementation. We also investigate how the recent advances in preconditioning, such as iterative Incomplete Cholesky (IC, as symmetric case of ILU) preconditioning, match the requirements for solving real world problems.
%B VECPAR
%G eng
%U http://hgpu.org/?p=16264
%0 Conference Proceedings
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2016
%T Batched Generation of Incomplete Sparse Approximate Inverses on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Thomas Huckle
%A Jack Dongarra
%X Incomplete Sparse Approximate Inverses (ISAI) have recently been shown to be an attractive alternative to exact sparse triangular solves in the context of incomplete factorization preconditioning. In this paper we propose a batched GPU-kernel for the efficient generation of ISAI matrices. Utilizing only thread-local memory allows for computing the ISAI matrix with very small memory footprint. We demonstrate that this strategy is faster than the existing strategy for generating ISAI matrices, and use a large number of test matrices to assess the algorithm's efficiency in an iterative solver setting.
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%S ScalA '16
%P 49–56
%8 2016-11
%@ 978-1-5090-5222-6
%G eng
%R 10.1109/ScalA.2016.11
%0 Generic
%D 2016
%T On block-asynchronous execution on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X This paper experimentally investigates how GPUs execute instructions when used for general purpose computing (GPGPU). We use a light-weight realizing a vector operation to analyze which vector entries are updated subsequently, and identify regions where parallel execution can be expected. The results help us to understand how GPUs operate, and map this operation mode to the mathematical concept of asynchronism. In particular it helps to understand the effects that can occur when implementing a fixed-point method using in-place updates on GPU hardware.
%B LAPACK Working Note
%8 2016-11
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn291.pdf
%0 Conference Proceedings
%B Software for Exascale Computing - SPPEXA
%D 2016
%T Domain Overlap for Iterative Sparse Triangular Solves on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%E Hans-Joachim Bungartz
%E Philipp Neumann
%E Wolfgang E. Nagel
%X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
%B Software for Exascale Computing - SPPEXA
%S Lecture Notes in Computer Science and Engineering
%I Springer International Publishing
%V 113
%P 527–545
%8 2016-09
%G eng
%R 10.1007/978-3-319-40528-5_24
%0 Conference Proceedings
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2016
%T Efficiency of General Krylov Methods on GPUs – An Experimental Study
%A Hartwig Anzt
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K algorithmic bombardment
%K BiCGSTAB
%K CGS
%K Convergence
%K Electric breakdown
%K gpu
%K graphics processing units
%K Hardware
%K IDR(s)
%K Krylov solver
%K Libraries
%K linear systems
%K QMR
%K Sparse matrices
%X This paper compares different Krylov methods based on short recurrences with respect to their efficiency whenimplemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergencecharacteristics. For a large set of test matrices taken from theUniversity of Florida Matrix Collection, we evaluate the methods'performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and executiontime. We also analyze whether the methods are "orthogonal"in terms of problem suitability. We propose best practicesfor choosing methods in a "black box" scenario, where noinformation about the optimal solver is available.
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%P 683-691
%8 2016-05
%G eng
%R 10.1109/IPDPSW.2016.45
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%D 2016
%T Efficiency of General Krylov Methods on GPUs – An Experimental Study
%A Hartwig Anzt
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K algorithmic bombardment
%K BiCGSTAB
%K CGS
%K gpu
%K IDR(s)
%K Krylov solver
%K QMR
%X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods’ performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are “orthogonal” in terms of problem suitability. We propose best practices for choosing methods in a “black box” scenario, where no information about the optimal solver is available.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%R 10.1109/IPDPSW.2016.45
%0 Journal Article
%J Journal of Computational Science
%D 2016
%T Fine-grained Bit-Flip Protection for Relaxation Methods
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Orti
%K Bit flips
%K Fault tolerance
%K High Performance Computing
%K iterative solvers
%K Jacobi method
%K sparse linear systems
%X Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
%B Journal of Computational Science
%8 2016-11
%G eng
%R https://doi.org/10.1016/j.jocs.2016.11.013
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams1 and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Journal Article
%J Acta Numerica
%D 2016
%T Linear Algebra Software for Large-Scale Accelerated Multicore Computing
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A undefined
%A Asim YarKhan
%X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
%B Acta Numerica
%V 25
%P 1-160
%8 2016-05
%G eng
%R 10.1017/S0962492916000015
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2016
%T On the performance and energy efficiency of sparse linear algebra on GPUs
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.
%B International Journal of High Performance Computing Applications
%8 2016-10
%G eng
%U http://hpc.sagepub.com/content/early/2016/10/05/1094342016672081.abstract
%R 10.1177/1094342016672081
%0 Journal Article
%J Numerical Algorithms
%D 2016
%T Updating Incomplete Factorization Preconditioners for Model Order Reduction
%A Hartwig Anzt
%A Edmond Chow
%A Jens Saak
%A Jack Dongarra
%K key publication
%X When solving a sequence of related linear systems by iterative methods, it is common to reuse the preconditioner for several systems, and then to recompute the preconditioner when the matrix has changed significantly. Rather than recomputing the preconditioner from scratch, it is potentially more efficient to update the previous preconditioner. Unfortunately, it is not always known how to update a preconditioner, for example, when the preconditioner is an incomplete factorization. A recently proposed iterative algorithm for computing incomplete factorizations, however, is able to exploit an initial guess, unlike existing algorithms for incomplete factorizations. By treating a previous factorization as an initial guess to this algorithm, an incomplete factorization may thus be updated. We use a sequence of problems from model order reduction. Experimental results using an optimized GPU implementation show that updating a previous factorization can be inexpensive and effective, making solving sequences of linear systems a potential niche problem for the iterative incomplete factorization algorithm.
%B Numerical Algorithms
%V 73
%P 611–630
%8 2016-02
%G eng
%N 3
%R 10.1007/s11075-016-0110-2
%0 Conference Paper
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%D 2015
%T Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs
%A Mark Gates
%A Hartwig Anzt
%A Jakub Kurzak
%A Jack Dongarra
%X In this paper we accelerate the Alternating Least Squares (ALS) algorithm used for generating product recommendations on the basis of implicit feedback datasets. We approach the algorithm with concepts proven to be successful in High Performance Computing. This includes the formulation of the algorithm as a mix of cache-optimized algorithm-specific kernels and standard BLAS routines, acceleration via graphics processing units (GPUs), use of parallel batched kernels, and autotuning to identify performance winners. For benchmark datasets, the multi-threaded CPU implementation we propose achieves more than a 10 times speedup over the implementations available in the GraphLab and Spark MLlib software packages. For the GPU implementation, the parameters of an algorithm-specific kernel were optimized using a comprehensive autotuning sweep. This results in an additional 2 times speedup over our CPU implementation.
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%I IEEE
%C Santa Clara, CA
%8 2015-11
%G eng
%0 Conference Paper
%B Spring Simulation Multi-Conference 2015 (SpringSim'15)
%D 2015
%T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modied sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously accelerates the computation of a set of consecutive SpMVs significantly. Comparing the performance against similar routines from Intel's MKL and NVIDIA's cuSPARSE library we identify appealing performance improvements. We integrate it into the highly optimized LOBPCG implementation. Compared to the BLOBEX CPU implementation running on two eight-core Intel Xeon E5-2690s, we accelerate the computation of a small set of eigenvectors using NVIDIA's K40 GPU by typically more than an order of magnitude.
%B Spring Simulation Multi-Conference 2015 (SpringSim'15)
%I SCS
%C Alexandria, VA
%8 2015-04
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction
%A Hartwig Anzt
%A William Sawyer
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%B International Journal of High Performance Computing Applications
%G eng
%0 Conference Paper
%B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15)
%D 2015
%T Adaptive Precision Solvers for Sparse Linear Systems
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Orti
%B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Conference Paper
%B International Supercomputing Conference (ISC 2015)
%D 2015
%T Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs
%A Edmond Chow
%A Hartwig Anzt
%A Jack Dongarra
%B International Supercomputing Conference (ISC 2015)
%C Frankfurt, Germany
%8 2015-07
%G eng
%0 Conference Paper
%B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15)
%D 2015
%T Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a SpM-matrix product (SpMM), that achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
%B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15)
%I ACM
%C San Francisco, CA
%8 2015-02
%@ 978-1-4503-3404-4
%G eng
%R 10.1145/2712386.2712387
%0 Journal Article
%J Concurrency in Computation: Practice and Experience
%D 2015
%T Experiences in autotuning matrix multiplication for energy minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B Concurrency in Computation: Practice and Experience
%V 27
%P 5096-5113
%8 2015-12
%G eng
%N 17
%R 10.1002/cpe.3516
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K Autotuning
%K energy efficiency
%K hardware accelerators
%K matrix multiplication
%K power
%X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 5096 - 5113
%8 12-Oct
%G eng
%U http://doi.wiley.com/10.1002/cpe.3516https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Fcpe.3516
%N 17
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3516
%0 Conference Paper
%B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing
%D 2015
%T GPU-accelerated Co-design of Induced Dimension Reduction: Algorithmic Fusion and Kernel Overlap
%A Hartwig Anzt
%A Eduardo Ponce
%A Gregory D. Peterson
%A Jack Dongarra
%X In this paper we present an optimized GPU co-design of the Induced Dimension Reduction (IDR) algorithm for solving linear systems. Starting from a baseline implementation based on the generic BLAS routines from the MAGMA software library, we apply optimizations that are based on kernel fusion and kernel overlap. Runtime experiments are used to investigate the benefit of the distinct optimization techniques for different variants of the IDR algorithm. A comparison to the reference implementation reveals that the interplay between them can succeed in cutting the overall runtime by up to about one third.
%B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2015
%T Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs
%A Jakub Kurzak
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%8 2015-11
%G eng
%0 Conference Paper
%B EuroPar 2015
%D 2015
%T Iterative Sparse Triangular Solves for Preconditioning
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X Sparse triangular solvers are typically parallelized using level scheduling techniques, but parallel eciency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations.
%B EuroPar 2015
%I Springer Berlin
%C Vienna, Austria
%8 2015-08
%G eng
%U http://dx.doi.org/10.1007/978-3-662-48096-0_50
%R 10.1007/978-3-662-48096-0_50
%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Random-Order Alternating Schwarz for Sparse Triangular Solves
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%X Block-asynchronous Jacobi is an iteration method where a locally synchronous iteration is embedded in an asynchronous global iteration. The unknowns are partitioned into small subsets, and while the components within the same subset are iterated in Jacobi fashion, no update order in-between the subsets is enforced. The values of the non-local entries remain constant during the local iterations, which can result in slow inter-subset information propagation and slow convergence. Interpreting of the subsets as subdomains allows to transfer the concept of domain overlap typically enhancing the information propagation to block-asynchronous solvers. In this talk we explore the impact of overlapping domains to convergence and performance of block-asynchronous Jacobi iterations, and present results obtained by running this solver class on state-of-the-art HPC systems.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng
%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15)
%D 2015
%T Tuning Stationary Iterative Solvers for Fault Resilience
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Orti
%X As the transistor’s feature size decreases following Moore’s Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications, and diminishing the confidence of users. As a result, resilience is considered the most difficult under addressed issue faced by the High Performance Computing community. In this paper, we address the design of error resilient iterative solvers for sparse linear systems. Contrary to most previous approaches, based on Krylov subspace methods, for this purpose we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Generic
%D 2014
%T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high-performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone for not only block-Krylov, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM vs. SpMV-based algorithms is up to six on GPUs like NVIDIA’s K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly-optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5x. faster on a K40 GPU. For comparison to other computational loads, the same GPU to CPU performance acceleration is observed for the SpMV product, as well as dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5x. In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2014-10
%G eng
%0 Conference Paper
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%D 2014
%T Hybrid Multi-Elimination ILU Preconditioners on GPUs
%A Dimitar Lukarski
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X Abstract—Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a selfadaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems.
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Generic
%D 2014
%T Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X Numerical methods in sparse linear algebra typically rely on a fast and efficient matrix vector product, as this usually is the backbone of iterative algorithms for solving eigenvalue problems or linear systems. Against the background of a large diversity in the characteristics of high performance computer architectures, it is a challenge to derive a cross-platform efficient storage format along with fast matrix vector kernels. Recently, attention focused on the SELL-C- format, a sliced ELLPACK format enhanced by row-sorting to reduce the fill in when padding rows with zeros. In this paper we propose an additional modification resulting in the padded sliced ELLPACK (SELLP) format, for which we develop a sparse matrix vector CUDA kernel that is able to efficiently exploit the computing power of NVIDIA GPUs. We show that the kernel we developed outperforms straight-forward implementations for the widespread CSR and ELLPACK formats, and is highly competitive to the implementations in the highly optimized CUSPARSE library.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2014-04
%G eng
%0 Journal Article
%J Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences
%D 2014
%T Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems
%A Hartwig Anzt
%A Enrique S. Quintana-Orti
%K energy efficiency
%K graphics processing units
%K High Performance Computing
%K iterative solvers
%K multicore processors
%K sparse linear systems
%X While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draft and energy spent for this purpose is increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draft as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyze the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance.
%B Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences
%V 372
%8 2014-07
%G eng
%N 2018
%R 10.1098/rsta.2013.0279
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Improving the performance of CA-GMRES on multicores with multiple GPUs
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Hoemmen
%A Jack Dongarra
%X Abstract—The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Optimizing Krylov Subspace Solvers on Graphics Processing Units
%A Stanimire Tomov
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Hartwig Anzt
%A William Sawyer
%X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPUhost communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of highlevel sparse linear algebra libraries.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures
%A Hartwig Anzt
%A Dimitar Lukarski
%A Stanimire Tomov
%A Jack Dongarra
%X Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this two-fold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-coloured Gauss-Seidel, ILU(0) and multi-colored ILU(0), shows that the twofold goal of a numerically stable cross-platform performant algorithm is achieved.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors
%A José I. Aliaga
%A Hartwig Anzt
%A Maribel Castillo
%A Juan C. Fernández
%A Germán León
%A Joaquín Pérez
%A Enrique S. Quintana-Orti
%K CG
%K CPUs
%K energy efficiency
%K GPUs
%K low-power architectures
%X In this paper, we analyze the interactions occurring in the triangle performance-power-energy for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to build faster processors. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights about the benefits of hands-on optimizations we include runtime and energy efficiency results for both out-of-the-box usage relying exclusively on compiler optimizations, and implementations manually optimized for target architectures, that range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 885-904
%8 2014-09
%G eng
%U http://dx.doi.org/10.1002/cpe.3341
%N 4
%& 885
%R 10.1002/cpe.3341
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2013
%T A Block-Asynchronous Relaxation Method for Graphics Processing Units
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%X In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPUs computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs- or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes–using multiple iterations within the ‘‘subdomain’’ handled by a GPU thread block–we manage to not only recover the loss of global convergence but often accelerate convergence of up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing.
%B Journal of Parallel and Distributed Computing
%V 73
%P 1613–1626
%8 2013-12
%G eng
%N 12
%R http://dx.doi.org/10.1016/j.jpdc.2013.05.008
%0 Journal Article
%J ICCS 2012
%D 2012
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng
%0 Journal Article
%J EuroPar 2012 (also LAWN 260)
%D 2012
%T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%A Vincent Heuveline
%B EuroPar 2012 (also LAWN 260)
%C Rhodes Island, Greece
%8 2012-08
%G eng
%0 Conference Proceedings
%B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper)
%D 2012
%T Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper)
%C Rhodes Island, Greece
%8 2012-08
%G eng
%0 Journal Article
%J SIAM Journal on Computing (submitted)
%D 2012
%T Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems
%A Hartwig Anzt
%A Jack Dongarra
%A Vincent Heuveline
%B SIAM Journal on Computing (submitted)
%8 2012-03
%G eng
%0 Journal Article
%D 2011
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%8 2011-12
%G eng
%0 Generic
%D 2011
%T A Block-Asynchronous Relaxation Method for Graphics Processing Units
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%B University of Tennessee Computer Science Technical Report
%8 2011-11
%G eng
%0 Generic
%D 2011
%T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%B University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260)
%8 2011-12
%G eng