%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers
%A Anzt, Hartwig
%A Dongarra, Jack
%A Flegar, Goran
%A Higham, Nicholas J.
%A Quintana-Orti, Enrique S.
%K adaptive precision
%K block-Jacobi preconditioning
%K communication reduction
%K energy efficiency
%K Krylov subspace methods
%K sparse linear systems
%X We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.
%B Concurrency and Computation: Practice and Experience
%V 31
%P e4460
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4460
%R 10.1002/cpe.4460
%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2019
%T Are we Doing the Right Thing? — A Critical Analysis of the Academic HPC Community
%A Anzt, Hartwig
%A Flegar, Goran
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Rio de Janeiro, Brazil
%G eng
%U https://ieeexplore.ieee.org/document/8778421/
%R 10.1109/IPDPSW.2019.00122
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T A Customized Precision Format Based on Mantissa Segmentation for Accelerating Sparse Linear Algebra
%A Thomas Gruetzmacher
%A Terry Cojean
%A Goran Flegar
%A Fritz Göbel
%A Hartwig Anzt
%B Concurrency and Computation: Practice and Experience
%8 01-2019
%G eng
%R 10.1002/cpe.5418
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Toward a Modular Precision Ecosystem for High-Performance Computing
%A Hartwig Anzt
%A Goran Flegar
%A Thomas Grützmacher
%A Enrique S. Quintana-Ortí
%B The International Journal of High Performance Computing Applications
%8 09-2019
%G eng
%R 10.1177/1094342019846547
%0 Conference Paper
%B Proceedings of the Platform for Advanced Scientific Computing Conference (PASC '19)
%D 2019
%T Towards Continuous Benchmarking
%A Anzt, Hartwig
%A Chen, Yen-Chen
%A Cojean, Terry
%A Dongarra, Jack
%A Flegar, Goran
%A Nayak, Pratik
%A Quintana-Orti, Enrique S.
%A Tsai, Yuhsiang M.
%A Wang, Weichung
%B Proceedings of the Platform for Advanced Scientific Computing Conference (PASC '19)
%I ACM Press
%C Zurich, Switzerland
%@ 9781450367707
%G eng
%U http://dl.acm.org/citation.cfm?doid=3324989.3325719
%R 10.1145/3324989.3325719
%0 Conference Paper
%B SBAC-PAD
%D 2018
%T Variable-Size Batched Condition Number Calculation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Thomas Gruetzmacher
%B SBAC-PAD
%C Lyon, France
%8 09-2018
%G eng
%0 Journal Article
%J Parallel Computing
%D 2018
%T Variable-Size Batched Gauss–Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%K Batched algorithms
%K Block-Jacobi
%K Gauss–Jordan elimination
%K Graphics processor
%K matrix inversion
%K sparse linear systems
%X In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
%B Parallel Computing
%8 01-2018
%G eng
%R 10.1016/j.parco.2017.12.006
%0 Conference Proceedings
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%D 2017
%T Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%K block-Jacobi preconditioner
%K Gauss-Jordan elimination
%K graphics processing units (GPUs)
%K iterative methods
%K matrix inversion
%K sparse linear systems
%X In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%S PMAM'17
%I ACM
%C New York, NY, USA
%P 1–10
%8 02-2017
%@ 978-1-4503-4883-6
%G eng
%U http://doi.acm.org/10.1145/3026937.3026940
%R 10.1145/3026937.3026940
%0 Conference Paper
%B Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%D 2017
%T Flexible Batched Sparse Matrix-Vector Product on GPUs
%A Hartwig Anzt
%A Collins, Gary
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%Y Alexandrov, Vassil
%Y Al Geist
%Y Jack Dongarra
%X We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support the three most commonly used sparse storage formats: CSR, COO, and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25x compared to non-batched SpMV routines.
%B Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%I ACM Press
%C Denver, Colorado
%8 11-2017
%@ 9781450351256
%G eng
%U https://dl.acm.org/citation.cfm?id=3148230
%R 10.1145/3148226.3148230
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Michael Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as a third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as a static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 09-2017
%G eng
%9 Technical Report
%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%A Andres E. Thomas
%X In this work we present new kernels for the generation and application of block-Jacobi preconditioners that accelerate the iterative solution of sparse linear systems on graphics processing units (GPUs). Our approach departs from the conventional LU factorization and decomposes the diagonal blocks of the matrix using the Gauss-Huard method. When enhanced with column pivoting, this method is as stable as LU with partial/row pivoting. Due to extensive use of GPU registers and integration of implicit pivoting, our variable-size batched Gauss-Huard implementation outperforms the batched version of LU factorization. In addition, the application kernel combines the conventional two-stage triangular solve procedure, consisting of a backward solve followed by a forward solve, into a single stage that performs both operations simultaneously.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%V 108
%P 1783-1792
%8 06-2017
%G eng
%R 10.1016/j.procs.2017.05.186
%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 08-2017
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18