%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Toward a Modular Precision Ecosystem for High-Performance Computing
%A Hartwig Anzt
%A Goran Flegar
%A Thomas Grützmacher
%A Enrique S. Quintana-Ortí
%B The International Journal of High Performance Computing Applications
%8 09-2019
%G eng
%R 10.1177/1094342019846547
%0 Journal Article
%J Concurrency Computation: Practice and Experience
%D 2018
%T Adaptive Precision in Block‐Jacobi Preconditioning for Iterative Sparse Linear System Solvers
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Nicholas J. Higham
%A Enrique S. Quintana-Ortí
%K adaptive precision
%K block‐Jacobi preconditioning
%K communication reduction
%K energy efficiency
%K Krylov subspace methods
%K sparse linear systems
%X We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block‐Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth‐bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block‐Jacobi preconditioning scheme.
%B Concurrency Computation: Practice and Experience
%8 03-2018
%G eng
%U http://www.netlib.org/utk/people/JackDongarra/PAPERS/Anzt_et_al-2018-Concurrency.pdf
%R https://doi.org/10.1002/cpe.4460
%0 Journal Article
%J Parallel Computing
%D 2018
%T Variable-Size Batched Gauss–Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%K Batched algorithms
%K Block-Jacobi
%K Gauss–Jordan elimination
%K Graphics processor
%K matrix inversion
%K sparse linear systems
%X In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
%B Parallel Computing
%8 01-2018
%G eng
%R https://doi.org/10.1016/j.parco.2017.12.006
%0 Conference Proceedings
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%D 2017
%T Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%K block-Jacobi preconditioner
%K Gauss-Jordan elimination
%K graphics processing units (GPUs)
%K iterative methods
%K matrix inversion
%K sparse linear systems
%X In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%S PMAM'17
%I ACM
%C New York, NY, USA
%P 1–10
%8 02-2017
%@ 978-1-4503-4883-6
%G eng
%U http://doi.acm.org/10.1145/3026937.3026940
%R 10.1145/3026937.3026940
%0 Generic
%D 2017
%T Flexible Batched Sparse Matrix Vector Product on GPUs
%A Hartwig Anzt
%A Collins, Gary
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%I ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C Denver, Colorado
%8 11-2017
%G eng
%0 Conference Paper
%B Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%D 2017
%T Flexible Batched Sparse Matrix-Vector Product on GPUs
%A Hartwig Anzt
%A Collins, Gary
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%Y Alexandrov, Vassil
%Y Al Geist
%Y Jack Dongarra
%X We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support three most commonly used sparse storage formats: CSR, COO and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.
%B Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%I ACM Press
%C Denver, Colorado
%8 11-2017
%@ 9781450351256
%G eng
%U https://dl.acm.org/citation.cfm?id=3148230&CFID=1011405188&CFTOKEN=85357784
%R 10.1145/3148226.3148230
%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%A Andres E. Thomas
%X In this work we present new kernels for the generation and application of block-Jacobi precon-ditioners that accelerate the iterative solution of sparse linear systems on graphics processing units (GPUs). Our approach departs from the conventional LU factorization and decomposes the diagonal blocks of the matrix using the Gauss-Huard method. When enhanced with column pivoting, this method is as stable as LU with partial/row pivoting. Due to extensive use of GPU registers and integration of implicit pivoting, our variable size batched Gauss-Huard implementation outperforms the batched version of LU factorization. In addition, the application kernel combines the conventional two-stage triangular solve procedure, consisting of a backward solve followed by a forward solve, into a single stage that performs both operations simultaneously.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%V 108
%P 1783-1792
%8 06-2017
%G eng
%U https://doi.org/10.1016/j.procs.2017.05.186
%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Ortí
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 08-2017
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18
%0 Journal Article
%J Journal of Computational Science
%D 2016
%T Fine-grained Bit-Flip Protection for Relaxation Methods
%A Hartwig Anzt
%A Jack
%A Enrique S. Quintana-Ortí
%K Bit flips
%K Fault tolerance
%K High Performance Computing
%K iterative solvers
%K Jacobi method
%K sparse linear systems
%X Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
%B Journal of Computational Science
%8 11-2016
%G eng
%R 10.1016/j.jocs.2016.11.013
%0 Conference Paper
%B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15)
%D 2015
%T Adaptive Precision Solvers for Sparse Linear Systems
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Ortí
%B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15)
%I ACM
%C Austin, TX
%8 11-2015
%G eng
%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15)
%D 2015
%T Tuning Stationary Iterative Solvers for Fault Resilience
%A Hartwig Anzt
%A Jack Dongarra
%A Enrique S. Quintana-Ortí
%X As the transistor’s feature size decreases following Moore’s Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications, and diminishing the confidence of users. As a result, resilience is considered the most difficult under addressed issue faced by the High Performance Computing community. In this paper, we address the design of error resilient iterative solvers for sparse linear systems. Contrary to most previous approaches, based on Krylov subspace methods, for this purpose we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15)
%I ACM
%C Austin, TX
%8 11-2015
%G eng
%0 Journal Article
%J Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences
%D 2014
%T Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems
%A Hartwig Anzt
%A Enrique S. Quintana-Ortí
%K energy efficiency
%K graphics processing units
%K High Performance Computing
%K iterative solvers
%K multicore processors
%K sparse linear systems
%X While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draft and energy spent for this purpose is increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draft as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyze the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance.
%B Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences
%V 372
%8 07-2014
%G eng
%N 2018
%R 10.1098/rsta.2013.0279
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors
%A José I. Aliaga
%A Hartwig Anzt
%A Maribel Castillo
%A Juan C. Fernández
%A Germán León
%A Joaquín Pérez
%A Enrique S. Quintana-Ortí
%K CG
%K CPUs
%K energy efficiency
%K GPUs
%K low-power architectures
%X In this paper, we analyze the interactions occurring in the triangle performance-power-energy for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to build faster processors. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights about the benefits of hands-on optimizations we include runtime and energy efficiency results for both out-of-the-box usage relying exclusively on compiler optimizations, and implementations manually optimized for target architectures, that range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 885-904
%8 09-2014
%G eng
%U http://dx.doi.org/10.1002/cpe.3341
%N 4
%& 885
%R 10.1002/cpe.3341