%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2021
%T Accelerating Restarted GMRES with Mixed Precision Arithmetic
%A Lindquist, Neil
%A Luszczek, Piotr
%A Dongarra, Jack
%K Convergence
%K Error correction
%K iterative methods
%K Kernel
%K Lifting equipment
%K linear systems
%K Stability analysis
%X The generalized minimum residual method (GMRES) is a commonly used iterative Krylov solver for sparse, non-symmetric systems of linear equations. Like other iterative solvers, data movement dominates its run time. To improve this performance, we propose running GMRES in reduced precision with key operations remaining in full precision. Additionally, we provide theoretical results linking the convergence of finite precision GMRES with classical Gram-Schmidt with reorthogonalization (CGSR) and its infinite precision counterpart which helps justify the convergence of this method to double-precision accuracy. We tested the mixed-precision approach with a variety of matrices and preconditioners on a GPU-accelerated node. Excluding the incomplete LU factorization without fill in (ILU(0)) preconditioner, we achieved average speedups ranging from 8 to 61 percent relative to comparable double-precision implementations, with the simpler preconditioners achieving the higher speedups.
%B IEEE Transactions on Parallel and Distributed Systems
%G eng
%R 10.1109/TPDS.2021.3090757
%0 Conference Paper
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%D 2020
%T Replacing Pivoting in Distributed Gaussian Elimination with Randomized Techniques
%A Neil Lindquist
%A Piotr Luszczek
%A Jack Dongarra
%K linear systems
%K Randomized algorithms
%X Gaussian elimination is a key technique for solving dense, non-symmetric systems of linear equations. Pivoting is used to ensure numerical stability but can introduce significant overheads. We propose replacing pivoting with recursive butterfly transforms (RBTs) and iterative refinement. RBTs use an FFT-like structure and randomized elements to provide an efficient, two-sided preconditioner for factoring. This approach was implemented and tested using Software for Linear Algebra Targeting Exascale (SLATE). In numerical experiments, our implementation was more robust than Gaussian elimination with no pivoting (GENP) but failed to solve all the problems solvable with Gaussian elimination with partial pivoting (GEPP). Furthermore, the proposed solver was able to outperform GEPP when distributed on GPU-accelerated nodes.
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%I IEEE
%C Atlanta, GA
%8 2020-11
%G eng
%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 2017-08
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18
%0 Conference Proceedings
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2016
%T Efficiency of General Krylov Methods on GPUs – An Experimental Study
%A Hartwig Anzt
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K algorithmic bombardment
%K BiCGSTAB
%K CGS
%K Convergence
%K Electric breakdown
%K gpu
%K graphics processing units
%K Hardware
%K IDR(s)
%K Krylov solver
%K Libraries
%K linear systems
%K QMR
%K Sparse matrices
%X This paper compares different Krylov methods based on short recurrences with respect to their efficiency whenimplemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergencecharacteristics. For a large set of test matrices taken from theUniversity of Florida Matrix Collection, we evaluate the methods'performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and executiontime. We also analyze whether the methods are "orthogonal"in terms of problem suitability. We propose best practicesfor choosing methods in a "black box" scenario, where noinformation about the optimal solver is available.
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%P 683-691
%8 2016-05
%G eng
%R 10.1109/IPDPSW.2016.45
%0 Journal Article
%J ACM Transactions on Mathematical Software (also LAWN 246)
%D 2013
%T Accelerating Linear System Solutions Using Randomization Techniques
%A Marc Baboulin
%A Jack Dongarra
%A Julien Herrmann
%A Stanimire Tomov
%K algorithms
%K dense linear algebra
%K experimentation
%K graphics processing units
%K linear systems
%K lu factorization
%K multiplicative preconditioning
%K numerical linear algebra
%K performance
%K randomization
%X We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at a very affordable computational price while providing us with a satisfying accuracy when compared to partial pivoting. This random transformation called Partial Random Butterfly Transformation (PRBT) is optimized in terms of data storage and flops count. We propose a solver where PRBT and the LU factorization with no pivoting take advantage of the current hybrid multicore/GPU machines and we compare its Gflop/s performance with a solver implemented in a current parallel library.
%B ACM Transactions on Mathematical Software (also LAWN 246)
%V 39
%8 2013-02
%G eng
%U http://dl.acm.org/citation.cfm?id=2427025
%N 2
%R 10.1145/2427023.2427025