%0 Journal Article
%J Parallel Computing
%D 2019
%T Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K Batched algorithms
%K Block-Jacobi
%K Gauss–Jordan elimination
%K Graphics processor
%K matrix inversion
%K sparse linear systems
%X In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
%B Parallel Computing
%V 81
%P 131-146
%8 2019-01
%G eng
%R https://doi.org/10.1016/j.parco.2017.12.006
%0 Conference Paper
%B SBAC-PAD
%D 2018
%T Variable-Size Batched Condition Number Calculation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Thomas Gruetzmacher
%B SBAC-PAD
%C Lyon, France
%8 2018-09
%G eng
%U https://ieeexplore.ieee.org/document/8645907
%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%A Andres E. Thomas
%X In this work we present new kernels for the generation and application of block-Jacobi precon-ditioners that accelerate the iterative solution of sparse linear systems on graphics processing units (GPUs). Our approach departs from the conventional LU factorization and decomposes the diagonal blocks of the matrix using the Gauss-Huard method. When enhanced with column pivoting, this method is as stable as LU with partial/row pivoting. Due to extensive use of GPU registers and integration of implicit pivoting, our variable size batched Gauss-Huard implementation outperforms the batched version of LU factorization. In addition, the application kernel combines the conventional two-stage triangular solve procedure, consisting of a backward solve followed by a forward solve, into a single stage that performs both operations simultaneously.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%V 108
%P 1783-1792
%8 2017-06
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.186
%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 2017-08
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18
%0 Conference Paper
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%D 2015
%T Visualizing Execution Traces with Task Dependencies
%A Blake Haugen
%A Stephen Richmond
%A Jakub Kurzak
%A Chad A. Steed
%A Jack Dongarra
%X Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule the task in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads. This paper presents a brief description of task-based schedulers and the performance data they produce. It will then describe an interactive extension to the current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler.
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 2013-05
%G eng
%R 10.1109/IPDPS.2013.119
%0 Conference Proceedings
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%D 2009
%T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
%A Lavanya Ramakrishan
%A Daniel Nurmi
%A Anirban Mandal
%A Charles Koelbel
%A Dennis Gannon
%A Mark Huang
%A Yang-Suk Kee
%A Graziano Obertelli
%A Kiran Thyagaraja
%A Rich Wolski
%A Asim YarKhan
%A Dmitrii Zagorodnov
%K grads
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%C Portland, OR
%8 2009-00
%G eng
%0 Conference Proceedings
%B Proc. 4th International Workshop on OpenMP (IWOMP 2008)
%D 2008
%T Visualizing the Program Execution Control Flow of OpenMP Applications
%A Karl Fürlinger
%A Shirley Moore
%B Proc. 4th International Workshop on OpenMP (IWOMP 2008)
%I Lecture Notes in Computer Science 5004
%C West Lafayette, Indiana
%P 181-190
%8 2008-01
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2004
%T The Virtual Instrument: Support for Grid-enabled Scientific Simulations
%A Henri Casanova
%A Thomas Bartol
%A Francine Berman
%A Adam Birnbaum
%A Jack Dongarra
%A Mark Ellisman
%A Marcio Faerman
%A Erhan Gockay
%A Michelle Miller
%A Graziano Obertelli
%A Stuart Pomerantz
%A Terry Sejnowski
%A Joel Stiles
%A Rich Wolski
%B International Journal of High Performance Computing Applications
%V 18
%P 3-17
%8 2004-01
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T VisPerf: Monitoring Tool for Grid Computing
%A DongWoo Lee
%A Jack Dongarra
%E R. S. Ramakrishna
%K netsolve
%B Lecture Notes in Computer Science
%I Springer Verlag, Heidelberg
%V 2659
%P 233-243
%8 2003-00
%G eng
%0 Journal Article
%J Journal of Parallel and Distributed Computing (submitted)
%D 2002
%T The Virtual Instrument: Support for Grid-enabled Scientific Simulations
%A Henri Casanova
%A Thomas Bartol
%A Francine Berman
%A Adam Birnbaum
%A Jack Dongarra
%A Mark Ellisman
%A Marcio Faerman
%A Erhan Gockay
%A Michelle Miller
%A Graziano Obertelli
%A Stuart Pomerantz
%A Terry Sejnowski
%A Joel Stiles
%A Rich Wolski
%B Journal of Parallel and Distributed Computing (submitted)
%8 2002-10
%G eng