%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2019
%T Solving Linear Diophantine Systems on Parallel Architectures
%A Dmitry Zaitsev
%A Stanimire Tomov
%A Jack Dongarra
%K Mathematical model
%K Matrix decomposition
%K Parallel architectures
%K Petri nets
%K Software algorithms
%K Sparse matrices
%K Task analysis
%X Solving linear Diophantine systems of equations has applications in discrete-event systems, model checking, formal languages and automata, logic programming, cryptography, networking, signal processing, and chemistry. For modeling discrete systems with Petri nets, a solution in non-negative integers is required, which is an intractable problem; significant speedups in solving such systems are therefore highly valuable. In this paper we design a new solver for linear Diophantine systems based on the parallel-sequential composition of the system's clans. The solver is studied and implemented to run on parallel architectures using a two-level parallelization concept based on MPI and OpenMP. A decomposable system is usually represented by a sparse matrix; the minimal clan size of the decomposition restricts the granularity of the technique. MPI is applied to solve the clans' systems via parallel-sequential composition on distributed-memory computing nodes, while OpenMP is applied to solve a single indecomposable system on a single node using multiple cores. A dynamic task-dispatching subsystem is developed to distribute systems across nodes during the compositional solution. Computational speedups are obtained on a series of test examples, with the best case reaching a 45x speedup on 5 nodes with 20 cores each.
%V 30
%P 1158-1169
%8 2019-05
%G eng
%U https://ieeexplore.ieee.org/document/8482295
%N 5
%R 10.1109/TPDS.2018.2873354
%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%I IEEE
%C Bristol, United Kingdom
%8 2017-08
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/
%R 10.1109/ICPP.2017.18
%0 Conference Proceedings
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2016
%T Efficiency of General Krylov Methods on GPUs – An Experimental Study
%A Hartwig Anzt
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K algorithmic bombardment
%K BiCGSTAB
%K CGS
%K Convergence
%K Electric breakdown
%K gpu
%K graphics processing units
%K Hardware
%K IDR(s)
%K Krylov solver
%K Libraries
%K linear systems
%K QMR
%K Sparse matrices
%X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods' performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are "orthogonal" in terms of problem suitability. We propose best practices for choosing methods in a "black box" scenario, where no information about the optimal solver is available.
%P 683-691
%8 2016-05
%G eng
%R 10.1109/IPDPSW.2016.45