%0 Generic
%D 2020
%T Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures
%A Florent Lopez
%A Edmond Chow
%A Stanimire Tomov
%A Jack Dongarra
%K Asynchronous iterative methods
%K Deep learning
%K gpu
%K multicore CPU
%K Stochastic Gradient Descent
%X We present a parallel asynchronous Stochastic Gradient Descent algorithm for shared memory architectures. Different from previous asynchronous algorithms, we consider the case where the gradient updates are not particularly sparse. In the context of the MagmaDNN framework, we compare the parallel efficiency of the asynchronous implementation with that of the traditional synchronous implementation. Tests are performed for training deep neural networks on multicore CPUs and GPU devices.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2020-03
%G eng
%0 Conference Paper
%B Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020)
%D 2020
%T Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures
%A Florent Lopez
%A Edmond Chow
%A Stanimire Tomov
%A Jack Dongarra
%B Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020)
%8 2020-05
%G eng
%0 Generic
%D 2020
%T Integrating Deep Learning in Domain Sciences at Exascale
%A Rick Archibald
%A Edmond Chow
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Markus Eisenbach
%A Rocco Febbo
%A Florent Lopez
%A Daniel Nichols
%A Stanimire Tomov
%A Kwai Wong
%A Junqi Yin
%X This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems e ciently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-08
%G eng
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T ParILUT – A Parallel Threshold ILU for GPUs
%A Hartwig Anzt
%A Tobias Ribizel
%A Goran Flegar
%A Edmond Chow
%A Jack Dongarra
%X In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored towards GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations, and outperform existing multithreaded CPU implementations.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPS.2019.00033
%0 Journal Article
%J Parallel Computing
%D 2019
%T Performance of Asynchronous Optimized Schwarz with One-sided Communication
%A Ichitaro Yamazaki
%A Edmond Chow
%A Aurelien Bouteiller
%A Jack Dongarra
%X In asynchronous iterative methods on distributed-memory computers, processes update their local solutions using data from other processes without an implicit or explicit global synchronization that corresponds to advancing the global iteration counter. In this work, we test the asynchronous optimized Schwarz domain-decomposition iterative method using various one-sided (remote direct memory access) communication schemes with passive target completion. The results show that when one-sided communication is well-supported, the asynchronous version of optimized Schwarz can outperform the synchronous version even for perfectly balanced partitionings of the problem on a supercomputer with uniform nodes.
%B Parallel Computing
%V 86
%P 66-81
%8 2019-08
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819118301261
%R https://doi.org/10.1016/j.parco.2019.05.004
%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2018
%T ParILUT - A New Parallel Threshold ILU
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X We propose a parallel algorithm for computing a threshold incomplete LU (ILU) factorization. The main idea is to interleave a parallel fixed-point iteration that approximates an incomplete factorization for a given sparsity pattern with a procedure that adjusts the pattern. We describe and test a strategy for identifying nonzeros to be added and nonzeros to be removed from the sparsity pattern. The resulting pattern may be different and more effective than that of existing threshold ILU algorithms. Also in contrast to other parallel threshold ILU algorithms, much of the new algorithm has fine-grained parallelism.
%B SIAM Journal on Scientific Computing
%I SIAM
%V 40
%P C503–C519
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1137/16M1079506
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2018
%T Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning
%A Edmond Chow
%A Hartwig Anzt
%A Jennifer Scott
%A Jack Dongarra
%X When using incomplete factorization preconditioners with an iterative method to solve large sparse linear systems, each application of the preconditioner involves solving two sparse triangular systems. These triangular systems are challenging to solve efficiently on computers with high levels of concurrency. On such computers, it has recently been proposed to use Jacobi iterations, which are highly parallel, to approximately solve the triangular systems from incomplete factorizations. The effectiveness of this approach, however, is problem-dependent: the Jacobi iterations may not always converge quickly enough for all problems. Thus, as a necessary and important step to evaluate this approach, we experimentally test the approach on a large number of realistic symmetric positive definite problems. We also show that by using block Jacobi iterations, we can extend the range of problems for which such an approach can be effective. For block Jacobi iterations, it is essential for the blocking to be cognizant of the matrix structure.
%B Journal of Parallel and Distributed Computing
%V 119
%P 219–230
%8 2018-11
%G eng
%R https://doi.org/10.1016/j.jpdc.2018.04.017
%0 Conference Proceedings
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2016
%T Batched Generation of Incomplete Sparse Approximate Inverses on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Thomas Huckle
%A Jack Dongarra
%X Incomplete Sparse Approximate Inverses (ISAI) have recently been shown to be an attractive alternative to exact sparse triangular solves in the context of incomplete factorization preconditioning. In this paper we propose a batched GPU-kernel for the efficient generation of ISAI matrices. Utilizing only thread-local memory allows for computing the ISAI matrix with very small memory footprint. We demonstrate that this strategy is faster than the existing strategy for generating ISAI matrices, and use a large number of test matrices to assess the algorithm's efficiency in an iterative solver setting.
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%S ScalA '16
%P 49–56
%8 2016-11
%@ 978-1-5090-5222-6
%G eng
%R 10.1109/ScalA.2016.11
%0 Generic
%D 2016
%T On block-asynchronous execution on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X This paper experimentally investigates how GPUs execute instructions when used for general purpose computing (GPGPU). We use a light-weight realizing a vector operation to analyze which vector entries are updated subsequently, and identify regions where parallel execution can be expected. The results help us to understand how GPUs operate, and map this operation mode to the mathematical concept of asynchronism. In particular it helps to understand the effects that can occur when implementing a fixed-point method using in-place updates on GPU hardware.
%B LAPACK Working Note
%8 2016-11
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn291.pdf
%0 Conference Proceedings
%B Software for Exascale Computing - SPPEXA
%D 2016
%T Domain Overlap for Iterative Sparse Triangular Solves on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%E Hans-Joachim Bungartz
%E Philipp Neumann
%E Wolfgang E. Nagel
%X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
%B Software for Exascale Computing - SPPEXA
%S Lecture Notes in Computer Science and Engineering
%I Springer International Publishing
%V 113
%P 527–545
%8 2016-09
%G eng
%R 10.1007/978-3-319-40528-5_24
%0 Journal Article
%J Numerical Algorithms
%D 2016
%T Updating Incomplete Factorization Preconditioners for Model Order Reduction
%A Hartwig Anzt
%A Edmond Chow
%A Jens Saak
%A Jack Dongarra
%K key publication
%X When solving a sequence of related linear systems by iterative methods, it is common to reuse the preconditioner for several systems, and then to recompute the preconditioner when the matrix has changed significantly. Rather than recomputing the preconditioner from scratch, it is potentially more efficient to update the previous preconditioner. Unfortunately, it is not always known how to update a preconditioner, for example, when the preconditioner is an incomplete factorization. A recently proposed iterative algorithm for computing incomplete factorizations, however, is able to exploit an initial guess, unlike existing algorithms for incomplete factorizations. By treating a previous factorization as an initial guess to this algorithm, an incomplete factorization may thus be updated. We use a sequence of problems from model order reduction. Experimental results using an optimized GPU implementation show that updating a previous factorization can be inexpensive and effective, making solving sequences of linear systems a potential niche problem for the iterative incomplete factorization algorithm.
%B Numerical Algorithms
%V 73
%P 611–630
%8 2016-02
%G eng
%N 3
%R 10.1007/s11075-016-0110-2
%0 Conference Paper
%B International Supercomputing Conference (ISC 2015)
%D 2015
%T Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs
%A Edmond Chow
%A Hartwig Anzt
%A Jack Dongarra
%B International Supercomputing Conference (ISC 2015)
%C Frankfurt, Germany
%8 2015-07
%G eng
%0 Conference Paper
%B EuroPar 2015
%D 2015
%T Iterative Sparse Triangular Solves for Preconditioning
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X Sparse triangular solvers are typically parallelized using level scheduling techniques, but parallel eciency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations.
%B EuroPar 2015
%I Springer Berlin
%C Vienna, Austria
%8 2015-08
%G eng
%U http://dx.doi.org/10.1007/978-3-662-48096-0_50
%R 10.1007/978-3-662-48096-0_50
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Random-Order Alternating Schwarz for Sparse Triangular Solves
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%X Block-asynchronous Jacobi is an iteration method where a locally synchronous iteration is embedded in an asynchronous global iteration. The unknowns are partitioned into small subsets, and while the components within the same subset are iterated in Jacobi fashion, no update order in-between the subsets is enforced. The values of the non-local entries remain constant during the local iterations, which can result in slow inter-subset information propagation and slow convergence. Interpreting of the subsets as subdomains allows to transfer the concept of domain overlap typically enhancing the information propagation to block-asynchronous solvers. In this talk we explore the impact of overlapping domains to convergence and performance of block-asynchronous Jacobi iterations, and present results obtained by running this solver class on state-of-the-art HPC systems.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng