%0 Journal Article
%J Parallel Computing
%D 2019
%T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Marc Baboulin
%A Joël Falcou
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K HPC
%K Matrix-matrix product
%K optimization
%K Small matrices
%X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1,600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
%B Parallel Computing
%V 81
%P 1-21
%8 2019-01
%G eng
%R 10.1016/j.parco.2018.10.003

%0 Generic
%D 2018
%T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Marc Baboulin
%A Joël Falcou
%A Jack Dongarra
%X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1,600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
%B Innovative Computing Laboratory Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng

%0 Generic
%D 2017
%T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation
%C Atlanta, GA
%8 2017-03
%G eng

%0 Generic
%D 2017
%T Small Tensor Operations on Advanced Architectures for High-Order Applications
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%B University of Tennessee Computer Science Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-04
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2017
%T Solving Dense Symmetric Indefinite Systems using GPUs
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%X This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent synchronizations and irregular data accesses. Until recently, there has not been any implementation of these algorithms on a hybrid CPU/GPU architecture. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive data transfer and synchronization between the CPU and GPU, or on the GPU (e.g., factorizing the matrix entirely on the GPU or in a communication-avoiding fashion). We also study the performance of the solver using iterative refinement along with the factorization without pivoting combined with the preprocessing technique based on random butterfly transformations, or with the mixed-precision algorithm where the matrix is factorized in single precision. The randomization algorithm has only a probabilistic proof of numerical stability, and in this paper we consider the mixed-precision algorithm only without pivoting. However, these approaches demonstrate that we can obtain good performance on the GPU by avoiding pivoting and by using lower-precision arithmetic, respectively. As illustrated with the application in acoustics studied in this paper, in many practical cases the matrices can be factorized without pivoting. Because the componentwise backward error computed in the iterative refinement signals when the algorithm fails to obtain the desired accuracy, the user can rely on these potentially unstable but efficient algorithms in most cases and fall back to a more stable algorithm with pivoting only in case of failure.
%B Concurrency and Computation: Practice and Experience
%V 29
%8 2017-03
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.4055/full
%N 9
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.4055

%0 Journal Article
%J VECPAR
%D 2016
%T Accelerating the Conjugate Gradient Algorithm with GPU in CFD Simulations
%A Hartwig Anzt
%A Marc Baboulin
%A Jack Dongarra
%A Yvan Fournier
%A Frank Hulsemann
%A Amal Khabou
%A Yushan Wang
%X This paper illustrates how GPU computing can be used to accelerate computational fluid dynamics (CFD) simulations. For sparse linear systems arising from finite volume discretization, we evaluate and optimize the performance of Conjugate Gradient (CG) routines designed for manycore accelerators and compare against an industrial CPU-based implementation. We also investigate how recent advances in preconditioning, such as iterative Incomplete Cholesky (IC, the symmetric case of ILU) preconditioning, match the requirements for solving real-world problems.
%B VECPAR
%G eng
%U http://hgpu.org/?p=16264

%0 Book Section
%B Lecture Notes in Computer Science
%D 2016
%T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%E Roman Wyrzykowski
%E Ewa Deelman
%E Konrad Karczewski
%E Jacek Kitowski
%E Kazimierz Wiatr
%K Communication-avoiding
%K Dense symmetric indefinite factorization
%K gpu computation
%K randomization
%X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen's algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with the preprocessing technique based on Random Butterfly Transformations. Though such transformations provide only probabilistic results on numerical stability, they avoid pivoting and achieve high performance on the GPU.
%S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I
%I Springer International Publishing
%V 9573
%P 86-95
%8 2015-09
%@ 978-3-319-32149-3
%G eng
%& Parallel Processing and Applied Mathematics
%R 10.1007/978-3-319-32149-3_9

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%K Applications
%K Batched linear algebra
%K FEM
%K gpu
%K Tensor contractions
%K Tensor HPC
%X We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions and applying application-specific optimizations, we demonstrate close-to-peak performance results. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (vs. not using it) due to the possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%C San Diego, CA
%8 2016-06
%G eng
%0 Generic
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%X We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions and applying application-specific optimizations, we demonstrate close-to-peak performance results. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (vs. not using it) due to the possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-01
%G eng

%0 Generic
%D 2015
%T Towards a High-Performance Tensor Algebra Package for Accelerators
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I Smoky Mountains Computational Sciences and Engineering Conference (SMC15)
%C Gatlinburg, TN
%8 2015-09
%G eng

%0 Conference Paper
%B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS)
%D 2014
%T Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems
%A Marc Baboulin
%A Jack Dongarra
%A Remi Lacroix
%X This paper presents an efficient computation of least squares conditioning, or estimates of it. We present performance results for new routines built on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix, for which, to our knowledge, there is no implementation in the current public-domain libraries LAPACK and ScaLAPACK.
%C Waterloo, Ontario, CA
%8 2014-08
%G eng
%0 Journal Article
%J Parallel Computing
%D 2014
%T An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems
%A Marc Baboulin
%A Dulceneia Becker
%A George Bosilca
%A Anthony Danalis
%A Jack Dongarra
%K Distributed linear algebra solvers
%K LDLT factorization
%K PaRSEC runtime
%K plasma
%K Randomized algorithms
%K Symmetric indefinite systems
%X Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, but requires only half the execution time and half the amount of data storage of a general dense solver.
%B Parallel Computing
%V 40
%P 213-223
%8 2014-07
%G eng
%N 7
%R 10.1016/j.parco.2013.12.003

%0 Journal Article
%J ACM Transactions on Mathematical Software (also LAWN 246)
%D 2013
%T Accelerating Linear System Solutions Using Randomization Techniques
%A Marc Baboulin
%A Jack Dongarra
%A Julien Herrmann
%A Stanimire Tomov
%K algorithms
%K dense linear algebra
%K experimentation
%K graphics processing units
%K linear systems
%K lu factorization
%K multiplicative preconditioning
%K numerical linear algebra
%K performance
%K plasma
%K randomization
%X We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at a very affordable computational price while providing us with a satisfying accuracy when compared to partial pivoting. This random transformation, called the Partial Random Butterfly Transformation (PRBT), is optimized in terms of data storage and flop count. We propose a solver in which PRBT and LU factorization with no pivoting take advantage of current hybrid multicore/GPU machines, and we compare its Gflop/s performance with that of a solver implemented in a current parallel library.
%B ACM Transactions on Mathematical Software (also LAWN 246)
%V 39
%8 2013-02
%G eng
%U http://dl.acm.org/citation.cfm?id=2427025
%N 2
%R 10.1145/2427023.2427025
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2013)
%D 2013
%T A Parallel Solver for Incompressible Fluid Flows
%A Yushan Wang
%A Marc Baboulin
%A Joël Falcou
%A Yann Fraigneau
%A Olivier Le Maître
%K ADI
%K Navier-Stokes equations
%K Parallel computing
%K Partial diagonalization
%K Prediction-projection
%K SIMD
%X The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlinearity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply the prediction-projection method, which transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we apply the Alternating Direction Implicit (ADI) method, resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization, which transforms the Laplacian operator into a tridiagonal one. We describe an implementation based on MPI, where the computations are performed on each subdomain and information is exchanged at the interfaces, and where the tridiagonal system solutions are accelerated using vectorization techniques. We present performance results on a current multicore system.
%I Elsevier B.V.
%C Barcelona, Spain
%8 2013-06
%G eng
%R 10.1016/j.procs.2013.05.207

%0 Conference Proceedings
%B Proc. of the International Conference on Computational Science (ICCS)
%D 2012
%T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines
%A Marc Baboulin
%A Simplice Donfack
%A Jack Dongarra
%A Laura Grigori
%A Adrien Remy
%A Stanimire Tomov
%K magma
%V 9
%P 17-26
%8 2012-06
%G eng

%0 Generic
%D 2012
%T An Efficient Distributed Randomized Solver with Application to Large Dense Linear Systems
%A Marc Baboulin
%A Dulceneia Becker
%A George Bosilca
%A Anthony Danalis
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B ICL Technical Report
%8 2012-07
%G eng

%0 Journal Article
%J IPDPS 2012
%D 2012
%T A Parallel Tiled Solver for Symmetric Indefinite Systems on Multicore Architectures
%A Marc Baboulin
%A Dulceneia Becker
%A Jack Dongarra
%B IPDPS 2012
%C Shanghai, China
%8 2012-05
%G eng

%0 Journal Article
%J Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011)
%D 2012
%T Reducing the Amount of Pivoting in Symmetric Indefinite Systems
%A Dulceneia Becker
%A Marc Baboulin
%A Jack Dongarra
%E Roman Wyrzykowski
%E Jack Dongarra
%E Konrad Karczewski
%E Jerzy Wasniewski
%B Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011)
%I Springer-Verlag Berlin Heidelberg
%V 7203
%P 133-142
%8 2012-00
%G eng

%0 Journal Article
%J INRIA RR-7616 / LAWN #246 (presented at International AMMCS'11)
%D 2011
%T Accelerating Linear System Solutions Using Randomization Techniques
%A Marc Baboulin
%A Jack Dongarra
%A Julien Herrmann
%A Stanimire Tomov
%K magma
%B INRIA RR-7616 / LAWN #246 (presented at International AMMCS'11)
%C Waterloo, Ontario, Canada
%8 2011-07
%G eng

%0 Generic
%D 2011
%T A Parallel Tiled Solver for Dense Symmetric Indefinite Systems on Multicore Architectures
%A Marc Baboulin
%A Dulceneia Becker
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report
%8 2011-10
%G eng

%0 Generic
%D 2011
%T Reducing the Amount of Pivoting in Symmetric Indefinite Systems
%A Dulceneia Becker
%A Marc Baboulin
%A Jack Dongarra
%B University of Tennessee Innovative Computing Laboratory Technical Report
%I Submitted to PPAM 2011
%C Knoxville, TN
%8 2011-05
%G eng

%0 Journal Article
%J Parallel Computing
%D 2010
%T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
%A Stanimire Tomov
%A Jack Dongarra
%A Marc Baboulin
%K magma
%B Parallel Computing
%V 36
%P 232-240
%8 2010-00
%G eng

%0 Journal Article
%J Computer Physics Communications
%D 2009
%T Accelerating Scientific Computations with Mixed Precision Algorithms
%A Marc Baboulin
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%X On modern architectures, 32-bit floating-point operations are often at least twice as fast as 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphics Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
%B Computer Physics Communications
%V 180
%P 2526-2533
%8 2009-12
%G eng
%N 12
%R 10.1016/j.cpc.2008.11.005

%0 Journal Article
%J Numerical Linear Algebra with Applications
%D 2009
%T Computing the Conditioning of the Components of a Linear Least-Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B Numerical Linear Algebra with Applications
%V 16
%P 517-533
%8 2009-00
%G eng

%0 Journal Article
%J VECPAR '08, High Performance Computing for Computational Science
%D 2008
%T Computing the Conditioning of the Components of a Linear Least Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B VECPAR '08, High Performance Computing for Computational Science
%C Toulouse, France
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project)
%A Marc Baboulin
%A James Demmel
%A Jack Dongarra
%A Stanimire Tomov
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08)
%C Austin, TX
%8 2008-11
%G eng

%0 Conference Proceedings
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%D 2008
%T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures
%A Marc Baboulin
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%C Trondheim, Norway
%8 2008-05
%G eng

%0 Generic
%D 2008
%T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Stanimire Tomov
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200)
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
%A Stanimire Tomov
%A Jack Dongarra
%A Marc Baboulin
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210)
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Using Dual Techniques to Derive Componentwise and Mixed Condition Numbers for a Linear Functional of a Linear Least Squares Solution
%A Marc Baboulin
%A Serge Gratton
%B University of Tennessee Computer Science Technical Report, UT-CS-08-622 (also LAPACK Working Note 207)
%8 2008-01
%G eng

%0 Generic
%D 2007
%T Computing the Conditioning of the Components of a Linear Least Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B University of Tennessee Computer Science Technical Report
%8 2007-01
%G eng