%0 Conference Proceedings %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %D 2022 %T Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers %A Ahmad Abdelfattah %A Pieter Ghysels %A Wajih Boukaram %A Stanimire Tomov %A Xiaoye Sherry Li %A Jack Dongarra %K GPU computing %K irregular computational workloads %K lu factorization %K multifrontal solvers %K sparse direct solvers %X Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs. This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem. %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %I IEEE Computer Society %C Dallas, TX %P 354-367 %8 2022-11 %G eng %U https://dl.acm.org/doi/abs/10.5555/3571885.3571919 %0 Generic %D 2022 %T PAQR: Pivoting Avoiding QR factorization %A Wissam M. Sid-Lakhdar %A Sebastien Cayrols %A Daniel Bielich %A Ahmad Abdelfattah %A Piotr Luszczek %A Mark Gates %A Stanimire Tomov %A Hans Johansen %A David Williams-Young %A Timothy A. Davis %A Jack Dongarra %X The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is deficient. On the other hand, QR with column pivoting (QRCP) is able to produce smaller forward errors on deficient matrices, but its cost is prohibitive compared to QR. The aim of this paper is to propose PAQR, an alternative solution method with the same cost (or smaller) as QR and as accurate as QRCP in practical cases, for the solution of rank-deficient linear least-squares problems. After presenting the algorithm and its implementations on different architectures, we compare its accuracy and performance results on a variety of application problems. %B ICL Technical Report %8 2022-06 %G eng %0 Journal Article %J Journal of Open Source Software %D 2021 %T libCEED: Fast algebra for high-order element-based discretizations %A Jed Brown %A Ahmad Abdelfattah %A Valeria Barra %A Natalie Beams %A Jean-Sylvain Camier %A Veselin Dobrev %A Yohann Dudouit %A Leila Ghaffari %A Tzanio Kolev %A David Medina %A Will Pazner %A Thilina Ratnayaka %A Jeremy Thompson %A Stanimire Tomov %K finite elements %K high-order methods %K High-performance computing %K matrix-free %K spectral elements %X Finite element methods are widely used to solve partial differential equations (PDE) in science and engineering, but their standard implementation (Arndt et al., 2020; Kirk et al., 2006; Logg et al., 2012) relies on assembling sparse matrices. Sparse matrix multiplication and triangular operations perform a scalar multiply and add for each nonzero entry, just 2 floating point operations (flops) per scalar that must be loaded from memory (Williams et al., 2009). Modern hardware is capable of nearly 100 flops per scalar streamed from memory (Rupp, 2020) so sparse matrix operations cannot achieve more than about 2% utilization of arithmetic units. Matrix assembly becomes even more problematic when the polynomial degree p of the basis functions is increased, resulting in O(pd) storage and O(p2d) compute per degree of freedom (DoF) in d dimensions. Methods pioneered by the spectral element community (Deville et al., 2002; Orszag, 1980) exploit problem structure to reduce costs to O(1) storage and O(p) compute per DoF, with very high utilization of modern CPUs and GPUs. Unfortunately, highquality implementations have been relegated to applications and intrusive frameworks that are often difficult to extend to new problems or incorporate into legacy applications, especially when strong preconditioners are required. libCEED, the Code for Efficient Extensible Discretization (Abdelfattah et al., 2021), is a lightweight library that provides a purely algebraic interface for linear and nonlinear operators and preconditioners with element-based discretizations. libCEED provides portable performance via run-time selection of implementations optimized for CPUs and GPUs, including support for just-in-time (JIT) compilation. It is designed for convenient use in new and legacy software, and offers interfaces in C99 (International Standards Organisation, 1999), Fortran77 (ANSI, 1978), Python (Python, 2021), Julia (Bezanson et al., 2017), and Rust (Rust, 2021). Users and library developers can integrate libCEED at a low level into existing applications in place of existing matrix-vector products without significant refactoring of their own discretization infrastructure. Alternatively, users can utilize integrated libCEED support in MFEM (Anderson et al., 2020; MFEM, 2021). In addition to supporting applications and discretization libraries, libCEED provides a platform for performance engineering and co-design, as well as an algebraic interface for solvers research like adaptive p-multigrid, much like how sparse matrix libraries enable development and deployment of algebraic multigrid solvers %B Journal of Open Source Software %V 6 %P 2945 %G eng %U https://doi.org/10.21105/joss.02945 %R 10.21105/joss.02945 %0 Generic %D 2021 %T SLATE Port to AMD and Intel Platforms %A Ahmad Abdelfattah %A Mohammed Al Farhan %A Cade Brown %A Mark Gates %A Dalal Sukkari %A Asim YarKhan %A Jack Dongarra %B SLATE Working Notes %8 2021-04 %G eng %0 Generic %D 2020 %T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %K AMD GPUs %K GPU computing %K HIP Runtime %K HPC %K numerical linear algebra %K Portability %X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the HeterogeneousComputing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2020-08 %G eng %0 Conference Paper %B 2020 IEEE High Performance Extreme Computing Virtual Conference %D 2020 %T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the HeterogeneousComputing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements. %B 2020 IEEE High Performance Extreme Computing Virtual Conference %I IEEE %8 2020-09 %G eng %0 Conference Paper %B 2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) %D 2020 %T Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations %A Hartwig Anzt %A Yuhsiang M. Tsai %A Ahmad Abdelfattah %A Terry Cojean %A Jack Dongarra %K Batched linear algebra %K NVIDIA A100 GPU %K sparse linear algebra %K Sparse Matrix Vector Product %X GPU accelerators have become an important backbone for scientific high performance-computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA’s newest server-line GPU, the A100 architecture, part of the Ampere generation. Specifically, we assess its performance for sparse and batch computations, as these routines are relied upon in many scientific applications, and compare to the p %B 2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) %I IEEE %8 2020-11 %G eng %0 Conference Paper %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %D 2020 %T High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs %A Natalie Beams %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %A Tzanio Kolev %A Yohann Dudouit %K Batched linear algebra %K finite elements %K gpu %K high-order methods %K matrix-free FEM %K Tensor contractions %X We present new GPU implementations of the tensor contractions arising from basis-related computations for highorder finite element methods. We consider both tensor and nontensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% for some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMMs implementation is twice as fast as what was previously available for basis function order greater than five and greater than approximately 105 degrees of freedom in the mesh; up to ten times speedup is seen for eighth-order basis functions. %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %I IEEE %8 2020-11 %G eng %0 Generic %D 2020 %T hipMAGMA v1.0 %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %I Zenodo %8 2020-03 %G eng %U https://doi.org/10.5281/zenodo.3908549 %R 10.5281/zenodo.3908549 %0 Generic %D 2020 %T hipMAGMA v2.0 %A Cade Brown %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %I Zenodo %8 2020-07 %G eng %U https://doi.org/10.5281/zenodo.3928667 %R 10.5281/zenodo.3928667 %0 Conference Paper %B International Conference on Computational Science (ICCS 2020) %D 2020 %T Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices using GPUs %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X Half-precision computation refers to performing floating-point operations in a 16-bit format. While half-precision has been driven largely by machine learning applications, recent algorithmic advances in numerical linear algebra have discovered beneficial use cases for half precision in accelerating the solution of linear systems of equations at higher precisions. In this paper, we present a high-performance, mixed-precision linear solver (Ax = b) for symmetric positive definite systems in double-precision using graphics processing units (GPUs). The solver is based on a mixed-precision Cholesky factorization that utilizes the high-performance tensor core units in CUDA-enabled GPUs. Since the Cholesky factors are affected by the low precision, an iterative refinement (IR) solver is required to recover the solution back to double-precision accuracy. Two different types of IR solvers are discussed on a wide range of test matrices. A preprocessing step is also developed, which scales and shifts the matrix, if necessary, in order to preserve its positive-definiteness in lower precisions. Our experiments on the V100 GPU show that performance speedups are up to 4.7× against a direct double-precision solver. However, matrix properties such as the condition number and the eigenvalue distribution can affect the convergence rate, which would consequently affect the overall performance. %B International Conference on Computational Science (ICCS 2020) %I Springer, Cham %C Amsterdam, Netherlands %8 2020-06 %G eng %R https://doi.org/10.1007/978-3-030-50417-5_18 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2020 %T MAGMA Templates for Scalable Linear Algebra on Emerging Architectures %A Mohammed Al Farhan %A Ahmad Abdelfattah %A Stanimire Tomov %A Mark Gates %A Dalal Sukkari %A Azzam Haidar %A Robert Rosenberg %A Jack Dongarra %X With the acquisition and widespread use of more resources that rely on accelerator/wide vector–based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API. %B The International Journal of High Performance Computing Applications %V 34 %P 645-658 %8 2020-11 %G eng %N 6 %R https://doi.org/10.1177/1094342020938421 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2020 %T Matrix Multiplication on Batches of Small Matrices in Half and Half-Complex Precisions %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations—in particular general matrix–matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also very important for dense linear algebra algorithms, and half-precision GEMM operations can be used in mixed-precision linear solvers. Therefore, high-performance batched GEMM operations in reduced precision are significantly important, not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers. This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic. The paper addresses both real and complex half-precision computations on the GPU. The proposed design takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, performance speedups are observed against cuBLAS for sizes up to 128, and range between and . For the complex FP16 GEMM kernel, the speedups are between and thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor’s solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable. %B Journal of Parallel and Distributed Computing %V 145 %P 188-201 %8 2020-11 %G eng %R https://doi.org/10.1016/j.jpdc.2020.07.001 %0 Journal Article %J ACM Transactions on Mathematical Software %D 2020 %T A Set of Batched Basic Linear Algebra Subprograms %A Ahmad Abdelfattah %A Timothy Costa %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Sven Hammarling %A Nicholas J. Higham %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Mawussi Zounon %X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning. %B ACM Transactions on Mathematical Software %8 2020-10 %G eng %0 Generic %D 2020 %T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic %A Ahmad Abdelfattah %A Hartwig Anzt %A Erik Boman %A Erin Carson %A Terry Cojean %A Jack Dongarra %A Mark Gates %A Thomas Gruetzmacher %A Nicholas J. Higham %A Sherry Li %A Neil Lindquist %A Yang Liu %A Jennifer Loe %A Piotr Luszczek %A Pratik Nayak %A Sri Pranesh %A Siva Rajamanickam %A Tobias Ribizel %A Barry Smith %A Kasia Swirydowicz %A Stephen Thomas %A Stanimire Tomov %A Yaohung Tsai %A Ichitaro Yamazaki %A Urike Meier Yang %B SLATE Working Notes %I University of Tennessee %8 2020-07 %G eng %9 SLATE Working Notes %0 Journal Article %J Parallel Computing %D 2019 %T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Marc Baboulin %A Joël Falcou %A Jack Dongarra %K Autotuning %K Batched GEMM %K HPC %K Matrix-matrix product %K optimization %K Small matrices %X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. %B Parallel Computing %V 81 %P 1–21 %8 2019-01 %G eng %R https://doi.org/10.1016/j.parco.2018.10.003 %0 Generic %D 2019 %T CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps %A Stanimire Tomov %A Ahmad Abdelfattah %A Valeria Barra %A Natalie Beams %A Jed Brown %A Jean-Sylvain Camier %A Veselin Dobrev %A Jack Dongarra %A Yohann Dudouit %A Paul Fischer %A Ali Karakus %A Stefan Kerkemeier %A Tzanio Kolev %A YuHsiang Lan %A Elia Merzari %A Misun Min %A Aleks Obabko %A Scott Parker %A Thilina Ratnayaka %A Jeremy Thompson %A Ananias Tomboulides %A Vladimir Tomov %A Tim Warburton %I Zenodo %8 2019-10 %G eng %R https://doi.org/10.5281/zenodo.3477618 %0 Generic %D 2019 %T CEED ECP Milestone Report: Public release of CEED 2.0 %A Jed Brown %A Ahmad Abdelfattah %A Valeria Barra %A Veselin Dobrev %A Yohann Dudouit %A Paul Fischer %A Tzanio Kolev %A David Medina %A Misun Min %A Thilina Ratnayaka %A Cameron Smith %A Jeremy Thompson %A Stanimire Tomov %A Vladimir Tomov %A Tim Warburton %I Zenodo %8 2019-04 %G eng %U https://doi.org/10.5281/zenodo.2641316 %R 10.5281/zenodo.2641316 %0 Generic %D 2019 %T An Empirical View of SLATE Algorithms on Scalable Hybrid System %A Asim YarKhan %A Jakub Kurzak %A Ahmad Abdelfattah %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2019-09 %G eng %0 Conference Paper %B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2019 %T Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %B 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Rio de Janeiro, Brazil %8 2019-05 %G eng %0 Conference Paper %B 48th International Conference on Parallel Processing (ICPP 2019) %D 2019 %T Massively Parallel Automated Software Tuning %A Jakub Kurzak %A Yaohung Tsai %A Mark Gates %A Ahmad Abdelfattah %A Jack Dongarra %X This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined. %B 48th International Conference on Parallel Processing (ICPP 2019) %I ACM Press %C Kyoto, Japan %8 2019-08 %G eng %R https://doi.org/10.1145/3337821.3337908 %0 Generic %D 2019 %T Optimizing Batch HGEMM on Small Sizes Using Tensor Cores %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %I GPU Technology Conference (GTC) %C San Jose, CA %8 2019-03 %G eng %0 Conference Paper %B IEEE High Performance Extreme Computing Conference (HPEC’19) %D 2019 %T Progressive Optimization of Batched LU Factorization on GPUs %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %B IEEE High Performance Extreme Computing Conference (HPEC’19) %I IEEE %C Waltham, MA %8 2019-09 %G eng %0 Conference Paper %B ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2019 %T Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %K Half precision %K mixed-precision solvers %K Tensor cores FP16 arithmetic %B ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I IEEE %C Denver, CO %8 2019-11 %G eng %0 Generic %D 2018 %T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Marc Baboulin %A Joël Falcou %A Jack Dongarra %X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1; 600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. %B Innovative Computing Laboratory Technical Report %I Innovative Computing Laboratory, University of Tennessee %8 2018-09 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2018 %T Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Dense linear solvers %K energy efficiency %K GPU computing %X Graphics Processing Units (GPUs) are widely used in accelerating dense linear solvers. The matrix factorizations, which dominate the runtime for these solvers, are often designed using a hybrid scheme, where GPUs perform trailing matrix updates, while the CPUs perform the panel factorizations. Consequently, hybrid solutions require high-end CPUs and optimized CPU software in order to deliver high performance. Furthermore, they lack the energy efficiency inherent for GPUs due to the use of less energy-efficient CPUs, as well as CPU-GPU communications. This paper presents analysis and design techniques that overcome the shortcomings of the hybrid algorithms, and allow the design of high-performance and energy-efficient dense LU and Cholesky factorizations that use GPUs only. The full GPU solution eliminates the need for a high-end CPU and optimized CPU software, which leads to a better energy efficiency. We discuss different design choices, and introduce optimized GPU kernels for panel factorizations. The developed solutions achieve 90+ percent of the performance of optimized hybrid solutions, while improving the energy efficiency by 50 percent. They outperform the vendor library by 30-50 percent in single precision, and 15-50 percent in double precision. We also show that hybrid designs trail the proposed solutions in performance when optimized CPU software is not available. %B IEEE Transactions on Parallel and Distributed Systems %V 29 %P 2700–2712 %8 2018-12 %G eng %N 12 %R 10.1109/TPDS.2018.2842785 %0 Conference Paper %B IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2018 %T Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters %A Ichitaro Yamazaki %A Ahmad Abdelfattah %A Akihiro Ida %A Satoshi Ohshima %A Stanimire Tomov %A Rio Yokota %A Jack Dongarra %X ppohBEM is an open-source software package im- plementing the boundary element method. One of its main software tasks is the solution of the dense linear system of equations, for which, ppohBEM relies on another software package called HACApK. To reduce the cost of solving the linear system, HACApK hierarchically compresses the coefficient matrix using adaptive cross approximation. This hierarchical compression greatly reduces the storage and time complexities of the solver and enables the solution of large-scale boundary value problems. To extend the capability of ppohBEM, in this paper, we carefully port the HACApK’s linear solver onto GPU clusters. Though the potential of the GPUs has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUs for a solver, like HACApK’s, that requires fine-grained computation and global communication. First, to utilize the GPUs, we integrate the batched GPU kernel that was recently released in the MAGMA software package. We discuss several techniques to improve the performance of the batched kernel. We then study various techniques to address the inter-GPU communication and study their effects on state-of- the-art GPU clusters. We believe that the techniques studied in this paper are of interest to a wide range of software packages running on GPUs, especially with the increasingly complex node architectures and the growing costs of the communication. We also hope that our efforts to integrate the GPU kernel or to setup the inter-GPU communication will influence the design of the future-generation batched kernels or the communication layer within a software stack. %B IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Vancouver, BC, Canada %8 2018-05 %G eng %0 Journal Article %J Journal of Computational Science %D 2018 %T Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batch computation %K GPU computing %K matrix factorization %X The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8×. %B Journal of Computational Science %V 26 %P 226–236 %8 2018-05 %G eng %R https://doi.org/10.1016/j.jocs.2018.01.005 %0 Conference Proceedings %B International Conference on Computational Science (ICCS 2018) %D 2018 %T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques %A Azzam Haidar %A Ahmad Abdelfattah %A Mawussi Zounon %A Panruo Wu %A Srikara Pranesh %A Stanimire Tomov %A Jack Dongarra %X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both the hardware features and algorithms is an effective solution to achieve power efficiency, and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most of the energy efficiency approaches aim to reduce the consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency. Compared to highly-optimized linear system solvers, our kernels deliver the same accuracy solution up to 2× faster and reduce the energy consumption up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to 4× , with more than 80% reduction in the energy consumption. %B International Conference on Computational Science (ICCS 2018) %I Springer %C Wuxi, China %V 10860 %P 586–600 %8 2018-06 %G eng %U https://rdcu.be/bcKSC %R https://doi.org/10.1007/978-3-319-93698-7_45 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2018 %T A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations %A Azzam Haidar %A Ahmad Abdelfattah %A Mawussi Zounon %A Stanimire Tomov %A Jack Dongarra %X We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6x for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU. %B IEEE Transactions on Parallel and Distributed Systems %V 29 %P 973–984 %8 2018-05 %G eng %N 5 %R 10.1109/TPDS.2017.2783929 %0 Generic %D 2018 %T Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100 %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %I GPU Technology Conference (GTC), Poster %C San Jose, CA %8 2018-03 %G eng %0 Generic %D 2018 %T Implementation of the C++ API for Batch BLAS %A Ahmad Abdelfattah %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2018-06 %G eng %1 07 %0 Generic %D 2018 %T MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines %A Ahmad Abdelfattah %A Jack Dongarra %A Azzam Haidar %A Stanimire Tomov %A Ichitaro Yamazaki %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster %C Dallas, TX %8 2018-11 %G eng %0 Generic %D 2018 %T MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR) %A Azzam Haidar %A Stanimire Tomov %A Ahmad Abdelfattah %A Ichitaro Yamazaki %A Jack Dongarra %I NSF PI Meeting, Poster %C Washington, DC %8 2018-04 %G eng %R https://doi.org/10.6084/m9.figshare.6174143.v3 %0 Conference Paper %B IEEE High Performance Extreme Computing Conference (HPEC’18) %D 2018 %T Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X This paper introduces several frameworks for the design and implementation of high performance GPU kernels that target batch workloads with irregular sizes. Such workloads are ubiquitous in many scientific applications, including sparse direct solvers, astrophysics, and quantum chemistry. The paper addresses two main categories of frameworks, taking the Cholesky factorization as a case study. The first uses hostside kernel launches, and the second uses device-side launches. Within each category, different design options are introduced, with an emphasis on the advantages and the disadvantages of each approach. Our best performing design outperforms the state-of-the-art CPU implementation, scoring up to 4.7× speedup in double precision on a Pascal P100 GPU. %B IEEE High Performance Extreme Computing Conference (HPEC’18) %I IEEE %C Waltham, MA %8 2018-09 %G eng %0 Generic %D 2018 %T Tensor Contractions using Optimized Batch GEMM Routines %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %I GPU Technology Conference (GTC), Poster %C San Jose, CA %8 2018-03 %G eng %0 Generic %D 2018 %T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption %A Azzam Haidar %A Stanimire Tomov %A Ahmad Abdelfattah %A Mawussi Zounon %A Jack Dongarra %I ISC High Performance (ISC18), Best Poster Award %C Frankfurt, Germany %8 2018-06 %G eng %0 Conference Paper %B ISC High Performance (ISC'18), Best Poster %D 2018 %T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption %A Azzam Haidar %A Stanimire Tomov %A Ahmad Abdelfattah %A Mawussi Zounon %A Jack Dongarra %B ISC High Performance (ISC'18), Best Poster %C Frankfurt, Germany %8 2018-06 %G eng %0 Generic %D 2017 %T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation %C Atlanta, GA %8 2017-03 %G eng %0 Generic %D 2017 %T C++ API for Batch BLAS %A Ahmad Abdelfattah %A Konstantin Arturov %A Cris Cecka %A Jack Dongarra %A Chip Freitag %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Panruo Wu %B SLATE Working Notes %I University of Tennessee %8 2017-12 %G eng %1 04 %0 Generic %D 2017 %T C++ API for BLAS and LAPACK %A Mark Gates %A Piotr Luszczek %A Ahmad Abdelfattah %A Jakub Kurzak %A Jack Dongarra %A Konstantin Arturov %A Cris Cecka %A Chip Freitag %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %1 02 %0 Journal Article %J Procedia Computer Science %D 2017 %T Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X This paper presents new algorithmic approaches and optimization techniques for LU factorization and matrix inversion of millions of very small matrices using GPUs. These problems appear in many scientific applications including astrophysics and generation of block-Jacobi preconditioners. We show that, for very small problem sizes, design and optimization of GPU kernels require a mindset different from the one usually used when designing LAPACK algorithms for GPUs. Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. We also take advantage of the small matrix sizes to eliminate the intermediate row interchanges in both the factorization and inversion kernels. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6× for the factorization, and 14× for the inversion, using double precision arithmetic on a Pascal P100 GPU. %B Procedia Computer Science %V 108 %P 606–615 %8 2017-06 %G eng %R https://doi.org/10.1016/j.procs.2017.05.250 %0 Journal Article %J Journal of Computational Science %D 2017 %T Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K GPU computing; Cholesky factorization; Batched execution %X This paper presents a GPU-accelerated Cholesky factorization for two different modes of operation. The first one is the batch mode, where many independent factorizations on small matrices can be performed concurrently. This mode supports fixed size and variable size problems, and is found in many scientific applications. The second mode is the native mode, where one factorization is performed on a large matrix without any CPU involvement, which allows the CPU do other useful work. We show that, despite the different workloads, both modes of operation share a common code-base that uses the GPU only. We also show that the developed routines achieve significant speedups against a multicore CPU using the MKL library, and against a GPU implementation by cuSOLVER. This work is part of the MAGMA library. %B Journal of Computational Science %V 20 %P 85–93 %8 2017-05 %G eng %R https://doi.org/10.1016/j.jocs.2016.12.009 %0 Conference Paper %B Proceedings of the General Purpose GPUs (GPGPU-10) %D 2017 %T High-performance Cholesky Factorization for GPU-only Execution %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications can not be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8× faster than high-end multicore CPUs, e.g., two 10-cores Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 Gflop/s. The new algorithm also outperforms significantly the GPU-only implementation currently available in the NVIDIA cuSOLVER library. %B Proceedings of the General Purpose GPUs (GPGPU-10) %I ACM %C Austin, TX %8 2017-02 %G eng %R https://doi.org/10.1145/3038228.3038237 %0 Conference Paper %B International Conference on Supercomputing (ICS '17) %D 2017 %T Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %B International Conference on Supercomputing (ICS '17) %I ACM %C Chicago, Illinois %8 2017-06 %G eng %U http://dl.acm.org/citation.cfm?id=3079103 %R 10.1145/3079079.3079103 %0 Generic %D 2017 %T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale %A Ahmad Abdelfattah %A Hartwig Anzt %A Aurelien Bouteiller %A Anthony Danalis %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Stephen Wood %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %9 SLATE Working Notes %1 01 %0 Generic %D 2017 %T Small Tensor Operations on Advanced Architectures for High-Order Applications %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %B University of Tennessee Computer Science Technical Report %I Innovative Computing Laboratory, University of Tennessee %8 2017-04 %G eng %0 Journal Article %J Computing in Science & Engineering %D 2017 %T With Extreme Computing, the Rules Have Changed %A Jack Dongarra %A Stanimire Tomov %A Piotr Luszczek %A Jakub Kurzak %A Mark Gates %A Ichitaro Yamazaki %A Hartwig Anzt %A Azzam Haidar %A Ahmad Abdelfattah %X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future. %B Computing in Science & Engineering %V 19 %P 52-62 %8 2017-05 %G eng %N 3 %R https://doi.org/10.1109/MCSE.2017.48 %0 Generic %D 2016 %T Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs %A Azzam Haidar %A Ahmad Abdelfattah %A Veselin Dobrev %A Ian Karlin %A Tzanio Kolev %A Stanimire Tomov %A Jack Dongarra %I moky Mountains Computational Sciences and Engineering Conference (SMC16), Poster %C Gatlinburg, TN %8 2016-09 %G eng %0 Generic %D 2016 %T Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %I GPU Technology Conference (GTC16), Poster %C San Jose, CA %8 2016-04 %G eng %0 Conference Paper %B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016 %D 2016 %T On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batched computation %K GPUs %K variable small sizes %X
Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such assumption. Consequently, an efficient solution for variable-size batched computations is required.
This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problems is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. According to the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2:5 against two Sandy Bridge CPUs (8-core each) running Intel MKL library.
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Paper %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %D 2016 %T High-performance Matrix-matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Joël Falcou %A Jack Dongarra %X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32) however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present specialized for these cases algorithms and optimization techniques to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries. %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %I Springer International Publishing %C Grenoble, France %8 2016-08 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Generic %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon ES-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-01 %G eng %0 Journal Article %J Acta Numerica %D 2016 %T Linear Algebra Software for Large-Scale Accelerated Multicore Computing %A Ahmad Abdelfattah %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A undefined %A Asim YarKhan %X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems. %B Acta Numerica %V 25 %P 1-160 %8 2016-05 %G eng %R 10.1017/S0962492916000015 %0 Generic %D 2016 %T MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Ahmad Abdelfattah %A Jack Dongarra %X A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how batched GEMV and GEMM to be able to assist batched advance factorization (e.g. bi-diagonalization) and other BLAS routines (e.g. triangular solve) to achieve optimal performance on GPUs. Our solutions achieved up to 2.8-3× speedups compared to CUBLAS and MKL solutions, wherever possible. We illustrated the batched methodology on a real-world Hydrodynamic application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5× speedup and a 1.4× greenup are obtained by changing 10% of the code. We accelerated and scaled it on Titan supercomputer to 4096 nodes. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2016-08 %G eng %0 Generic %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X Abstract. The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, there becomes a need to have a high performance GEMM kernel for a batch of small matrices. Such kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both xed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance test reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-02 %G eng %0 Conference Paper %B The International Supercomputing Conference (ISC High Performance 2016) %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B The International Supercomputing Conference (ISC High Performance 2016) %C Frankfurt, Germany %8 2016-06 %G eng %0 Book Section %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %E Julian M. Kunkel %E Pavan Balaji %E Jack Dongarra %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %I Springer International Publishing %P 21–38 %@ 978-3-319-41321-1 %G eng %U http://dx.doi.org/10.1007/978-3-319-41321-1_2 %R 10.1007/978-3-319-41321-1_2 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2016 %T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs %A Ahmad Abdelfattah %A Hatem Ltaeif %A David Keyes %A Jack Dongarra %X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications. %B Concurrency and Computation: Practice and Experience %V 28 %P 3447 - 3465 %8 2016-05 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full %N 12 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.v28.1210.1002/cpe.3874 %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batched computation %K Cholesky Factorization %K GPUs %K Tuning %XSolving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually.
This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2 faster.
%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Batched Matrix Computations on Hardware Accelerators Based on GPUs %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general. %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K plasma %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B Supercomputing Frontiers and Innovations %V 2 %8 2015-10 %G eng %R 10.14529/jsfi1504 %0 Journal Article %J VECPAR 2012 %D 2012 %T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators %A Ahmad Abdelfattah %A Jack Dongarra %A David Keyes %A Hatem Ltaeif %B VECPAR 2012 %C Kobe, Japan %8 2012-07 %G eng