%0 Generic
%D 2017
%T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation
%C Atlanta, GA
%8 2017-03
%G eng
%0 Generic
%D 2017
%T Small Tensor Operations on Advanced Architectures for High-Order Applications
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%B University of Tennessee Computer Science Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-04
%G eng
%0 Generic
%D 2016
%T Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Veselin Dobrev
%A Ian Karlin
%A Tzanio Kolev
%A Stanimire Tomov
%A Jack Dongarra
%I moky Mountains Computational Sciences and Engineering Conference (SMC16), Poster
%C Gatlinburg, TN
%8 2016-09
%G eng
%0 Generic
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon ES-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-01
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%K Applications
%K Batched linear algebra
%K FEM
%K gpu
%K Tensor contractions
%K Tensor HPC
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng
%0 Generic
%D 2015
%T Towards a High-Performance Tensor Algebra Package for Accelerators
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I moky Mountains Computational Sciences and Engineering Conference (SMC15)
%C Gatlinburg, TN
%8 2015-09
%G eng
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K CUDA
%K FEM
%K Finite element method
%K linear algebra
%K nVidia
%K Tesla K20
%X Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs [10, 1]. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor’s library routines substantially in performance. We proposed an autotuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2:5x overall speedup and 1:42x energy efficiency (greenup) using 4th order (Q4) finite elements, and a 1:9x speedup and 1:27x greenup using 2nd order (Q2) finite elements.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Generic
%D 2013
%T Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters
%A Tingxing Dong
%A Veselin Dobrev
%A Tzanio Kolev
%A Robert Rieben
%A Stanimire Tomov
%A Jack Dongarra
%X The explosion of parallelism and heterogeneity in today's computer architectures has created opportunities as well as challenges for redesigning legacy numerical software to harness the power of new hardware. In this paper we address the main challenges in redesigning BLAST { a numerical library that solves the equations of compressible hydrodynamics using high order nite element methods (FEM) in a moving Lagrangian frame { to support CPU-GPU clusters. We use a hybrid MPI + OpenMP + CUDA programming model that includes two layers: domain decomposed MPI parallelization and OpenMP + CUDA acceleration in a given domain. To optimize the code, we implemented custom linear algebra kernels and introduced an auto-tuning technique to deal with heterogeneity and load balancing at runtime. Our tests show that 12 Intel Xeon cores and two M2050 GPUs deliver a 24x speedup compared to a single core, and a 2.5x speedup compared to 12 MPI tasks in one node. Further, we achieve perfect weak scaling, demonstrated on a cluster with up to 64 GPUs in 32 nodes. Our choice of programming model and proposed solutions, as related to parallelism and load balancing, specifically targets high order FEM discretizations, and can be used equally successfully for applications beyond hydrodynamics. A major accomplishment is that we further establish the appeal of high order FEMs, which despite their better approximation properties, are often avoided due to their high computational cost. GPUs, as we show, have the potential to make them the method of choice, as the increased computational cost is also localized, e.g., cast as Level 3 BLAS, and thus can be done very efficiently (close to \free" relative to the usual overheads inherent in sparse computations).
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng
%0 Generic
%D 2012
%T Acceleration of the BLAST Hydro Code on GPU
%A Tingxing Dong
%A Tzanio Kolev
%A Robert Rieben
%A Veselin Dobrev
%A Stanimire Tomov
%A Jack Dongarra
%B Supercomputing '12 (poster)
%I SC12
%C Salt Lake City, Utah
%8 2012-11
%G eng