%0 Generic
%D 2021
%T Ginkgo: A Sparse Linear Algebra Library for HPC
%A Hartwig Anzt
%A Natalie Beams
%A Terry Cojean
%A Fritz Göbel
%A Thomas Grützmacher
%A Aditya Kashi
%A Pratik Nayak
%A Tobias Ribizel
%A Yuhsiang M. Tsai
%I 2021 ECP Annual Meeting
%8 2021-04
%G eng
%0 Journal Article
%J Journal of Open Source Software
%D 2021
%T libCEED: Fast algebra for high-order element-based discretizations
%A Jed Brown
%A Ahmad Abdelfattah
%A Valeria Barra
%A Natalie Beams
%A Jean-Sylvain Camier
%A Veselin Dobrev
%A Yohann Dudouit
%A Leila Ghaffari
%A Tzanio Kolev
%A David Medina
%A Will Pazner
%A Thilina Ratnayaka
%A Jeremy Thompson
%A Stan Tomov
%K finite elements
%K high-order methods
%K High-performance computing
%K matrix-free
%K spectral elements
%X Finite element methods are widely used to solve partial differential equations (PDE) in science and engineering, but their standard implementation (Arndt et al., 2020; Kirk et al., 2006; Logg et al., 2012) relies on assembling sparse matrices. Sparse matrix multiplication and triangular operations perform a scalar multiply and add for each nonzero entry, just 2 floating point operations (flops) per scalar that must be loaded from memory (Williams et al., 2009). Modern hardware is capable of nearly 100 flops per scalar streamed from memory (Rupp, 2020) so sparse matrix operations cannot achieve more than about 2% utilization of arithmetic units. Matrix assembly becomes even more problematic when the polynomial degree p of the basis functions is increased, resulting in O(p^d) storage and O(p^{2d}) compute per degree of freedom (DoF) in d dimensions. Methods pioneered by the spectral element community (Deville et al., 2002; Orszag, 1980) exploit problem structure to reduce costs to O(1) storage and O(p) compute per DoF, with very high utilization of modern CPUs and GPUs. Unfortunately, high-quality implementations have been relegated to applications and intrusive frameworks that are often difficult to extend to new problems or incorporate into legacy applications, especially when strong preconditioners are required. libCEED, the Code for Efficient Extensible Discretization (Abdelfattah et al., 2021), is a lightweight library that provides a purely algebraic interface for linear and nonlinear operators and preconditioners with element-based discretizations. libCEED provides portable performance via run-time selection of implementations optimized for CPUs and GPUs, including support for just-in-time (JIT) compilation. It is designed for convenient use in new and legacy software, and offers interfaces in C99 (International Standards Organisation, 1999), Fortran77 (ANSI, 1978), Python (Python, 2021), Julia (Bezanson et al., 2017), and Rust (Rust, 2021).
Users and library developers can integrate libCEED at a low level into existing applications in place of existing matrix-vector products without significant refactoring of their own discretization infrastructure. Alternatively, users can utilize integrated libCEED support in MFEM (Anderson et al., 2020; MFEM, 2021). In addition to supporting applications and discretization libraries, libCEED provides a platform for performance engineering and co-design, as well as an algebraic interface for solvers research like adaptive p-multigrid, much like how sparse matrix libraries enable development and deployment of algebraic multigrid solvers.
%B Journal of Open Source Software
%V 6
%P 2945
%G eng
%U https://doi.org/10.21105/joss.02945
%R 10.21105/joss.02945
%0 Conference Paper
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%D 2020
%T High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs
%A Natalie Beams
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%A Tzanio Kolev
%A Yohann Dudouit
%K Batched linear algebra
%K finite elements
%K gpu
%K high-order methods
%K matrix-free FEM
%K Tensor contractions
%X We present new GPU implementations of the tensor contractions arising from basis-related computations for high-order finite element methods. We consider both tensor and non-tensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% for some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMMs implementation is twice as fast as what was previously available for basis function order greater than five and greater than approximately 10^5 degrees of freedom in the mesh; up to ten times speedup is seen for eighth-order basis functions.
%B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
%I IEEE
%8 2020-11
%G eng
%0 Generic
%D 2019
%T CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Valeria Barra
%A Natalie Beams
%A Jed Brown
%A Jean-Sylvain Camier
%A Veselin Dobrev
%A Jack Dongarra
%A Yohann Dudouit
%A Paul Fischer
%A Ali Karakus
%A Stefan Kerkemeier
%A Tzanio Kolev
%A YuHsiang Lan
%A Elia Merzari
%A Misun Min
%A Aleks Obabko
%A Scott Parker
%A Thilina Ratnayaka
%A Jeremy Thompson
%A Ananias Tomboulides
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-10
%G eng
%R 10.5281/zenodo.3477618