%0 Generic
%D 2021
%T Ginkgo: A Sparse Linear Algebra Library for HPC
%A Hartwig Anzt
%A Natalie Beams
%A Terry Cojean
%A Fritz Göbel
%A Thomas Grützmacher
%A Aditya Kashi
%A Pratik Nayak
%A Tobias Ribizel
%A Yuhsiang M. Tsai
%I 2021 ECP Annual Meeting
%8 2021-04
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2020
%T Evaluating Asynchronous Schwarz Solvers on GPUs
%A Pratik Nayak
%A Terry Cojean
%A Hartwig Anzt
%K abstract Schwarz methods
%K Asynchronous solvers
%K exascale
%K GPUs
%K multicore processors
%K parallel numerical linear algebra
%X With the commencement of the exascale computing era, we realize that the majority of the leadership supercomputers are heterogeneous and massively parallel. Even a single node can contain multiple co-processors such as GPUs and multiple CPU cores. For example, ORNL’s Summit accumulates six NVIDIA Tesla V100 GPUs and 42 IBM Power9 cores on each node. Synchronizing across compute resources of multiple nodes can be prohibitively expensive. Hence, it is necessary to develop and study asynchronous algorithms that circumvent this issue of bulk-synchronous computing. In this study, we examine the asynchronous version of the abstract Restricted Additive Schwarz method as a solver. We do not explicitly synchronize, but allow the communication between the sub-domains to be completely asynchronous, thereby removing the bulk synchronous nature of the algorithm. We accomplish this by using the one-sided Remote Memory Access (RMA) functions of the MPI standard. We study the benefits of using such an asynchronous solver over its synchronous counterpart. We also study the communication patterns governed by the partitioning and the overlap between the sub-domains on the global solver. Finally, we show that this concept can render attractive performance benefits over the synchronous counterparts even for a well-balanced problem.
%B International Journal of High Performance Computing Applications
%8 2020-08
%G eng
%R https://doi.org/10.1177/1094342020946814

%0 Journal Article
%J Journal of Open Source Software
%D 2020
%T Ginkgo: A High Performance Numerical Linear Algebra Library
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%X Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures with a high level of performance portability and focuses on software sustainability. The library focuses on solving sparse linear systems and accommodates a large variety of matrix formats, state-of-the-art iterative (Krylov) solvers and preconditioners, which make the library suitable for a variety of scientific applications. Ginkgo supports many architectures such as multi-threaded CPU, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead. Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence there has been a significant amount of effort in this direction with software libraries such as UMFPACK (Davis, 2004) and CHOLMOD (Chen, Davis, Hager, & Rajamanickam, 2008) for solving linear systems with direct methods and PETSc (Balay et al., 2020), Trilinos (“The Trilinos Project Website,” 2020), Eigen (Guennebaud, Jacob, & others, 2010) and many more to solve linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide very efficient low level kernels optimized for different architectures and separate these kernels from the algorithms thereby ensuring extensibility and ease of use. Ginkgo is also a part of the xSDK effort (Bartlett et al., 2017) and available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for and interoperability between a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using High Performance Computing. Within this effort, we provide interoperability with application libraries such as deal.ii (Arndt et al., 2019) and mfem (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of the features of Ginkgo.
%B Journal of Open Source Software
%V 5
%8 2020-08
%G eng
%N 52
%R https://doi.org/10.21105/joss.02260

%0 Generic
%D 2020
%T Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster)
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2020
%T Load-Balancing Sparse Matrix Vector Product Kernels on GPUs
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Stanimire Tomov
%A Yuhsiang M. Tsai
%A Weichung Wang
%X Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the SuiteSparse Matrix Collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
%B ACM Transactions on Parallel Computing
%V 7
%8 2020-03
%G eng
%N 1
%R https://doi.org/10.1145/3380930

%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Ulrike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes

%0 Conference Paper
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%D 2019
%T Towards Continuous Benchmarking
%A Hartwig Anzt
%A Yen-Chen Chen
%A Terry Cojean
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Enrique S. Quintana-Orti
%A Yuhsiang M. Tsai
%A Weichung Wang
%X We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository and visualize it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics.
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%I ACM Press
%C Zurich, Switzerland
%8 2019-06
%@ 9781450367707
%G eng
%R https://doi.org/10.1145/3324989.3325719