%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2020
%T Evaluating Asynchronous Schwarz Solvers on GPUs
%A Pratik Nayak
%A Terry Cojean
%A Hartwig Anzt
%K abstract Schwarz methods
%K asynchronous solvers
%K exascale
%K GPUs
%K multicore processors
%K parallel numerical linear algebra
%X With the commencement of the exascale computing era, we realize that the majority of leadership supercomputers are heterogeneous and massively parallel. Even a single node can contain multiple co-processors, such as GPUs, and multiple CPU cores. For example, ORNL’s Summit combines six NVIDIA Tesla V100 GPUs and 42 IBM Power9 cores on each node. Synchronizing across the compute resources of multiple nodes can be prohibitively expensive. Hence, it is necessary to develop and study asynchronous algorithms that circumvent this bottleneck of bulk-synchronous computing. In this study, we examine the asynchronous version of the abstract Restricted Additive Schwarz method as a solver. We do not synchronize explicitly, but allow the communication between the sub-domains to be completely asynchronous, thereby removing the bulk-synchronous nature of the algorithm. We accomplish this by using the one-sided Remote Memory Access (RMA) functions of the MPI standard. We study the benefits of using such an asynchronous solver over its synchronous counterpart. We also study the effect of the partitioning and of the overlap between the sub-domains on the communication patterns of the global solver. Finally, we show that this concept can deliver attractive performance benefits over the synchronous counterpart even for a well-balanced problem.
%8 2020-08
%G eng
%R 10.1177/1094342020946814
%0 Conference Paper
%B ISC High Performance
%D 2020
%T Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On
%A Yuhsiang M. Tsai
%A Terry Cojean
%A Hartwig Anzt
%K AMD
%K GPUs
%K NVIDIA
%K sparse matrix vector product (SpMV)
%X Efficiently processing sparse matrices is a central and performance-critical part of many scientific simulation codes. Recognizing the adoption of manycore accelerators in HPC, we evaluate in this paper the performance of the currently best sparse matrix-vector product (SpMV) implementations on high-end GPUs from AMD and NVIDIA. Specifically, we optimize SpMV kernels for the CSR, COO, ELL, and HYB formats, taking the hardware characteristics of the latest GPU technologies into account. For 2,800 test matrices, we compare the performance of our kernels against AMD’s hipSPARSE library and NVIDIA’s cuSPARSE library, and ultimately assess how the GPU technologies from AMD and NVIDIA compare in terms of SpMV performance.
%I Springer
%8 2020-06
%G eng
%R 10.1007/978-3-030-50743-5_16
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Toward a Modular Precision Ecosystem for High-Performance Computing
%A Hartwig Anzt
%A Goran Flegar
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%K conjugate gradient
%K GPUs
%K Jacobi method
%K modular precision
%K multicore processors
%K PageRank
%K parallel numerical linear algebra
%X With the memory bandwidth of current computer architectures lagging significantly behind the (floating point) arithmetic performance, many scientific computations leverage only a fraction of the computational power of today’s high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a “modular precision ecosystem” that allows for more flexibility in terms of customized data access. For memory-bound scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm, two applications that are popular in their communities and at the same time characteristic representatives of the fields of numerical linear algebra and data analytics, respectively.
%V 33
%P 1069-1078
%8 2019-11
%G eng
%N 6
%R 10.1177/1094342019846547
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Unveiling the Performance-Energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors
%A José I. Aliaga
%A Hartwig Anzt
%A Maribel Castillo
%A Juan C. Fernández
%A Germán León
%A Joaquín Pérez
%A Enrique S. Quintana-Orti
%K CG
%K CPUs
%K energy efficiency
%K GPUs
%K low-power architectures
%X In this paper, we analyze the interactions occurring in the performance-power-energy triangle for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to building faster processors. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights into the benefits of hands-on optimizations, we include runtime and energy efficiency results both for out-of-the-box usage relying exclusively on compiler optimizations and for implementations manually optimized for the target architectures, which range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems.
%V 27
%P 885-904
%8 2014-09
%G eng
%U https://doi.org/10.1002/cpe.3341
%N 4
%& 885
%R 10.1002/cpe.3341