%0 Generic
%D 2019
%T CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Valeria Barra
%A Natalie Beams
%A Jed Brown
%A Jean-Sylvain Camier
%A Veselin Dobrev
%A Jack Dongarra
%A Yohann Dudouit
%A Paul Fischer
%A Ali Karakus
%A Stefan Kerkemeier
%A Tzanio Kolev
%A YuHsiang Lan
%A Elia Merzari
%A Misun Min
%A Aleks Obabko
%A Scott Parker
%A Thilina Ratnayaka
%A Jeremy Thompson
%A Ananias Tomboulides
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-10
%G eng
%U https://doi.org/10.5281/zenodo.3477618
%R 10.5281/zenodo.3477618
%0 Generic
%D 2019
%T CEED ECP Milestone Report: Public release of CEED 2.0
%A Jed Brown
%A Ahmad Abdelfattah
%A Valeria Barra
%A Veselin Dobrev
%A Yohann Dudouit
%A Paul Fischer
%A Tzanio Kolev
%A David Medina
%A Misun Min
%A Thilina Ratnayaka
%A Cameron Smith
%A Jeremy Thompson
%A Stanimire Tomov
%A Vladimir Tomov
%A Tim Warburton
%I Zenodo
%8 2019-04
%G eng
%U https://doi.org/10.5281/zenodo.2641316
%R 10.5281/zenodo.2641316
%0 Journal Article
%J Journal of Advances in Modeling Earth Systems
%D 2018
%T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling
%A Jian Sun
%A Joshua Fu
%A John Drake
%A Qingzhao Zhu
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K compiler
%K CUDA
%K data transfer
%K gpu
%K hybrid
%K memory layout
%X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulations with regional detail possible, but few studies have explored the potential benefit for atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to ~11.02× difference in the computational time. In contrast, the GPU version performs the best when using a combination of fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the hybrid CPU/GPU implementation, but rescheduling the workload among the CPU cores is required before practical CAM4‐Chem simulation.
%B Journal of Advances in Modeling Earth Systems
%V 10
%P 1952–1969
%8 2018-08
%G eng
%N 8
%U https://doi.org/10.1029/2018MS001276
%R 10.1029/2018MS001276
%0 Generic
%D 2017
%T C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Konstantin Arturov
%A Cris Cecka
%A Jack Dongarra
%A Chip Freitag
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Panruo Wu
%B SLATE Working Notes
%I University of Tennessee
%8 2017-12
%G eng
%0 Generic
%D 2016
%T Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC16), Poster
%C San Jose, CA
%8 2016-04
%G eng
%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%D 2015
%T Cholesky Across Accelerators
%A Asim YarKhan
%A Azzam Haidar
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%I IEEE
%C Elizabeth, NJ
%8 2015-08
%G eng
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra
%A Mark Gates
%A Stanimire Tomov
%A Azzam Haidar
%X Accelerating dense linear algebra using GPUs admits two models: hybrid CPU-GPU and GPU-only. The hybrid model factors the panel on the CPU while updating the trailing matrix on the GPU, concentrating the GPU on high-performance matrix multiplies. The GPU-only model performs the entire computation on the GPU, avoiding costly data transfers to the CPU. We compare these two approaches for three QR-based algorithms: QR factorization, rank revealing QR, and reduction to Hessenberg.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng
%0 Journal Article
%J Scientific Programming
%D 2015
%T Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X Low-rank matrices arise in many scientific and engineering computations. Both the computational and storage costs of manipulating such matrices may be reduced by taking advantage of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into the recently developed StruMF software, which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%-50% using the GPU.
%B Scientific Programming
%G eng
%0 Conference Paper
%B International Workshop on OpenCL
%D 2014
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B International Workshop on OpenCL
%C University of Bristol, England
%8 2014-05
%G eng
%0 Generic
%D 2013
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B University of Tennessee Technical Report (Lawn 275)
%I University of Tennessee
%8 2013-03
%G eng
%0 Conference Proceedings
%B Proc. of the International Conference on Computational Science (ICCS)
%D 2012
%T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines
%A Marc Baboulin
%A Simplice Donfack
%A Jack Dongarra
%A Laura Grigori
%A Adrien Remi
%A Stanimire Tomov
%K magma
%B Proc. of the International Conference on Computational Science (ICCS)
%V 9
%P 17-26
%8 2012-06
%G eng
%0 Conference Proceedings
%B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11)
%D 2011
%T A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
%A Mitch Horton
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%K quark
%B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11)
%C Knoxville, TN
%8 2011-07
%G eng
%0 Journal Article
%J International Journal of Computational Science and Engineering
%D 2006
%T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Jack Dongarra
%A Andrew Canning
%A Lin-Wang Wang
%B International Journal of Computational Science and Engineering
%V 2
%P 205-212
%8 2006
%G eng
%0 Conference Proceedings
%B Proceedings of 5th International Conference on Computational Science (ICCS)
%D 2005
%T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%E V. S. Sunderam
%E Geert Dick van Albada
%E Peter M. Sloot
%E Jack Dongarra
%K doe-nano
%B Proceedings of 5th International Conference on Computational Science (ICCS)
%I Springer's Lecture Notes in Computer Science
%C Atlanta, GA, USA
%P 317-325
%8 2005-01
%G eng
%0 Journal Article
%J International Journal of Computational Science and Engineering (to appear)
%D 2005
%T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%B International Journal of Computational Science and Engineering (to appear)
%8 2005-01
%G eng