%0 Journal Article
%J Applied Parallel and Scientific Computing
%D 2012
%T An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs
%A Jakub Kurzak
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%E Kristján Jónasson
%B Applied Parallel and Scientific Computing
%V 7133
%P 248-257
%8 00-2012
%G eng
%0 Conference Proceedings
%B ACM/IEEE Conference on Supercomputing (SC’11)
%D 2011
%T Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Tingxing Dong
%A Jack Dongarra
%K magma
%B ACM/IEEE Conference on Supercomputing (SC’11)
%C Seattle, WA
%8 11-2011
%G eng
%0 Journal Article
%J Proc. of VECPAR'10
%D 2010
%T Accelerating GPU Kernels for Dense Linear Algebra
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B Proc. of VECPAR'10
%C Berkeley, CA
%8 06-2010
%G eng
%0 Journal Article
%J Parallel Computing
%D 2010
%T Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%B Parallel Computing
%V 36
%P 645-654
%8 00-2010
%G eng
%0 Book Section
%B Scientific Computing with Multicore and Accelerators
%D 2010
%T BLAS for GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%B Scientific Computing with Multicore and Accelerators
%S Chapman & Hall/CRC Computational Science
%I CRC Press
%C Boca Raton, Florida
%@ 9781439825365
%G eng
%& 4
%0 Conference Proceedings
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%A Rajib Nath
%A Hatem Ltaief
%A Jack Dongarra
%X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%C Atlanta, GA
%P 1-8
%G eng
%R 10.1109/IPDPSW.2010.5470941
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (submitted)
%D 2010
%T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators
%A Hatem Ltaief
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%K plasma
%B IEEE Transactions on Parallel and Distributed Systems (submitted)
%8 03-2010
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2010
%T An Improved MAGMA GEMM for Fermi GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B International Journal of High Performance Computing Applications
%V 24
%P 511-515
%8 00-2010
%G eng
%0 Generic
%D 2010
%T An Improved MAGMA GEMM for Fermi GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B University of Tennessee Computer Science Technical Report
%8 07-2010
%G eng
%0 Journal Article
%J Proc. of VECPAR'10 (to appear)
%D 2010
%T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
%A Hatem Ltaief
%A Stanimire Tomov
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%K magma
%K plasma
%B Proc. of VECPAR'10 (to appear)
%C Berkeley, CA
%8 06-2010
%G eng