Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects , Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project , Portland, Oregon, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Accelerating GPU Kernels for Dense Linear Algebra,” Proc. of VECPAR'10, Berkeley, CA, June 2010.“
Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing,” Parallel Computing, vol. 36, no. 12, pp. 645-654, 00 2010.“
Autotuning Dense Linear Algebra Libraries on GPUs , Basel, Switzerland, Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010), June 2010.
Blas for GPUs,” Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.“
Dense Linear Algebra Solvers for Multicore with GPU Accelerators,” Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, Atlanta, GA, pp. 1-8, 2010. DOI: 10.1109/IPDPSW.2010.5470941“
Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,” IEEE Transaction on Parallel and Distributed Systems (submitted), March 2010.“
An Improved MAGMA GEMM for Fermi GPUs,” University of Tennessee Computer Science Technical Report, no. UT-CS-10-655 (also LAPACK working note 227), July 2010.“
An Improved MAGMA GEMM for Fermi GPUs,” International Journal of High Performance Computing, vol. 24, no. 4, pp. 511-515, 00 2010.“
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators,” Proc. of VECPAR'10 (to appear), Berkeley, CA, June 2010.“
Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators , Knoxville, TN, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster, July 2010.
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,” ACM/IEEE Conference on Supercomputing (SC’11), Seattle, WA, November 2011.“
An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs,” Applied Parallel and Scientific Computing, vol. 7133, pp. 248-257, 00 2012.“