“Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,” University of Tennessee Computer Science Technical Report, no. ut-cs-13-714, July 2013.
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems, San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
“Acceleration of the BLAST Hydro Code on GPU,” Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.
“Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,” Concurrency and Computation: Practice and Experience, October 2013.
“Batched matrix computations on hardware accelerators based on GPUs,” International Journal of High Performance Computing Applications, February 2015. DOI: 10.1177/1094342014567546
“Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs,” Journal of Computational Science, vol. 26, pp. 237–245, May 2018. DOI: 10.1016/j.jocs.2018.01.007
“Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,” ACM/IEEE Conference on Supercomputing (SC’11), Seattle, WA, November 2011.
“Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster,” The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2013.
“A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU,” IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,” International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, Procedia Computer Science, June 2017.
“Optimization for Performance and Energy for Batched Matrix Computations on GPUs,” 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8), San Francisco, CA, ACM, February 2015. DOI: 10.1145/2716282.2716288
“Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs,” VECPAR 2014 (Best Paper), Eugene, OR, June 2014.
“LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU,” 16th IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
“Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations,” ISC High Performance, Frankfurt, Germany, Springer, July 2015.
“A Fast Batched Cholesky Factorization on a GPU,” International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, September 2014.