Autotuning GEMMs for Fermi,” University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245), April 2011.“
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems , no. UT-CS-11-689, December 2011.
A Block-Asynchronous Relaxation Method for Graphics Processing Units,” University of Tennessee Computer Science Technical Report, no. UT-CS-11-687 / LAWN 258, November 2011.“
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures,” Symposium for Application Accelerators in High Performance Computing (SAAHPC'11), Knoxville, TN, July 2011.“
Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,” University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250), June 2011.“
A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,” in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 00 2011.“
LU Factorization for Accelerator-Based Systems,” IEEE/ACS AICCSA 2011, Sharm-El-Sheikh, Egypt, December 2011.“
MAGMA - LAPACK for GPUs , Atlanta, GA, Keeneland GPU Tutorial, April 2011.
MAGMA - LAPACK for HPC on Heterogeneous Architectures , Oak Ridge, TN, Titan Summit at Oak Ridge National Laboratory, Presentation, August 2011.
Matrix Algebra on GPU and Multicore Architectures , Basel, Switzerland, Workshop on GPU-enabled Numerical Libraries, Presentation, May 2011.
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,” ACM/IEEE Conference on Supercomputing (SC’11), Seattle, WA, November 2011.“
Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,” International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, ACM, September 2011. DOI: 10.1109/ICPP.2011.71“
Performance Portability of a GPU Enabled Factorization with the DAGuE Framework,” IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC), June 2011.“
Power-aware Computing on GPGPUs , Gatlinburg, TN, Fall Creek Falls Conference, Poster, September 2011.
Soft Error Resilient QR Factorization for Hybrid System,” University of Tennessee Computer Science Technical Report, no. UT-CS-11-675, Knoxville, TN, July 2011.“
Soft Error Resilient QR Factorization for Hybrid System,” UT-CS-11-675 (also LAPACK Working Note #252), no. ICL-CS-11-675, July 2011.“
Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, Seattle, WA, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11, November 2011.“
A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems,” IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.“
Acceleration of the BLAST Hydro Code on GPU,” Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.“
Autotuning GEMM Kernels for the Fermi GPU,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, November 2012. DOI: 10.1109/TPDS.2011.311“
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,” ICCS 2012, Omaha, NE, June 2012.“
A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines,” Proc. of the International Conference on Computational Science (ICCS), vol. 9, pp. 17-26, June 2012.“
Dense Linear Algebra on Accelerated Multicore Hardware,” High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 00 2012.“
Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,” SIAM Journal on Scientific Computing, vol. 34(2), pp. C70-C82, April 2012.“
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,” 26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.“
From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,” Parallel Computing, vol. 38, no. 8, pp. 391-407, August 2012.“
The Future of Computing: Software Libraries , Savannah, GA, DOD CREATE Developers' Review, Keynote Presentation, February 2012.
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems , San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures , Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors , Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
Matrices Over Runtime Systems at Exascale,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.“
A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.“
One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,” The International Conference on Computational Science (ICCS), June 2012.“
Performance evaluation of LU factorization through hardware counter measurements,” University of Tennessee Computer Science Technical Report, no. ut-cs-12-700, October 2012.“
Power Aware Computing on GPUs,” SAAHPC '12 (Best Paper Award), Argonne, IL, July 2012.“
Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,” LAWN 267, 00 2012.“
Providing GPU Capability to LU and QR within the ScaLAPACK Framework,” University of Tennessee Computer Science Technical Report (also LAWN 272), no. UT-CS-12-699, September 2012.“
Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems,” Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper), Rhodes Island, Greece, August 2012.“
Accelerating Linear System Solutions Using Randomization Techniques,” ACM Transactions on Mathematical Software (also LAWN 246), vol. 39, issue 2, February 2013. DOI: 10.1145/2427023.2427025“
A Block-Asynchronous Relaxation Method for Graphics Processing Units,” Journal of Parallel and Distributed Computing, vol. 73, issue 12, pp. 1613–1626, December 2013. DOI: http://dx.doi.org/10.1016/j.jpdc.2013.05.008“
clMAGMA: High Performance Dense Linear Algebra with OpenCL,” University of Tennessee Technical Report (Lawn 275), no. UT-CS-13-706: University of Tennessee, March 2013.“
Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,” University of Tennessee Computer Science Technical Report, no. ut-cs-13-713, July 2013.“
Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,” University of Tennessee Computer Science Technical Report, no. ut-cs-13-714, July 2013.“
Keeneland: Computational Science Using Heterogeneous GPU Computing,” Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.“
Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations,” International Supercomputing Conference (ISC), Lecture Notes in Computer Science, vol. 7905, Leipzig, Germany, Springer Berlin Heidelberg, pp. 67-80, June 2013. DOI: 10.1007/978-3-642-38750-0_6“
Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,” PPAM 2013, Warsaw, Poland, September 2013.“
Scalable Dense Linear Algebra on Heterogeneous Hardware,” HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, 2013.“
Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, vol. 4, issue 6, pp. 457–464, November 2013. DOI: http://dx.doi.org/10.1016/j.jocs.2013.01.004“
Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication,” Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013. DOI: 10.1145/2464996.2465438“
Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,” Concurrency and Computation: Practice and Experience, October 2013.“