Linear Algebra Software for High-Performance Computing (Part 2: Software for Hardware Accelerators and Coprocessors) , Frankfurt, Germany, ISC High Performance (ISC18), Tutorial Presentation, June 2015.
An Introduction to the MAGMA project - Acceleration of Dense Linear Algebra : NVIDIA Webinar, June 2010.
The Future of Computing: Software Libraries , Savannah, GA, DOD CREATE Developers' Review, Keynote Presentation, February 2012.
Dense Linear Algebra Solvers for Multicore with GPU Accelerators , Atlanta, GA, International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010.
Autotuning Dense Linear Algebra Libraries on GPUs , Basel, Switzerland, Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010), June 2010.
Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched , Atlanta, GA, SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation, March 2017.
Accelerating Linear Algebra with MAGMA , Knoxville, TN, ECP Annual Meeting 2018, Tutorial, February 2018.
Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers : 2010 Symposium on Application Accelerators in. High-Performance Computing (SAAHPC'10), Tutorial, July 2010.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption , Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
Towards a High-Performance Tensor Algebra Package for Accelerators , Gatlinburg, TN, moky Mountains Computational Sciences and Engineering Conference (SMC15), September 2015.
Tensor Contractions using Optimized Batch GEMM Routines , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
A Standard for Batched BLAS Routines , Paris, France, 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16), April 2016.
Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators , Knoxville, TN, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster, July 2010.
Power-aware Computing on GPGPUs , Gatlinburg, TN, Fall Creek Falls Conference, Poster, September 2011.
Optimizing Batch HGEMM on Small Sizes Using Tensor Cores , San Jose, CA, GPU Technology Conference (GTC), March 2019.
Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project , Portland, Oregon, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects , Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR) , Washington, DC, NSF PI Meeting, Poster, April 2018. DOI: 10.6084/m9.figshare.6174143.v3
MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines , Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100 , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,” EuroMPI'19 Posters, Zurich, Switzerland, no. icl-ut-19-06: ICL, September 2019.“
FFT-ECP Fast Fourier Transform , Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project) , Austin, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08), November 2008.
Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes , San Jose, CA, GPU Technology Conference (GTC16), Poster, April 2016.
Acceleration of the BLAST Hydro Code on GPU,” Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.“
Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs , Gatlinburg, TN, moky Mountains Computational Sciences and Engineering Conference (SMC16), Poster, September 2016.
Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision , Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), ACM Student Research Poster, November 2018.
With Extreme Computing, the Rules Have Changed,” Computing in Science & Engineering, vol. 19, issue 3, pp. 52-62, May 2017. DOI: 10.1109/MCSE.2017.48“
Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy,” ACM Transactions on Mathematical Software, vol. 34, no. 4, pp. 17-22, 00 2008.“
Using MAGMA with PGI Fortran,” PGI Insider, November 2010.“
The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot,” Journal of Computational Physics (submitted), January 2006.“
The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot,” Journal of Computational Physics, vol. 223, pp. 774-782, 00 2007.“
Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,” Concurrency and Computation: Practice and Experience, October 2013.“
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems,” Parallel Computing, vol. 36, no. 5-6, pp. 232-240, 00 2010.“
Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems,” IEEE Embedded Systems Letters, vol. 9, issue 3, pp. 61–64, May 2017. DOI: 10.1109/LES.2017.2700401“
State-of-the-Art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-Systems,” Journal of Computational Physics, vol. 227, no. 15, pp. 7113-7124, January 2008.“
Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU,” ACM Transactions on Mathematical Software (TOMS), vol. 43, issue 2, October 2016.“
Solving Linear Diophantine Systems on Parallel Architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, issue 5, pp. 1158-1169, May 2019. DOI: http://dx.doi.org/10.1109/TPDS.2018.2873354“
Solving Dense Symmetric Indefinite Systems using GPUs,” Concurrency and Computation: Practice and Experience, vol. 29, issue 9, March 2017. DOI: 10.1002/cpe.4055“
Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, Seattle, WA, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11, November 2011.“
Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, vol. 4, issue 6, pp. 457–464, November 2013. DOI: http://dx.doi.org/10.1016/j.jocs.2013.01.004“
Soft Error Resilient QR Factorization for Hybrid System,” UT-CS-11-675 (also LAPACK Working Note #252), no. ICL-CS-11-675, July 2011.“
The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018. DOI: 10.1137/17M1117732“
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators,” Proc. of VECPAR'10 (to appear), Berkeley, CA, June 2010.“
Scalability Study of a Quantum Simulation Code,” PARA 2010, Reykjavik, Iceland, June 2010.“
Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD,” Concurrency and Computation: Practice and Experience, April 2020. DOI: 10.1002/cpe.5754“
Prospectus for the Next LAPACK and ScaLAPACK Libraries,” PARA 2006, Umea, Sweden, June 2006.“
Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,” LAWN 267, 00 2012.“