The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques,” International Conference on Computational Science (ICCS 2018), vol. 10860, Wuxi, China, Springer, pp. 586–600, June 2018. DOI: 10.1007/978-3-319-93698-7_45“
Evaluation and Design of FFT for Distributed Accelerated Systems,” ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.“
A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 5, pp. 973–984, May 2018. DOI: 10.1109/TPDS.2017.2783929“
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,” The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.“
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100 , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Investigating Power Capping toward Energy-Efficient Scientific Applications,” Concurrency Computation: Practice and Experience, vol. 2018, issue e4485, pp. 1-14, April 2018. DOI: 10.1002/cpe.4485“
MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines , Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR) , Washington, DC, NSF PI Meeting, Poster, April 2018. DOI: 10.6084/m9.figshare.6174143.v3
Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,” IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.“
The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018. DOI: 10.1137/17M1117732“
Tensor Contractions using Optimized Batch GEMM Routines , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption , Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption,” ISC High Performance (ISC'18), Best Poster, Frankfurt, Germany, June 2018.“
Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,” Parallel Computing, vol. 81, pp. 1–21, January 2019. DOI: 10.1016/j.parco.2018.10.003“
Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.“
Evaluation of Directive-Based Performance Portable Programming Models,” International Journal of High Performance Computing and Networking (to appear), 2019. DOI: 10.1504/IJHPCN.2017.10009064“
FFT-ECP Fast Fourier Transform , Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,” EuroMPI'19 Posters, Zurich, Switzerland, no. icl-ut-19-06: ICL, September 2019.“
MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs : University of Tennessee, January 2019. DOI: 10.13140/RG.2.2.14906.64961
PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,” ACM Transactions on Mathematical Software (to appear), 2019.“