“Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures,” Journal of Computational Science, vol. 26, pp. 226–236, May 2018. DOI: 10.1016/j.jocs.2018.01.005
“Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling,” Journal of Advances in Modeling Earth Systems, vol. 10, issue 8, pp. 1952–1969, August 2018. DOI: 10.1029/2018MS001276
“The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques,” International Conference on Computational Science (ICCS 2018), vol. 10860, Wuxi, China, Springer, pp. 586–600, June 2018. DOI: 10.1007/978-3-319-93698-7_45
“Evaluation and Design of FFT for Distributed Accelerated Systems,” ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.
“A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 5, pp. 973–984, May 2018. DOI: 10.1109/TPDS.2017.2783929
“Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,” The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
“Investigating Power Capping toward Energy-Efficient Scientific Applications,” Concurrency and Computation: Practice and Experience, vol. 2018, issue e4485, pp. 1–14, April 2018. DOI: 10.1002/cpe.4485
MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines, Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR), Washington, DC, NSF PI Meeting, Poster, April 2018. DOI: 10.6084/m9.figshare.6174143.v3
“Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,” IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.
“The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018. DOI: 10.1137/17M1117732
“Software-Defined Events (SDEs) in MAGMA-Sparse,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-12: University of Tennessee, December 2018.
“Solving Linear Diophantine Systems on Parallel Architectures,” IEEE Transactions on Parallel and Distributed Systems, October 2018. DOI: 10.1109/TPDS.2018.2873354
Tensor Contractions using Optimized Batch GEMM Routines, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption, Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
“Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption,” ISC High Performance (ISC'18), Best Poster, Frankfurt, Germany, June 2018.
“Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,” Parallel Computing, vol. 81, pp. 1–21, January 2019. DOI: 10.1016/j.parco.2018.10.003
“Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.
“Evaluation of Directive-Based Performance Portable Programming Models,” International Journal of High Performance Computing and Networking (to appear), 2019. DOI: 10.1504/IJHPCN.2017.10009064
“Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs,” 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
FFT-ECP Fast Fourier Transform, Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
“FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.
“GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,” EuroMPI'19 Posters, Zurich, Switzerland, no. icl-ut-19-06: ICL, September 2019.
“Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,” ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
“Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,” SC'19, Proc. of Workshop on Exascale MPI (ExaMPI), Denver, CO, 2019.
MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs: University of Tennessee, January 2019. DOI: 10.13140/RG.2.2.14906.64961
“MagmaDNN: Accelerated Deep Learning Using MAGMA,” Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
“MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing,” ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
“OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework,” Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
Optimizing Batch HGEMM on Small Sizes Using Tensor Cores, San Jose, CA, GPU Technology Conference (GTC), March 2019.
“Progressive Optimization of Batched LU Factorization on GPUs,” IEEE High Performance Extreme Computing Conference (HPEC’19), Waltham, MA, IEEE, September 2019.