“Accelerating Numerical Dense Linear Algebra Calculations with GPUs,” Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014. DOI: 10.1007/978-3-319-06548-9_1
“Bringing High Performance Computing to Big Data Algorithms,” Handbook of Big Data Technologies: Springer, 2017. DOI: 10.1007/978-3-319-49340-4
“Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures,” Lecture Notes in Computer Science, vol. 9573: Springer International Publishing, pp. 86-95, September 2015, 2016. DOI: 10.1007/978-3-319-32149-3_9
“Access-averse Framework for Computing Low-rank Matrix Approximations,” First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, October 2014.
“Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters,” IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, IEEE, May 2018.
“Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES,” 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, November 2014.
“Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime,” Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,” The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, November 2014.
“Heterogeneous Streaming,” The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
“Improving the performance of CA-GMRES on multicores with multiple GPUs,” IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators,” 2019 IEEE High Performance Extreme Computing Conference (HPEC ’19), Waltham, MA, IEEE, September 2019.
“Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation,” International Parallel and Distributed Processing Symposium (IPDPS), May 2019.
“Mixed-precision Block Gram Schmidt Orthogonalization,” 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Austin, TX, ACM, November 2015.
“Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs,” 2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
“Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs,” VECPAR 2014 (Best Paper), Eugene, OR, June 2014.
“Optimizing Krylov Subspace Solvers on Graphics Processing Units,” Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors,” 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14), New Orleans, LA, IEEE, November 2014. DOI: 10.1109/ScalA.2014.8
“Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs,” The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster,” The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“Sampling Algorithms to Update Truncated SVD,” IEEE International Conference on Big Data, December 2017.
“Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster,” The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2013.
“Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives,” Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award, Orlando, FL, June 2017.
“One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,” The International Conference on Computational Science (ICCS), June 2012.
“Autotuning Techniques for Performance-Portable Point Set Registration in 3D,” Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018. DOI: 10.14529/jsfi180404
“Communication-Avoiding Symmetric-Indefinite Factorization,” SIAM Journal on Matrix Analysis and Applications, vol. 35, issue 4, pp. 1364-1406, July 2014.
“Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations,” Scientific Programming, 2015.
“Design and Implementation of the PULSAR Programming System for Large Scale Computing,” Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017. DOI: 10.14529/jsfi170101
“Distributed-Memory Lattice H-Matrix Factorization,” The International Journal of High Performance Computing Applications, August 2019. DOI: 10.1177/1094342019861139
“Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures,” IPDPS 2013 (submitted), Boston, MA, 2013.
“Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs,” SIAM Journal on Scientific Computing, vol. 37, no. 3, pp. C203-C330, May 2015. DOI: 10.1137/14M0973773
“Non-GPU-resident Dense Symmetric Indefinite Factorization,” Concurrency and Computation: Practice and Experience, November 2016. DOI: 10.1002/cpe.4012
“Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,” Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015. DOI: 10.14529/jsfi1504
“Performance of Asynchronous Optimized Schwarz with One-sided Communication,” Parallel Computing, vol. 86, pp. 66-81, August 2019. DOI: 10.1016/j.parco.2019.05.004
“PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,” ACM Transactions on Mathematical Software (to appear), 2019.
“The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018. DOI: 10.1137/17M1117732
“Solving Dense Symmetric Indefinite Systems using GPUs,” Concurrency and Computation: Practice and Experience, vol. 29, issue 9, March 2017. DOI: 10.1002/cpe.4055
“Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU,” ACM Transactions on Mathematical Software (TOMS), vol. 43, issue 2, October 2016.
“Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems,” IEEE Embedded Systems Letters, vol. 9, issue 3, pp. 61–64, May 2017. DOI: 10.1109/LES.2017.2700401
“A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination,” Concurrency and Computation: Practice and Experience, vol. 27, issue 5, pp. 1292-1309, April 2015. DOI: 10.1002/cpe.3306
“Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 8, pp. 1879–1892, August 2018. DOI: 10.1109/TPDS.2018.2808964
“Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,” Concurrency and Computation: Practice and Experience, October 2013.
“MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines,” Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
“MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR),” Washington, DC, NSF PI Meeting, Poster, April 2018. DOI: 10.6084/m9.figshare.6174143.v3
“Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs,” Pittsburgh, Pennsylvania, SIAM Annual Meeting, July 2017.
“MAGMA: A Breakthrough in Solvers for Eigenvalue Problems,” San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
“MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures,” Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
“MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi,” Frankfurt, Germany, ISC High Performance (ISC15), Intel Booth Presentation, June 2015.
“Production Implementations of Pipelined & Communication-Avoiding Iterative Linear Solvers,” Tokyo, Japan, SIAM Conference on Parallel Processing for Scientific Computing, March 2018.
“On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013, 2012.