Dense Linear Algebra Solvers for Multicore with GPU Accelerators , Atlanta, GA, International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010.
MAGMA - LAPACK for GPUs , Atlanta, GA, Keeneland GPU Tutorial, April 2011.
Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures,” International Journal of Computational Science and Engineering (to appear), January 2005.“
Performance Evaluation for Petascale Quantum Simulation Tools,” Proceedings of the Cray Users' Group Meeting, Atlanta, GA, May 2010.“
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems , San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing,” Parallel Computing, vol. 36, no. 12, pp. 645-654, 00 2010.“
Evaluation and Design of FFT for Distributed Accelerated Systems,” ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.“
Dense Linear Algebra for Hybrid GPU-based Systems,” Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.“
MAGMA - LAPACK for HPC on Heterogeneous Architectures , Oak Ridge, TN, Titan Summit at Oak Ridge National Laboratory, Presentation, August 2011.
FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.“
Accelerating Linear Algebra with MAGMA , Knoxville, TN, ECP Annual Meeting 2018, Tutorial, February 2018.
Dense Linear Algebra Solvers for Multicore with GPU Accelerators,” Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, Atlanta, GA, pp. 1-8, 2010. DOI: 10.1109/IPDPSW.2010.5470941“
Using MAGMA with PGI Fortran,” PGI Insider, November 2010.“
Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.“
Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures,” Proceedings of 5th International Conference on Computational Science (ICCS), Atlanta, GA, USA, Springer's Lecture Notes in Computer Science, pp. 317-325, January 2005.“
Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers : 2010 Symposium on Application Accelerators in. High-Performance Computing (SAAHPC'10), Tutorial, July 2010.
MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs , San Jose, CA, GPU Technology Conference (GTC17), Presentation in Session S7728, May 2017.
Matrix Algebra on GPU and Multicore Architectures , Basel, Switzerland, Workshop on GPU-enabled Numerical Libraries, Presentation, May 2011.
Performance evaluation for petascale quantum simulation tools,” Proceedings of CUG09, Atlanta, GA, May 2009.“
Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures,” International Journal of Computational Science and Engineering, vol. 2, no. 3/4, pp. 205-212, 00 2006.“
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems,” University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210), January 2008.“
CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps : Zenodo, October 2019. DOI: 10.5281/zenodo.3477618
Linear Algebra Software for High-Performance Computing (Part 2: Software for Hardware Accelerators and Coprocessors) , Frankfurt, Germany, ISC High Performance (ISC18), Tutorial Presentation, June 2015.
Optimizing Krylov Subspace Solvers on Graphics Processing Units,” Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.“
Parallelizing the Divide and Conquer Algorithm for the Symmetric Tridiagonal Eigenvalue Problem on Distributed Memory Architectures,” SIAM Journal on Scientific Computing, vol. 6, no. 20, pp. 2223-2236, October 2002.“
Collecting Performance Data with PAPI-C,” Tools for High Performance Computing 2009, 3rd Parallel Tools Workshop, Dresden, Germany, Springer Berlin / Heidelberg, pp. 157-173, May 2010. DOI: 10.1007/978-3-642-11261-4_11“
Technical Comparison between several representative checkpoint/rollback solutions for MPI programs,” ICL Technical Report, no. ICL-UT-06-09, January 2006.“
From MPI to OpenSHMEM: Porting LAMMPS,” OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, Annapolis, MD, USA, Springer International Publishing, pp. 121–137, 2015. DOI: 10.1007/978-3-319-26428-8_8“
Proposal of MPI operation level Checkpoint/Rollback and one implementation,” Proceedings of IEEE CCGrid 2006: IEEE Computer Society, January 2006.“
Results of the PERI survey of SciDAC applications,” Journal of Physics: Conference Series, SciDAC 2007, vol. 78, no. 2007, January 2007.“
Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team,” SciDAC 2009, Journal of Physics: Conference Series, vol. 180(2009)012039, San Diego, California, IOP Publishing, July 2009.“
Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling,” Journal of Advances in Modeling Earth Systems, vol. 10, issue 8, pp. 1952–1969, August 2018. DOI: 10.1029/2018MS001276“
The TOP500 List and Progress in High-Performance Computing,” IEEE Computer, vol. 48, issue 11, pp. 42-49, November 2015. DOI: doi:10.1109/MC.2015.338“
The Marketplace for High-Performance Computers,” Parallel Computing, vol. 25, no. 13-14, pp. 1517-1545, October 2002.“
Overview of High Performance Computers,” Handbook of Massive Data Sets: Kluwer Academic Publishers, pp. 791-852, January 2001.“
Three-dimensional parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver.,” To appear in Geophysical Prospecting journal., 00 2011.“
Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,” University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250), June 2011.“
CUBE User Manual,” ICL Technical Report, no. ICL-UT-04-01, February 2004.“
L2 Cache Modeling for Scientific Applications on Chip Multi-Processors,” Proceedings of the 2007 International Conference on Parallel Processing, Xi'an, China, IEEE Computer Society, January 2007.“
An Algebra for Cross-Experiment Performance Analysis,” 2004 International Conference on Parallel Processing (ICCP-04), Montreal, Quebec, Canada, August 2004.“
A Scalable Framework for Heterogeneous GPU-Based Clusters,” The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), Pittsburgh, PA, USA, ACM, June 2012.“
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,” 26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.“
Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems,” IEEE Cluster 2009, New Orleans, August 2009.“
Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems,” International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09), Portland, OR, November 2009.“
Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors,” University of Tennessee Computer Science Technical Report, no. UT-CS-06-583, January 2006.“
Experiments with Strassen's Algorithm: From Sequential to Parallel,” 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted), Dallas, Texas, January 2006.“
Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores,” International conference on Supercomputing, Munich, Germany, ACM, pp. 333-342, June 2014. DOI: 10.1145/2597652.2597670“
Analytical Modeling for Affinity-Based Thread Scheduling on Multicore Platforms,” University of Tennessee Computer Science Technical Report, UT-CS-08-626, January 2008.“
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems,” University of Tennessee Computer Science Technical Report, vol. –10-653, April 2010.“
A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling,” The International Conference on Computational Science 2009 (ICCS 2009), vol. 5544, Baton Rouge, LA, pp. 195-204, May 2009.“