Publications
SLATE Working Note 13: Implementing Singular Value and Symmetric Eigenvalue Solvers,”
SLATE Working Notes, no. 13, ICL-UT-19-07: Innovative Computing Laboratory, University of Tennessee, September 2019.
(4.45 MB)
“
SLATE Working Note 12: Implementing Matrix Inversions,”
SLATE Working Notes, no. 12, ICL-UT-19-04: Innovative Computing Laboratory, University of Tennessee, June 2019.
(1.95 MB)
“
SLATE Users' Guide,”
SLATE Working Notes, no. 10, ICL-UT-19-01: Innovative Computing Laboratory, University of Tennessee, January 2019.
“SLATE Mixed Precision Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-03: University of Tennessee, April 2019.
(1.04 MB)
“
SLATE Developers' Guide,”
SLATE Working Notes, no. 11, ICL-UT-19-02: Innovative Computing Laboratory, University of Tennessee, January 2019.
“Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 1, ICL-UT-17-02: Innovative Computing Laboratory, University of Tennessee, June 2017.
(2.8 MB)
“
PLASMA 17.1 Functionality Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-10: University of Tennessee, June 2017.
(1.8 MB)
“
PLASMA 17 Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-11: University of Tennessee, June 2017.
(7.57 MB)
“
Parallel Norms Performance Report,”
SLATE Working Notes, no. 6, ICL-UT-18-06: Innovative Computing Laboratory, University of Tennessee, June 2018.
(1.13 MB)
“
Parallel BLAS Performance Report,”
SLATE Working Notes, no. 5, ICL-UT-18-01: University of Tennessee, April 2018.
(4.39 MB)
“
MAGMA-sparse Interface Design Whitepaper,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-05, September 2017.
(1.28 MB)
“
Linear Systems Performance Report,”
SLATE Working Notes, no. 8, ICL-UT-18-08: Innovative Computing Laboratory, University of Tennessee, September 2018.
(1.64 MB)
“
Least Squares Performance Report,”
SLATE Working Notes, no. 9, ICL-UT-18-10: Innovative Computing Laboratory, University of Tennessee, December 2018.
(1.76 MB)
“
Implementation of the C++ API for Batch BLAS,”
SLATE Working Notes, no. 7, ICL-UT-18-04: Innovative Computing Laboratory, University of Tennessee, June 2018.
(1.07 MB)
“
Designing SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 3, ICL-UT-17-06: Innovative Computing Laboratory, University of Tennessee, October 2017.
(2.8 MB)
“
clMAGMA: High Performance Dense Linear Algebra with OpenCL,”
University of Tennessee Technical Report (Lawn 275), no. UT-CS-13-706: University of Tennessee, March 2013.
(526.6 KB)
“
C++ API for BLAS and LAPACK,”
SLATE Working Notes, no. 2, ICL-UT-17-03: Innovative Computing Laboratory, University of Tennessee, June 2017.
(1.12 MB)
“
C++ API for Batch BLAS,”
SLATE Working Notes, no. 4, ICL-UT-17-12: University of Tennessee, December 2017.
(1.89 MB)
“
On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,”
University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013, 2012.
(358.98 KB)
“
Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
, July 2018.
(483.05 KB)

MAGMA Tutorial
, Atlanta, GA, Keeneland Workshop, February 2012.
(2.47 MB)

MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
, Frankfurt, Germany, ISC High Performance (ISC15), Intel Booth Presentation, June 2015.
(2.03 MB)

MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
(6.4 MB)

MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
(4.69 MB)

Accelerating Linear Algebra with MAGMA
, Knoxville, TN, ECP Annual Meeting 2018, Tutorial, February 2018.
(35.27 MB)

A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 5, pp. 1292-1309, April 2015.
DOI: 10.1002/cpe.3306
(783.45 KB)
“
The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,”
SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018.
DOI: 10.1137/17M1117732
“Preconditioned Krylov Solvers on GPUs,”
Parallel Computing, June 2017.
DOI: 10.1016/j.parco.2017.05.006
“PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,”
ACM Transactions on Mathematical Software, vol. 45, issue 2, June 2019.
DOI: 10.1145/3264491
(7.5 MB)
“
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,”
Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015.
DOI: 10.14529/jsfi1504
(3.68 MB)
“
A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,”
International Journal of High Performance Computing Applications, vol. 28, issue 2, pp. 196-209, May 2014.
DOI: 10.1177/1094342013502097
(1.74 MB)
“
Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs,”
IEEE Transactions on Parallel and Distributed Systems, no. 1045-9219, November 2015.
“HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
Scientific Programming, vol. 23, issue 1, January 2015.
DOI: 10.3233/SPR-140404
(553.94 KB)
“
Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling,”
Journal of Advances in Modeling Earth Systems, vol. 10, issue 8, pp. 1952–1969, August 2018.
DOI: 10.1029/2018MS001276
“Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
, no. UT-CS-11-689, December 2011.
(608.95 KB)

Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,”
ICCS 2012, Omaha, NE, June 2012.
(608.95 KB)
“
Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
DOI: 10.1109/JPROC.2018.2868961
“Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs,”
Parallel Computing, vol. 74, pp. 3–18, May 2018.
DOI: 10.1016/j.parco.2017.10.004
“Virtual Systolic Array for QR Decomposition,”
15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013), Boston, MA, IEEE, May 2013.
DOI: 10.1109/IPDPS.2013.119
(749.84 KB)
“
Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication,”
Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013.
DOI: 10.1145/2464996.2465438
(1.27 MB)
“
Search Space Generation and Pruning System for Autotuners,”
30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
(555.44 KB)
“
Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
PPAM 2013, Warsaw, Poland, September 2013.
(284.97 KB)
“
Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors,”
5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14), New Orleans, LA, IEEE, November 2014.
DOI: 10.1109/ScalA.2014.8
(407.5 KB)
“
Massively Parallel Automated Software Tuning,”
48th International Conference on Parallel Processing (ICPP 2019), Kyoto, Japan, ACM Press, August 2019.
“Least Squares Solvers for Distributed-Memory Machines with GPU Accelerators,”
ACM International Conference on Supercomputing (ICS '19), Phoenix, Arizona, ACM, June 2019.
DOI: 10.1145/3330345.3330356
(1.63 MB)
“
Heterogeneous Streaming,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(2.73 MB)
“
Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
(4.7 MB)
“
clMAGMA: High Performance Dense Linear Algebra with OpenCL ,”
International Workshop on OpenCL, Bristol University, England, May 2014.
(460.91 KB)
“
Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices,”
Parallel and Distributed Processing Symposium Workshops (IPDPSW), Orlando, FL, IEEE, June 2017.
DOI: 10.1109/IPDPSW.2017.18
“Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem,”
VECPAR 2014, Eugene, OR, June 2014.
(199.44 KB)
“