PMIx: Process Management for Exascale Environments,” Parallel Computing, vol. 79, pp. 9–29, January 2018. DOI: 10.1016/j.parco.2018.08.002“
Production Implementations of Pipelined & Communication-Avoiding Iterative Linear Solvers , Tokyo, Japan, SIAM Conference on Parallel Processing for Scientific Computing, March 2018.
Scheduling for Fault-Tolerance: An Introduction,” Topics in Parallel and Distributed Computing: Springer International Publishing, pp. 143–170, 2018. DOI: 10.1007/978-3-319-93109-8“
The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018. DOI: 10.1137/17M1117732“
Software-Defined Events (SDEs) in MAGMA-Sparse,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-12: University of Tennessee, December 2018.“
Software-Defined Events through PAPI for In-Depth Analysis of Application Performance , Basel, Switzerland, 5th Platform for Advanced Scientific Computing Conference (PASC18), July 2018.
Solver Interface & Performance on Cori,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-05: University of Tennessee, June 2018.“
Solving Linear Diophantine Systems on Parallel Architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, issue 5, pp. 1158-1169, May 2019, 2018. DOI: http://dx.doi.org/10.1109/TPDS.2018.2873354“
A Survey of MPI Usage in the US Exascale Computing Project,” Concurrency Computation: Practice and Experience, September 2018. DOI: 10.1002/cpe.4851“
Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 8, pp. 1879–1892, August 2018. DOI: 10.1109/TPDS.2018.2808964“
Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP,” International Journal of Computational Science and Engineering (IJCSE), vol. 17, no. 3, October 2018. DOI: http://dx.doi.org/10.1504/IJCSE.2018.095851“
Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-13: University of Tennessee, December 2018.“
Tensor Contractions using Optimized Batch GEMM Routines , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption,” ISC High Performance (ISC'18), Best Poster, Frankfurt, Germany, June 2018.“
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption , Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning,” Journal of Parallel and Distributed Computing, vol. 119, pp. 219–230, November 2018. DOI: 10.1016/j.jpdc.2018.04.017“
Variable-Size Batched Condition Number Calculation on GPUs,” SBAC-PAD, Lyon, France, September 2018.“
Variable-Size Batched Gauss–Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,” Parallel Computing, January 2018. DOI: 10.1016/j.parco.2017.12.006“
Accelerating NWChem Coupled Cluster through Dataflow-Based Execution,” The International Journal of High Performance Computing Applications, pp. 1–13, January 2017. DOI: 10.1177/1094342016672543“
Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched , Atlanta, GA, SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation, March 2017.
Argobots: A Lightweight Low-Level Threading and Tasking Framework,” IEEE Transactions on Parallel and Distributed Systems, October 2017. DOI: 10.1109/TPDS.2017.2766062“
Assuming failure independence: are we right to be wrong?,” The 3rd International Workshop on Fault Tolerant Systems (FTS), Honolulu, Hawaii, IEEE, September 2017.“
Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices,” Parallel and Distributed Processing Symposium Workshops (IPDPSW), Orlando, FL, IEEE, June 2017. DOI: 10.1109/IPDPSW.2017.18“
Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs,” Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, New York, NY, USA, ACM, pp. 1–10, February 2017. DOI: 10.1145/3026937.3026940“
“BDEC Pathways to Convergence: Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry,” Innovative Computing Laboratory Technical Report, no. ICL-UT-17-08: University of Tennessee, November 2017.
Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation,” IEEE International Parallel and Distributed Processing Symposium (IPDPS), Orlando, FL, IEEE, May 2017. DOI: 10.1109/IPDPS.2017.46“
Bringing High Performance Computing to Big Data Algorithms,” Handbook of Big Data Technologies: Springer, 2017. DOI: 10.1007/978-3-319-49340-4“
C++ API for Batch BLAS,” SLATE Working Notes, no. 4, ICL-UT-17-12: University of Tennessee, December 2017.“
C++ API for BLAS and LAPACK,” SLATE Working Notes, no. 2, ICL-UT-17-03: Innovative Computing Laboratory, University of Tennessee, June 2017.“
The Case for Directive Programming for Accelerator Autotuner Optimization,” Innovative Computing Laboratory Technical Report, no. ICL-UT-17-07: University of Tennessee, October 2017.“
Checkpointing Workflows for Fail-Stop Errors,” IEEE Cluster, Honolulu, Hawaii, IEEE, September 2017.“
Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs , Pittsburgh, Pennsylvania, SIAM Annual Meeting, July 2017.
Co-Scheduling Algorithms for Cache-Partitioned Systems,” 19th Workshop on Advances in Parallel and Distributed Computational Models, Orlando, FL, IEEE Computer Society Press, May 2017. DOI: 10.1109/IPDPSW.2017.60“
Dataflow Programming Paradigms for Computational Chemistry Methods,” Innovative Computing Laboratory Technical Report, no. ICL-UT-17-01, Knoxville, TN, University of Tennessee, May 2017.“
Design and Implementation of the PULSAR Programming System for Large Scale Computing,” Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017. DOI: 10.14529/jsfi170101“
The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems,” International Conference on Computational Science (ICCS 2017), Zürich, Switzerland, Elsevier, June 2017. DOI: DOI:10.1016/j.procs.2017.05.138“
Designing SLATE: Software for Linear Algebra Targeting Exascale,” SLATE Working Notes, no. 3, ICL-UT-17-06: Innovative Computing Laboratory, University of Tennessee, October 2017.“
Dynamic Task Discovery in PaRSEC- A data-flow task-based Runtime,” ScalA17, Denver, ACM, September 2017. DOI: 10.1145/3148226.3148233“
Efficient Communications in Training Large Scale Neural Networks,” ACM MultiMedia Workshop 2017, Mountain View, CA, ACM, October 2017.“
Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures,” Procedia Computer Science, vol. 108, pp. 606–615, June 2017. DOI: 10.1016/j.procs.2017.05.250“
Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA,” Journal of Computational Science, vol. 20, pp. 85–93, May 2017. DOI: 10.1016/j.jocs.2016.12.009“
Flexible Batched Sparse Matrix Vector Product on GPUs , Denver, Colorado, ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 2017.
Flexible Batched Sparse Matrix-Vector Product on GPUs,” Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), Denver, Colorado, ACM Press, November 2017. DOI: 10.1145/3148226.3148230“
A Framework for Out of Memory SVD Algorithms,” ISC High Performance 2017, pp. 158–178, June 2017. DOI: 10.1007/978-3-319-58667-0_9“
High-performance Cholesky Factorization for GPU-only Execution,” Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017.“
Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,” 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017. DOI: 10.1145/3086157.3086162“
Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives,” Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award, Orlando, FL, June 2017.“
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers,” ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, ACM.“
LAWN 294: Aasen's Symmetric Indenite Linear Solvers in LAPACK,” LAPACK Working Note, no. LAWN 294, ICL-UT-17-13: University of Tennessee, December 2017.“
A Look Back on 30 Years of the Gordon Bell Prize,” International Journal of High Performance Computing and Networking, vol. 31, issue 6, pp. 469–484, 2017.“