"Acceleration of GPU-based Krylov solvers via Data Transfer Reduction", International Journal of High Performance Computing Applications, 2015.
"Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy", ACM Transactions on Parallel Computing, vol. 1, issue 2, no. 10, pp. 10:1-10:28, 02/2015.
"Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs", International Supercomputing Conference (ISC 2015), Frankfurt, Germany, 07/2015.
"Batched Matrix Computations on Hardware Accelerators", EuroMPI/Asia 2015 Workshop, Bordeaux, France, 09/2015.
"Batched matrix computations on hardware accelerators based on GPUs", International Journal of High Performance Computing Applications, 02/2015.
"Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing", International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, 01/2015.
"Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations", Scientific Programming, 2015.
" A Data Flow Divide and Conquer Algorithm for Multicore Architecture", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors", ISC High Performance 2015, Frankfurt, Germany, 07/2015.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"Energy efficiency and performance frontiers for sparse computations on GPU supercomputers", Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15), San Francisco, CA, ACM, 02/2015.
"Experiences in autotuning matrix multiplication for energy minimization on GPUs", Concurrency in Computation: Practice and Experience, 05/2015.
"Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations", ISC High Performance, Frankfurt, Germany, Springer, 07/2015.
"Hierarchical DAG scheduling for Hybrid Distributed Systems", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi", Scientific Programming, vol. 23, issue 1, 01/2015.
"Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs", SIAM Journal on Scientific Computing, vol. 37, no. 3, pp. C203-C330, 05/2015.
"Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures", The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award, Alexandria, VA, 04/2015.
"Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform", International Conference on Computational Science (ICCS 2015), Reykjavík, Iceland, 06/2015.
"Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof", Innovative Computing Laboratory Technical Report, no. ICL-UT-15-01, 04/2015.
"Scheduling for fault-tolerance: an introduction", Innovative Computing Laboratory Technical Report, no. ICL-UT-15-02: University of Tennessee, 01/2015.
"Towards Batched Linear Solvers on Accelerated Hardware Platforms", 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015, San Francisco, CA, ACM, 02/2015.
"Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem", VECPAR 2014, Eugene, OR, 06/2014.
"Accelerating Numerical Dense Linear Algebra Calculations with GPUs", Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014.
"Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-731: University of Tennessee, 10/2014.
"Access-averse Framework for Computing Low-rank Matrix Approximations", First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, 10/2014.
"Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting", Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408-1431, 05/2014.
"Analyzing PAPI Performance on Virtual Machines", VMWare Technical Journal, vol. Winter 2013, 01/2014.
"Assembly Operations for Multicore Architectures using Task-Based Runtime Systems", Euro-Par 2014, Porto, Portugal, Springer International Publishing, 08/2014.
"Assessing the Impact of ABFT and Checkpoint Composite Strategies", 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"clMAGMA: High Performance Dense Linear Algebra with OpenCL ", International Workshop on OpenCL, Bristol University, England, 05/2014.
"Communication-Avoiding Symmetric-Indefinite Factorization", SIAM Journal on Matrix Analysis and Application, vol. 35, issue 4, pp. 1364-1406, 07/2014.
"Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems", International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS), Waterloo, Ontario, CA, 08/2014.
"Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, 11/2014.
"Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime", Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", ICL Technical Report, no. ICL-UT-14-04: University of Tennessee, 11/2014.
"Designing LU-QR Hybrid Solvers for Performance and Stability", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster", The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, 11/2014.
"Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, 05/2014.
"Efficient checkpoint/verification patterns for silent error detection", Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, 05/2014.
"An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems", Parallel Computing, vol. 40, issue 7, pp. 213-223, 07/2014.
"A Fast Batched Cholesky Factorization on a GPU", International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, 09/2014.
"Heterogeneous Acceleration for Linear Algebra in Mulit-Coprocessor Environments", VECPAR 2014, Eugene, OR, 06/2014.
"Hybrid Multi-Elimination ILU Preconditioners on GPUs", International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-727: University of Tennessee, 04/2014.
"Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems", Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences, vol. 372, issue 2018, 07/2014.
"Improving the performance of CA-GMRES on multicores with multiple GPUs", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Looking Back at Dense Linear Algebra Software", Journal of Parallel and Distributed Computing, vol. 74, issue 7, pp. 2548–2560, 07/2014.
"LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU", 16th IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, 08/2014.
"MIAMI: A Framework for Application Performance Diagnosis ", IPASS-2014, Monterey, CA, IEEE, 03/2014.
"Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs", VECPAR 2014 (Best Paper), Eugene, OR, 06/2014.