"Batched Matrix Computations on Hardware Accelerators", EuroMPI/Asia 2015 Workshop, Bordeaux, France, 09/2015.
"Batched matrix computations on hardware accelerators based on GPUs", International Journal of High Performance Computing Applications, 02/2015.
"Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing", International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, 01/2015.
"Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations", Scientific Programming, 2015.
" A Data Flow Divide and Conquer Algorithm for Multicore Architecture", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors", ISC High Performance 2015, Frankfurt, Germany, 07/2015.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"Energy efficiency and performance frontiers for sparse computations on GPU supercomputers", Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15), San Francisco, CA, ACM, 02/2015.
"Hierarchical DAG scheduling for Hybrid Distributed Systems", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi", Scientific Programming, vol. 23, issue 1, 01/2015.
"Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs", SIAM Journal on Scientific Computing, 2015.
"Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures", The Spring Simulation Multi-Conference 2015 (SpringSim'15), Alexandria, VA, 04/2015.
"Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform", International Conference on Computational Science (ICCS 2015), Reykjavík, Iceland, 06/2015.
"Towards Batched Linear Solvers on Accelerated Hardware Platforms", 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015, San Francisco, CA, ACM, 02/2015.
"Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem", VECPAR 2014, Eugene, OR, 06/2014.
"Accelerating Numerical Dense Linear Algebra Calculations with GPUs", Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014.
"Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-731: University of Tennessee, 10/2014.
"Access-averse Framework for Computing Low-rank Matrix Approximations", First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, 10/2014.
"Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting", Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408-1431, 05/2014.
"Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy", ACM Transactions on Parallel Computing (to appear), 2014.
"Analyzing PAPI Performance on Virtual Machines", VMWare Technical Journal, vol. Winter 2013, 01/2014.
"Assembly Operations for Multicore Architectures using Task-Based Runtime Systems", Euro-Par 2014, Porto, Portugal, Springer International Publishing, 08/2014.
"Assessing the Impact of ABFT and Checkpoint Composite Strategies", 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"clMAGMA: High Performance Dense Linear Algebra with OpenCL ", International Workshop on OpenCL, Bristol University, England, 05/2014.
"Communication-Avoiding Symmetric-Indefinite Factorization", SIAM Journal on Matrix Analysis and Application, vol. 35, issue 4, pp. 1364-1406, 07/2014.
"Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems", International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS), Waterloo, Ontario, CA, 08/2014.
"Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, 11/2014.
"Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime", Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", ICL Technical Report, no. ICL-UT-14-04: University of Tennessee, 11/2014.
"Designing LU-QR Hybrid Solvers for Performance and Stability", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster", The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, 11/2014.
"Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, 05/2014.
"Efficient checkpoint/verification patterns for silent error detection", Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, 05/2014.
"An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems", Parallel Computing, vol. 40, issue 7, pp. 213-223, 07/2014.
"A Fast Batched Cholesky Factorization on a GPU", International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, 09/2014.
"Heterogeneous Acceleration for Linear Algebra in Mulit-Coprocessor Environments", VECPAR 2014, Eugene, OR, 06/2014.
"Hybrid Multi-Elimination ILU Preconditioners on GPUs", International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-727: University of Tennessee, 04/2014.
"Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems", Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences, vol. 372, issue 2018, 07/2014.
"Improving the performance of CA-GMRES on multicores with multiple GPUs", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Looking Back at Dense Linear Algebra Software", Journal of Parallel and Distributed Computing, vol. 74, issue 7, pp. 2548–2560, 07/2014.
"LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU", 16th IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, 08/2014.
"MIAMI: A Framework for Application Performance Diagnosis ", IPASS-2014, Monterey, CA, IEEE, 03/2014.
"Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs", VECPAR 2014 (Best Paper), Eugene, OR, 06/2014.
"Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems", Supercomputing Frontiers and Innovations, vol. 1, issue 1, 2014.
"New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem", Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper), Phoenix, AZ, IEEE, 05/2014.
"A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks", International Journal of High Performance Computing Applications, vol. 28, issue 2, pp. 196-209, 05/2014.
"Optimizing Krylov Subspace Solvers on Graphics Processing Units", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI", ICL Technical Report, no. ICL-UT-14-01: University of Tennessee, 02/2014.
"Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14), New Orleans, LA, IEEE, 11/2014.