"Batched Matrix Computations on Hardware Accelerators", EuroMPI/Asia 2015 Workshop, Bordeaux, France, 09/2015.
"Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing", International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, 01/2015.
"Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations", Scientific Programming, 2015.
"A Data Flow Divide and Conquer Algorithm for Multicore Architecture", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"Hierarchical DAG scheduling for Hybrid Distributed Systems", 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, 05/2015.
"HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi", Scientific Programming, vol. 23, issue 1, 01/2015.
"Towards Batched Linear Solvers on Accelerated Hardware Platforms", 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015, San Francisco, CA, ACM, 02/2015.
"Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem", VECPAR 2014, Eugene, OR, 06/2014.
"Accelerating Numerical Dense Linear Algebra Calculations with GPUs", Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014.
"Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-731: University of Tennessee, 10/2014.
"Access-averse Framework for Computing Low-rank Matrix Approximations", First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, 10/2014.
"Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting", Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408-1431, 05/2014.
"Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy", ACM Transactions on Parallel Computing (to appear), 2014.
"Analyzing PAPI Performance on Virtual Machines", VMware Technical Journal, vol. Winter 2013, 01/2014.
"Assembly Operations for Multicore Architectures using Task-Based Runtime Systems", Euro-Par 2014, Porto, Portugal, Springer International Publishing, 08/2014.
"Assessing the Impact of ABFT and Checkpoint Composite Strategies", 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"clMAGMA: High Performance Dense Linear Algebra with OpenCL", International Workshop on OpenCL, Bristol University, England, 05/2014.
"Communication-Avoiding Symmetric-Indefinite Factorization", SIAM Journal on Matrix Analysis and Applications, vol. 35, issue 4, pp. 1364-1406, 07/2014.
"Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems", International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS), Waterloo, Ontario, CA, 08/2014.
"Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, 11/2014.
"Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime", Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", ICL Technical Report, no. ICL-UT-14-04: University of Tennessee, 11/2014.
"Designing LU-QR Hybrid Solvers for Performance and Stability", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster", The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, 11/2014.
"Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, 05/2014.
"Efficient checkpoint/verification patterns for silent error detection", Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, 05/2014.
"An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems", Parallel Computing, vol. 40, issue 7, pp. 213-223, 07/2014.
"A Fast Batched Cholesky Factorization on a GPU", International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, 09/2014.
"Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments", VECPAR 2014, Eugene, OR, 06/2014.
"Hybrid Multi-Elimination ILU Preconditioners on GPUs", International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-727: University of Tennessee, 04/2014.
"Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 372, issue 2018, 07/2014.
"Improving the performance of CA-GMRES on multicores with multiple GPUs", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Looking Back at Dense Linear Algebra Software", Journal of Parallel and Distributed Computing, vol. 74, issue 7, pp. 2548-2560, 07/2014.
"MIAMI: A Framework for Application Performance Diagnosis", ISPASS-2014, Monterey, CA, IEEE, 03/2014.
"Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs", VECPAR 2014 (Best Paper), Eugene, OR, 06/2014.
"Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems", Supercomputing Frontiers and Innovations, vol. 1, issue 1, 2014.
"New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem", Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper), Phoenix, AZ, IEEE, 05/2014.
"A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks", International Journal of High Performance Computing Applications, vol. 28, issue 2, pp. 196-209, 05/2014.
"Optimizing Krylov Subspace Solvers on Graphics Processing Units", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI", ICL Technical Report, no. ICL-UT-14-01: University of Tennessee, 02/2014.
"Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14), New Orleans, LA, IEEE, 11/2014.
"Performance and Reliability Trade-offs for the Double Checkpointing Algorithm", International Journal of Networking and Computing, vol. 4, no. 1, pp. 32-41, 2014.
"Performance of Various Computers Using Standard Linear Equations Software, (Linpack Benchmark Report)", University of Tennessee Computer Science Technical Report, no. CS-89-85: University of Tennessee, 06/2014.
"Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models", Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, IEEE Cluster 2014, no. ICL-UT-14-04, Madrid, Spain, IEEE, 09/2014.
"PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime", University of Tennessee EECS Technical Report, no. UT-EECS-14-733: University of Tennessee, 11/2014.
"Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores", International Conference on Supercomputing, Munich, Germany, ACM, pp. 333-342, 06/2014.
"Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures", VECPAR 2014, Eugene, OR, 06/2014.
"A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.