"HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi", Scientific Programming, vol. 23, issue 1, 01/2015.
"Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem", VECPAR 2014, Eugene, OR, 06/2014.
"Accelerating Numerical Dense Linear Algebra Calculations with GPUs", Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014.
"Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-731: University of Tennessee, 10/2014.
"Access-averse Framework for Computing Low-rank Matrix Approximations", First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, 10/2014.
"Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting", Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408-1431, 05/2014.
"Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy", ACM Transactions on Parallel Computing (to appear), 2014.
"Analyzing PAPI Performance on Virtual Machines", VMWare Technical Journal, vol. Winter 2013, 01/2014.
"Assessing the Impact of ABFT and Checkpoint Composite Strategies", 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"clMAGMA: High Performance Dense Linear Algebra with OpenCL ", International Workshop on OpenCL, Bristol University, England, 05/2014.
"Communication-Avoiding Symmetric-Indefinite Factorization", SIAM Journal on Matrix Analysis and Application, vol. 35, issue 4, pp. 1364-1406, 07/2014.
"Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems", International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS), Waterloo, Ontario, CA, 08/2014.
"Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, 11/2014.
"Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime", Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Design for a Soft Error Resilient Dynamic Task-based Runtime", ICL Technical Report, no. ICL-UT-14-04: University of Tennessee, 11/2014.
"Designing LU-QR Hybrid Solvers for Performance and Stability", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster", The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, 11/2014.
"Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, 05/2014.
"Efficient checkpoint/verification patterns for silent error detection", Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, 05/2014.
"An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems", Parallel Computing, vol. 40, issue 7, pp. 213-223, 07/2014.
"A Fast Batched Cholesky Factorization on a GPU", International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, 09/2014.
"Heterogeneous Acceleration for Linear Algebra in Mulit-Coprocessor Environments", VECPAR 2014, Eugene, OR, 06/2014.
"Hybrid Multi-Elimination ILU Preconditioners on GPUs", International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs", University of Tennessee Computer Science Technical Report, no. UT-EECS-14-727: University of Tennessee, 04/2014.
"Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems", Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences, vol. 372, issue 2018, 07/2014.
"Improving the performance of CA-GMRES on multicores with multiple GPUs", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Looking Back at Dense Linear Algebra Software", Journal of Parallel and Distributed Computing, vol. 74, issue 7, pp. 2548–2560, 07/2014.
"MIAMI: A Framework for Application Performance Diagnosis ", IPASS-2014, Monterey, CA, IEEE, 03/2014.
"Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs", VECPAR 2014 (Best Paper), Eugene, OR, 06/2014.
"Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems", Supercomputing Frontiers and Innovations, vol. 1, issue 1, 2014.
"New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem", Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper), Phoenix, AZ, IEEE, 05/2014.
"A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks", International Journal of High Performance Computing Applications, vol. 28, issue 2, pp. 196-209, 05/2014.
"Optimizing Krylov Subspace Solvers on Graphics Processing Units", Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI", ICL Technical Report, no. ICL-UT-14-01: University of Tennessee, 02/2014.
"Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors", 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14), New Orleans, LA, IEEE, 11/2014.
"Performance and Reliability Trade-offs for the Double Checkpointing Algorithm", International Journal of Networking and Computing, vol. 4, no. 1, pp. 32-41, 2014.
"Performance of Various Computers Using Standard Linear Equations Software, (Linpack Benchmark Report)", University of Tennessee Computer Science Technical Report, no. CS-89-85: University of Tennessee, 06/2014.
"Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models", Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, IEEE Cluster 2014, no. ICL-UT-14-04, Madrid, Spain, IEEE, 09/2014.
"PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime", University of Tennessee EECS Technical Report, no. UT-EECS-14-733: University of Tennessee, 11/2014.
"Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores", International conference on Supercomputing, Munich, Germany, ACM, pp. 333-342, 06/2014.
"Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures", VECPAR 2014, Eugene, OR, 06/2014.
"A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination", Concurrency and Computation: Practice and Experience, 06/2014.
"Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes", 23rd International Heterogeneity in Computing Workshop, IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment", IPDPS 2014, Phoenix, AZ, IEEE, 05/2014.
"Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors", Concurrency and Computation: Practice and Experience, 09/2014.
"Utilizing Dataflow-based Execution for Coupled Cluster Methods", IEEE Cluster 2014, no. ICL-UT-14-02, Madrid, Spain, IEEE, 09/2014.
"Analyzing PAPI Performance on Virtual Machines", ICL Technical Report, no. ICL-UT-13-02, 08/2013.
"Assessing the impact of ABFT and Checkpoint composite strategies", University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.
"BlackjackBench: Portable Hardware Characterization with Automated Results Analysis", The Computer Journal, 03/2013.