User-Defined Events for Hardware Performance Monitoring,” Procedia Computer Science, vol. 4: Elsevier, pp. 2096-2104, May 2011. DOI: 10.1016/j.procs.2011.04.229“
Acceleration of the BLAST Hydro Code on GPU,” Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.“
Algorithm-Based Fault Tolerance for Dense Matrix Factorization,” Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, ACM, pp. 225-234, February 2012. DOI: 10.1145/2145816.2145845“
On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013, 2012.“
Anatomy of a Globally Recursive Embedded LINPACK Benchmark,” 2012 IEEE High Performance Extreme Computing Conference, Waltham, MA, pp. 1-6, September 2012. DOI: 10.1109/HPEC.2012.6408679“
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,” ICCS 2012, Omaha, NE, June 2012.“
A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,” 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.“
A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines,” Proc. of the International Conference on Computational Science (ICCS), vol. 9, pp. 17-26, June 2012.“
A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction,” IPDPS 2012, Shanghai, China, May 2012.“
DAGuE: A generic distributed DAG Engine for High Performance Computing.,” Parallel Computing, vol. 38, no. 1-2: Elsevier, pp. 27-51, 00 2012.“
Dense Linear Algebra on Accelerated Multicore Hardware,” High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 00 2012.“
Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,” SIAM Journal on Scientific Computing, vol. 34(2), pp. C70-C82, April 2012.“
An efficient distributed randomized solver with application to large dense linear systems,” ICL Technical Report, no. ICL-UT-12-02, July 2012.“
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,” 26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.“
Enabling Application Resilience With and Without the MPI Standard,” 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Ottawa, Canada, May 2012.“
Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture,” The 2nd International Conference on Cloud and Green Computing (submitted), Xiangtan, Hunan, China, November 2012.“
Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction,” Lecture Notes in Computer Science, vol. 7203, pp. 661-670, September 2012.“
An Evaluation of User-Level Failure Mitigation Support in MPI,” Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.“
Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,” University of Tennessee Computer Science Technical Report, no. ut-cs-12-702, 00 2012.“
From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,” Parallel Computing, vol. 38, no. 8, pp. 391-407, August 2012.“
From Serial Loops to Parallel Execution on Distributed Systems,” PPoPP 2012 (submitted), New Orleans, LA, February 2012.“
The Future of Computing: Software Libraries , Savannah, GA, DOD CREATE Developers' Review, Keynote Presentation, February 2012.
GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement,” EuroPar 2012 (also LAWN 260), Rhodes Island, Greece, August 2012.“
Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,” IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, IEEE Computer Society Press, May 2012.“
HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,” IPDPS 2012 (Best Paper), Shanghai, China, May 2012.“
High Performance Computing Systems: Status and Outlook,” Acta Numerica, vol. 21, Cambridge, UK, Cambridge University Press, pp. 379-474, May 2012.“
High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors,” ICCS 2012, Omaha, NE, June 2012.“
How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE,” University of Tennessee Computer Science Technical Report (also LAWN 270), no. UT-CS-12-698, July 2012.“
HPC Challenge: Design, History, and Implementation Highlights,” On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear): Chapman & Hall/CRC Press, 00 2012.“
An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs,” Applied Parallel and Scientific Computing, vol. 7133, pp. 248-257, 00 2012.“
Looking Back at Dense Linear Algebra Software,” Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear), 00 2012.“
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems , San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures , Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors , Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
MAGMA Tutorial , Atlanta, GA, Keeneland Workshop, February 2012.
Matrices Over Runtime Systems at Exascale,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.“
Measuring Energy and Power with PAPI,” International Workshop on Power-Aware Systems and Architectures, Pittsburgh, PA, September 2012. DOI: 10.1109/ICPPW.2012.39“
A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.“
One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,” The International Conference on Computational Science (ICCS), June 2012.“
Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators,” VECPAR 2012, Kobe, Japan, July 2012.“
PAPI-V: Performance Monitoring for Virtual Machines,” CloudTech-HPC 2012, Pittsburgh, PA, September 2012. DOI: 10.1109/ICPPW.2012.29“
“Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011,” Lecture Notes in Computer Science, vol. 7203, Torun, Poland, 00 2012.
A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures,” IPDPS 2012, Shanghai, China, May 2012.“
Performance Counter Monitoring for the Blue Gene/Q Architecture,” University of Tennessee Computer Science Technical Report, no. ICL-UT-12-01, 00 2012.“
Performance evaluation of LU factorization through hardware counter measurements,” University of Tennessee Computer Science Technical Report, no. ut-cs-12-700, October 2012.“
Power Aware Computing on GPUs,” SAAHPC '12 (Best Paper Award), Argonne, IL, July 2012.“
Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems,” Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September 2012.“
Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,” LAWN 267, 00 2012.“
Programming the LU Factorization for a Multicore System with Accelerators,” Proceedings of VECPAR’12, Kobe, Japan, April 2012.“