An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs,” Applied Parallel and Scientific Computing, vol. 7133, pp. 248-257, 00 2012.“
Looking Back at Dense Linear Algebra Software,” Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear), 00 2012.“
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems , San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures , Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors , Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
MAGMA Tutorial , Atlanta, GA, Keeneland Workshop, February 2012.
Matrices Over Runtime Systems at Exascale,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.“
Measuring Energy and Power with PAPI,” International Workshop on Power-Aware Systems and Architectures, Pittsburgh, PA, September 2012. DOI: 10.1109/ICPPW.2012.39“
A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.“
One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,” The International Conference on Computational Science (ICCS), June 2012.“
Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators,” VECPAR 2012, Kobe, Japan, July 2012.“
PAPI-V: Performance Monitoring for Virtual Machines,” CloudTech-HPC 2012, Pittsburgh, PA, September 2012. DOI: 10.1109/ICPPW.2012.29“
“Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011,” Lecture Notes in Computer Science, vol. 7203, Torun, Poland, 00 2012.
A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures,” IPDPS 2012, Shanghai, China, May 2012.“
Performance Counter Monitoring for the Blue Gene/Q Architecture,” University of Tennessee Computer Science Technical Report, no. ICL-UT-12-01, 00 2012.“
Performance evaluation of LU factorization through hardware counter measurements,” University of Tennessee Computer Science Technical Report, no. ut-cs-12-700, October 2012.“
Power Aware Computing on GPUs,” SAAHPC '12 (Best Paper Award), Argonne, IL, July 2012.“
Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems,” Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September 2012.“
Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,” LAWN 267, 00 2012.“
Programming the LU Factorization for a Multicore System with Accelerators,” Proceedings of VECPAR’12, Kobe, Japan, April 2012.“
A Proposal for User-Level Failure Mitigation in the MPI-3 Standard,” University of Tennessee Electrical Engineering and Computer Science Technical Report, no. ut-cs-12-693: University of Tennessee, February 2012.“
Providing GPU Capability to LU and QR within the ScaLAPACK Framework,” University of Tennessee Computer Science Technical Report (also LAWN 272), no. UT-CS-12-699, September 2012.“
“Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012,” Lecture Notes in Computer Science, vol. 7490, Vienna, Austria, 00 2012.
Reducing the Amount of Pivoting in Symmetric Indefinite Systems,” Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011), vol. 7203: Springer-Verlag Berlin Heidelberg, pp. 133-142, 00 2012.“
A Scalable Framework for Heterogeneous GPU-Based Clusters,” The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), Pittsburgh, PA, USA, ACM, June 2012.“
Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices,” SIAM Journal on Scientific Computing (Accepted), July 2012.“
Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,” University of Tennessee Computer Science Technical Report (also LAWN 269), no. UT-CS-12-697, June 2012.“
User Level Failure Mitigation in MPI,” Euro-Par 2012: Parallel Processing Workshops, vol. 7640, Rhodes Island, Greece, Springer Berlin Heidelberg, pp. 499-504, August 2012.“
Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems,” Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper), Rhodes Island, Greece, August 2012.“
Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems,” SIAM Journal on Computing (submitted), March 2012.“
3-D parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver,” 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May, 00 2011.“
Accelerating Linear System Solutions Using Randomization Techniques,” INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11), Waterloo, Ontario, Canada, July 2011.“
Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization,” University of Tennessee Computer Science Technical Report (also as a LAWN), no. ICL-UT-11-08, September 2011.“
Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method.,” The Twentieth International Conference on Domain Decomposition Methods, La Jolla, California, February 2011.“
Algorithm-based Fault Tolerance for Dense Matrix Factorizations,” University of Tennessee Computer Science Technical Report, no. UT-CS-11-676, Knoxville, TN, August 2011.“
Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures,” University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243), 00 2011.“
Autotuned Parallel I/O for Highly Scalable Biosequence Analysis,” TeraGrid'11, Salt Lake City, Utah, July 2011.“
Autotuning GEMMs for Fermi,” University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245), April 2011.“
BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results,” IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.“
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems , no. UT-CS-11-689, December 2011.
A Block-Asynchronous Relaxation Method for Graphics Processing Units,” University of Tennessee Computer Science Technical Report, no. UT-CS-11-687 / LAWN 258, November 2011.“
Changes in Dense Linear Algebra Kernels - Decades Long Perspective,” in Solving the Schrodinger Equation: Has everything been tried? (to appear): Imperial College Press, 00 2011.“
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures,” Symposium for Application Accelerators in High Performance Computing (SAAHPC'11), Knoxville, TN, July 2011.“
Correlated Set Coordination in Fault Tolerant Message Logging Protocols,” Proceedings of 17th International Conference, Euro-Par 2011, Part II, vol. 6853, Bordeaux, France, Springer, pp. 51-64, August 2011.“
DAGuE: A Generic Distributed DAG Engine for High Performance Computing,” Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1151-1158, 00 2011.“
The Design of an Auto-tuning I/O Framework on Cray XT5 System,” Cray Users Group Conference (CUG'11) (Best Paper Finalist), Fairbanks, Alaska, May 2011.“
Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,” University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250), June 2011.“
Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems,” International Journal of High Performance Computing Applications, vol. 25, no. 3, pp. 342-350, 00 2011.“
Evaluation of the HPC Challenge Benchmarks in Virtualized Environments,” 6th Workshop on Virtualization in High-Performance Cloud Computing, Bordeaux, France, August 2011.“
Exploiting Fine-Grain Parallelism in Recursive LU Factorization,” Proceedings of PARCO'11, no. ICL-UT-11-04, Gent, Belgium, April 2011.“