“A Tribute to Gene Golub,” Computing in Science and Engineering: IEEE, pp. 5, January 2008.
“Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,” Concurrency and Computation: Practice and Experience, October 2013.
“Truss Structural Optimization Using NetSolve System,” Meeting of the Japan Society of Mechanical Engineers, Kyoto University, Kyoto, Japan, October 2002.
“Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures,” FOSS4G 2010, Barcelona, Spain, September 2010.
“Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,” Concurrency and Computation: Practice and Experience, November 2013. DOI: 10.1002/cpe.3173
“Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors,” Concurrency and Computation: Practice and Experience, vol. 27, issue 4, pp. 885-904, September 2014. DOI: 10.1002/cpe.3341
“An Updated Set of Basic Linear Algebra Subprograms (BLAS),” ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135-151, December 2002. DOI: 10.1145/567806.567807
“Updating Incomplete Factorization Preconditioners for Model Order Reduction,” Numerical Algorithms, vol. 73, issue 3, pp. 611–630, February 2016. DOI: 10.1007/s11075-016-0110-2
“The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot,” Journal of Computational Physics (submitted), January 2006.
“The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot,” Journal of Computational Physics, vol. 223, pp. 774-782, 2007.
“User-Defined Events for Hardware Performance Monitoring,” Procedia Computer Science, vol. 4: Elsevier, pp. 2096-2104, May 2011. DOI: 10.1016/j.procs.2011.04.229
“Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning,” Journal of Parallel and Distributed Computing, vol. 119, pp. 219–230, November 2018. DOI: 10.1016/j.jpdc.2018.04.017
“Using MAGMA with PGI Fortran,” PGI Insider, November 2010.
“Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy,” ACM Transactions on Mathematical Software, vol. 34, no. 4, pp. 17-22, 2008.
“Using multiple levels of parallelism to enhance the performance of domain decomposition solvers,” Parallel Computing, vol. 36, no. 5-6: Elsevier journals, pp. 285-296, 2010.
“Variable-Size Batched Gauss–Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,” Parallel Computing, January 2018. DOI: 10.1016/j.parco.2017.12.006
“The Virtual Instrument: Support for Grid-enabled Scientific Simulations,” International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 3-17, January 2004.
“The Virtual Instrument: Support for Grid-enabled Scientific Simulations,” Journal of Parallel and Distributed Computing (submitted), October 2002.
“VisPerf: Monitoring Tool for Grid Computing,” Lecture Notes in Computer Science, vol. 2659: Springer Verlag, Heidelberg, pp. 233-243, 2003.
“Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems,” SIAM Journal on Computing (submitted), March 2012.
“Exascale Computing and Big Data,” Communications of the ACM, vol. 58, no. 7: ACM, pp. 56-68, July 2015. DOI: 10.1145/2699414
The HPL Benchmark: Past, Present & Future, ISC High Performance, Frankfurt, Germany, July 2016.
Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision, Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), ACM Student Research Poster, November 2018.
Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs, Gatlinburg, TN, Smoky Mountains Computational Sciences and Engineering Conference (SMC16), Poster, September 2016.
“Acceleration of the BLAST Hydro Code on GPU,” Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.
Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes, San Jose, CA, GPU Technology Conference (GTC16), Poster, April 2016.
Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project), Austin, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08), November 2008.
FFT-ECP Fast Fourier Transform, Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines, Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR), Washington, DC, NSF PI Meeting, Poster, April 2018. DOI: 10.6084/m9.figshare.6174143.v3
Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects, Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project, Portland, Oregon, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Optimizing Batch HGEMM on Small Sizes Using Tensor Cores, San Jose, CA, GPU Technology Conference (GTC), March 2019.
PAPI 5: Measuring Power, Energy, and the Cloud, Austin, TX, 2013 IEEE International Symposium on Performance Analysis of Systems and Software, April 2013.
Power-aware Computing on GPGPUs, Gatlinburg, TN, Fall Creek Falls Conference, Poster, September 2011.
Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators, Knoxville, TN, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster, July 2010.
A Standard for Batched BLAS Routines, Paris, France, 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16), April 2016.
Tensor Contractions using Optimized Batch GEMM Routines, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Towards a High-Performance Tensor Algebra Package for Accelerators, Gatlinburg, TN, Smoky Mountains Computational Sciences and Engineering Conference (SMC15), September 2015.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption, Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Tutorial, July 2010.
Accelerating Linear Algebra with MAGMA, Knoxville, TN, ECP Annual Meeting 2018, Tutorial, February 2018.
Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched, Atlanta, GA, SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation, March 2017.
Autotuning Dense Linear Algebra Libraries on GPUs, Basel, Switzerland, Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010), June 2010.
Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs, Pittsburgh, Pennsylvania, SIAM Annual Meeting, July 2017.
Dense Linear Algebra Solvers for Multicore with GPU Accelerators, Atlanta, GA, International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010.
On the Design, Autotuning, and Optimization of GPU Kernels for Kinetic Network Simulations Using Fast Explicit Integration and GPU Batched Computation, Oak Ridge, TN, Joint Institute for Computational Sciences Seminar Series, Presentation, September 2015.
Does your tool support PAPI SDEs yet?, Tahoe City, CA, 13th Scalable Tools Workshop, July 2019.
Flexible Batched Sparse Matrix Vector Product on GPUs, Denver, Colorado, ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 2017.