%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%D 2018
%T Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%X Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem Ax = b, where A is a large dense matrix and a double-precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide, as well as to improved accuracy over classical FP16 arithmetic, which is obtained because the GEMM accumulation occurs in FP32 arithmetic.
%I IEEE
%C Dallas, TX
%8 2018-11
%G eng
%R https://doi.org/10.1109/SC.2018.00050
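
The abstract describes the FP16-FP64 iterative-refinement scheme only at a high level. Below is a minimal CPU sketch of that idea in Python/NumPy/SciPy, not the paper's GPU implementation: FP32 stands in for the low precision (SciPy's LU does not run in FP16, and the paper's speedups come from FP16 Tensor-Core GEMMs inside the factorization), and the function name mixed_precision_ir, the tolerance, and the iteration cap are illustrative assumptions.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_ir(A, b, tol=1e-12, max_iters=50):
        # Factorize a low-precision copy of A (FP32 here; the paper uses an
        # FP16 Tensor-Core LU on the GPU).
        A_lo = A.astype(np.float32)
        lu, piv = lu_factor(A_lo)
        # Initial low-precision solve, promoted to FP64.
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iters):
            r = b - A @ x  # residual computed in FP64
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            # Correction solve reuses the low-precision factors.
            d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
            x += d
        return x

The structure mirrors the approach the abstract names: the expensive O(n^3) factorization runs in low precision, while the cheap O(n^2) residual and update steps run in FP64 so the refined solution reaches double-precision accuracy.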