%0 Generic %D 2020 %T SLATE Tutorial %A Mark Gates %A Jakub Kurzak %A Asim YarKhan %A Ali Charara %A Jamie Finney %A Dalal Sukkari %A Mohammed Al Farhan %A Ichitaro Yamazaki %A Panruo Wu %A Jack Dongarra %I 2020 ECP Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software %D 2019 %T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Maksims Abalenkovs %A Negin Bagherpour %A Sven Hammarling %A Jakub Sistek %B ACM Transactions on Mathematical Software %V 45 %8 2019-06 %G eng %N 2 %R https://doi.org/10.1145/3264491 %0 Conference Proceedings %B International Conference on Computational Science (ICCS 2018) %D 2018 %T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques %A Azzam Haidar %A Ahmad Abdelfattah %A Mawussi Zounon %A Panruo Wu %A Srikara Pranesh %A Stanimire Tomov %A Jack Dongarra %X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both hardware features and algorithms is an effective way to achieve power efficiency and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most energy-efficiency approaches aim to reduce consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency.
Compared to highly optimized linear system solvers, our kernels deliver a solution of the same accuracy up to 2× faster and reduce the energy consumption by up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to 4×, with more than 80% reduction in the energy consumption. %B International Conference on Computational Science (ICCS 2018) %I Springer %C Wuxi, China %V 10860 %P 586–600 %8 2018-06 %G eng %U https://rdcu.be/bcKSC %R https://doi.org/10.1007/978-3-319-93698-7_45 %0 Generic %D 2018 %T Parallel BLAS Performance Report %A Jakub Kurzak %A Mark Gates %A Asim YarKhan %A Ichitaro Yamazaki %A Panruo Wu %A Piotr Luszczek %A Jamie Finney %A Jack Dongarra %B SLATE Working Notes %I University of Tennessee %8 2018-04 %G eng %1 05 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2018 %T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures %A Ichitaro Yamazaki %A Jakub Kurzak %A Panruo Wu %A Mawussi Zounon %A Jack Dongarra %K linear algebra %K multithreading %K runtime %K symmetric indefinite matrices %X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention, not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization.
In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. We describe several techniques for fully utilizing a large number of cores through tasking while conforming to the OpenMP standard. Our performance results on current many-core architectures (including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8) demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package. %B IEEE Transactions on Parallel and Distributed Systems %V 29 %P 1879–1892 %8 2018-08 %G eng %N 8 %R https://doi.org/10.1109/TPDS.2018.2808964 %0 Generic %D 2017 %T C++ API for Batch BLAS %A Ahmad Abdelfattah %A Konstantin Arturov %A Cris Cecka %A Jack Dongarra %A Chip Freitag %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Panruo Wu %B SLATE Working Notes %I University of Tennessee %8 2017-12 %G eng %1 04 %0 Generic %D 2017 %T The Case for Directive Programming for Accelerator Autotuner Optimization %A Diana Fayad %A Jakub Kurzak %A Piotr Luszczek %A Panruo Wu %A Jack Dongarra %X In this work, we present the use of compiler pragma directives for parallelizing the autotuning of specialized compute kernels for hardware accelerators. We describe a set of constructs for parallelizing source code that prunes a generated search space with a large number of constraints for an autotuning infrastructure.
For better performance, we studied optimizations aimed at minimizing the run time. We also studied the behavior of the parallel load balance and the speedup on four different machines: x86, Xeon Phi, ARMv8, and POWER8. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-10 %G eng %0 Generic %D 2017 %T Designing SLATE: Software for Linear Algebra Targeting Exascale %A Jakub Kurzak %A Panruo Wu %A Mark Gates %A Ichitaro Yamazaki %A Piotr Luszczek %A Gerald Ragghianti %A Jack Dongarra %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-10 %G eng %9 SLATE Working Notes %1 03 %0 Conference Paper %B ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2017 %T Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers %A Azzam Haidar %A Panruo Wu %A Stanimire Tomov %A Jack Dongarra %X The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today’s powerful manycore GPU accelerators, e.g., the NVIDIA V100, which can provide 120 TeraFLOPS in FP16 alone. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy.
Our approach is based on the mixed-precision iterative refinement technique: we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for the first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32- or FP64-precision Ax = b solvers. Our results are reproducible, and the developments will be made available through the MAGMA library. We quantify in practice the performance and limitations of the approach. %B ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I ACM %C Denver, CO %8 2017-11 %G eng %0 Generic %D 2017 %T PLASMA 17 Performance Report %A Maksims Abalenkovs %A Negin Bagherpour %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Samuel Relton %A Jakub Sistek %A David Stevens %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Mawussi Zounon %X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-06 %G eng %0 Generic %D 2017 %T PLASMA 17.1 Functionality Report %A Maksims Abalenkovs %A Negin Bagherpour %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Samuel Relton %A Jakub Sistek %A David Stevens %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Mawussi Zounon %X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-06 %G eng %0 Generic %D 2017 %T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale %A Ahmad Abdelfattah %A Hartwig Anzt %A Aurelien Bouteiller %A Anthony Danalis %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Stephen Wood %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %9 SLATE Working Notes %1 01