%0 Generic %D 2020 %T Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures %A Florent Lopez %A Edmond Chow %A Stanimire Tomov %A Jack Dongarra %K Asynchronous iterative methods %K Deep learning %K gpu %K multicore CPU %K Stochastic Gradient Descent %X We present a parallel asynchronous Stochastic Gradient Descent algorithm for shared memory architectures. Different from previous asynchronous algorithms, we consider the case where the gradient updates are not particularly sparse. In the context of the MagmaDNN framework, we compare the parallel efficiency of the asynchronous implementation with that of the traditional synchronous implementation. Tests are performed for training deep neural networks on multicore CPUs and GPU devices. %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2020-03 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS 2020) %D 2020 %T heFFTe: Highly Efficient FFT for Exascale %A Alan Ayala %A Stanimire Tomov %A Azzam Haidar %A Jack Dongarra %K exascale %K FFT %K gpu %K scalable algorithm %X Exascale computing aspires to meet the increasing demands from large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; henceforth, it is not only important to develop well-designed software, but also make it aware of the hardware architecture and efficiently exploit its power. Currently, several and diverse applications, such as those part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. 
We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from CPU state-of-the-art libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V-100 GPUs. %B International Conference on Computational Science (ICCS 2020) %C Amsterdam, Netherlands %8 2020-06 %G eng %R https://doi.org/10.1007/978-3-030-50371-0_19 %0 Conference Paper %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %D 2020 %T High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs %A Natalie Beams %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %A Tzanio Kolev %A Yohann Dudouit %K Batched linear algebra %K finite elements %K gpu %K high-order methods %K matrix-free FEM %K Tensor contractions %X We present new GPU implementations of the tensor contractions arising from basis-related computations for high-order finite element methods. We consider both tensor and non-tensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. 
In tensor cases, we see improvements of approximately 10-30% for some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMMs implementation is twice as fast as what was previously available for basis function order greater than five and greater than approximately 10^5 degrees of freedom in the mesh; up to ten times speedup is seen for eighth-order basis functions. %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %I IEEE %8 2020-11 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2020 %T Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD %A Yuechao Lu %A Ichitaro Yamazaki %A Fumihiko Ino %A Yasuyuki Matsushita %A Stanimire Tomov %A Jack Dongarra %K Divide and conquer %K gpu %K out-of-core computation %K Singular value decomposition %B Concurrency and Computation: Practice and Experience %8 2020-04 %G eng %R https://doi.org/10.1002/cpe.5754 %0 Generic %D 2019 %T GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems %A Hejer Shaiek %A Stanimire Tomov %A Alan Ayala %A Azzam Haidar %A Jack Dongarra %K CUDA-Aware MPI %K ECP %K FFT %K FFT-ECP %K gpu %K GPUDirect %X Fast Fourier transforms (FFTs) are used in applications ranging from molecular dynamics and spectrum estimation to machine learning, fast convolution and correlation, signal modulation, wireless multimedia applications, and others. However, FFTs are memory bound, and therefore, to accelerate them, it is crucial to avoid and optimize the FFTs’ communications. To this end, we present a 3-D FFT design for distributed graphics processing unit (GPU) systems that: (1) efficiently uses GPUs’ high bandwidth, (2) reduces global communications algorithmically, when possible, and (3) employs GPUDirect technologies as well as MPI optimizations in the development of high-performance FFTs for large-scale GPU-accelerated systems. 
We show that these developments and optimizations lead to very good strong scalability and a performance that is close to 90% of the theoretical peak. %B EuroMPI'19 Posters, Zurich, Switzerland %I ICL %8 2019-09 %G eng %9 Extended Abstract %0 Journal Article %J Parallel Computing %D 2019 %T Parallel Selection on GPUs %A Tobias Ribizel %A Hartwig Anzt %K approximate selection %K gpu %K kth order statistics %K multiselection %K parallel selection algorithm %X We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always leveraging the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – “pleasant” data distributions. At the same time, as the proposed SampleSelect algorithm does not work on the actual element values but on the element ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy. %B Parallel Computing %V 91 %8 2020-03 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167819119301796 %! 
Parallel Computing %R https://doi.org/10.1016/j.parco.2019.102588 %0 Journal Article %J Parallel Computing %D 2018 %T Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs %A Mark Gates %A Stanimire Tomov %A Jack Dongarra %K 2-stage %K accelerator %K Divide and conquer %K gpu %K Singular value decomposition %K SVD %X The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms to take full advantage of today’s high performance computers. For dense matrices, the classic algorithm for the singular value decomposition (SVD) uses a one stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two stage reduction to bidiagonal has been gaining popularity. It first reduces the matrix to band form using high performance Level 3 BLAS, then reduces the band matrix to bidiagonal form. As accelerators such as GPUs and co-processors are becoming increasingly widespread in high-performance computing, a question of great interest to many SVD users is how much the employment of a two stage reduction, as well as other current best practices in GPU computing, can accelerate this important routine. To fulfill this interest, we have developed an accelerated SVD employing a two stage reduction to bidiagonal and a number of other algorithms that are highly optimized for GPUs. Notably, we also parallelize and accelerate the divide and conquer algorithm used to solve the subsequent bidiagonal SVD. By accelerating all phases of the SVD algorithm, we provide a significant speedup compared to existing multi-core and GPU-based SVD implementations. 
In particular, using a P100 GPU, we illustrate a performance of up to 804 Gflop/s in double precision arithmetic to compute the full SVD of a 20k × 20k matrix in 90 seconds, which is 8.9 × faster than MKL on two 10 core Intel Haswell E5-2650 v3 CPUs, 3.7 × over the multi-core PLASMA two stage version, and 2.6 × over the previously accelerated one stage MAGMA version. %B Parallel Computing %V 74 %P 3–18 %8 2018-05 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167819117301758 %! Parallel Computing %R 10.1016/j.parco.2017.10.004 %0 Journal Article %J Journal of Advances in Modeling Earth Systems %D 2018 %T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling %A Jian Sun %A Joshua Fu %A John Drake %A Qingzhao Zhu %A Azzam Haidar %A Mark Gates %A Stanimire Tomov %A Jack Dongarra %K compiler %K CUDA %K data transfer %K gpu %K hybrid %K memory layout %X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulation with regional detail possible, but limited study has been done to explore the potential benefit for the atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to ~11.02× difference in the computational time. In contrast, the GPU version performs the best when using a combination of fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. 
Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the implementation of the hybrid CPU/GPU version, but rescheduling the workload among the CPU cores is required before the practical CAM4‐Chem simulation. %B Journal of Advances in Modeling Earth Systems %V 10 %P 1952–1969 %8 2018-08 %G eng %N 8 %R https://doi.org/10.1029/2018MS001276 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T Optimization and Performance Evaluation of the IDR Iterative Krylov Solver on GPUs %A Hartwig Anzt %A Moritz Kreutzer %A Eduardo Ponce %A Gregory D. Peterson %A Gerhard Wellein %A Jack Dongarra %K co-design %K gpu %K Induced dimension reduction (IDR) %K kernel fusion %K kernel overlap %K roofline performance model %X In this paper, we present an optimized GPU implementation for the induced dimension reduction algorithm. We improve data locality, combine it with an efficient sparse matrix vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound. 
%B The International Journal of High Performance Computing Applications %V 32 %P 220–230 %8 2018-03 %G eng %R https://doi.org/10.1177/1094342016646844 %0 Journal Article %J Parallel Computing %D 2017 %T Preconditioned Krylov Solvers on GPUs %A Hartwig Anzt %A Mark Gates %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K gpu %K ILU %K Jacobi %K Krylov solvers %K Preconditioning %X In this paper, we study the effect of enhancing GPU-accelerated Krylov solvers with preconditioners. We consider the BiCGSTAB, CGS, QMR, and IDR(s) Krylov solvers. For a large set of test matrices, we assess the impact of Jacobi and incomplete factorization preconditioning on the solvers’ numerical stability and time-to-solution performance. We also analyze how the use of a preconditioner impacts the choice of the fastest solver. %B Parallel Computing %8 2017-06 %G eng %U http://www.sciencedirect.com/science/article/pii/S0167819117300777 %! Parallel Computing %R 10.1016/j.parco.2017.05.006 %0 Conference Proceedings %B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2016 %T Efficiency of General Krylov Methods on GPUs – An Experimental Study %A Hartwig Anzt %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K algorithmic bombardment %K BiCGSTAB %K CGS %K Convergence %K Electric breakdown %K gpu %K graphics processing units %K Hardware %K IDR(s) %K Krylov solver %K Libraries %K linear systems %K QMR %K Sparse matrices %X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. 
For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods' performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are "orthogonal" in terms of problem suitability. We propose best practices for choosing methods in a "black box" scenario, where no information about the optimal solver is available. %B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %P 683-691 %8 2016-05 %G eng %R 10.1109/IPDPSW.2016.45 %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %D 2016 %T Efficiency of General Krylov Methods on GPUs – An Experimental Study %A Hartwig Anzt %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K algorithmic bombardment %K BiCGSTAB %K CGS %K gpu %K IDR(s) %K Krylov solver %K QMR %X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods’ performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are “orthogonal” in terms of problem suitability. We propose best practices for choosing methods in a “black box” scenario, where no information about the optimal solver is available. 
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %I IEEE %C Chicago, IL %8 2016-05 %G eng %R 10.1109/IPDPSW.2016.45 %0 Conference Paper %B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16) %D 2016 %T GPU-Aware Non-contiguous Data Movement In Open MPI %A Wei Wu %A George Bosilca %A Rolf vandeVaart %A Sylvain Jeaugey %A Jack Dongarra %K datatype %K gpu %K hybrid architecture %K MPI %K non-contiguous data %X
Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance.
To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
%B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16) %I ACM %C Kyoto, Japan %8 2016-06 %G eng %R http://dx.doi.org/10.1145/2907294.2907317 %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Hierarchical DAG scheduling for Hybrid Distributed Systems %A Wei Wu %A Aurelien Bouteiller %A George Bosilca %A Mathieu Faverge %A Jack Dongarra %K dense linear algebra %K gpu %K heterogeneous architecture %K PaRSEC runtime %X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle mapping the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e. Cholesky and QR, a carefully written algorithm, supported by an adaptive tasks-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments. 
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K plasma %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B Supercomputing Frontiers and Innovations %V 2 %8 2015-10 %G eng %R 10.14529/jsfi1504 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2014 %T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks %A Azzam Haidar %A Raffaele Solcà %A Mark Gates %A Stanimire Tomov %A Thomas C. 
Schulthess %A Jack Dongarra %K Eigensolver %K electronic structure calculations %K generalized eigensolver %K gpu %K high performance %K hybrid %K Multicore %K two-stage %X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code. 
%B International Journal of High Performance Computing Applications %V 28 %P 196-209 %8 2014-05 %G eng %N 2 %& 196 %R 10.1177/1094342013502097 %0 Conference Paper %B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014 %D 2014 %T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes %A Xavier Lacoste %A Mathieu Faverge %A Pierre Ramet %A Samuel Thibault %A George Bosilca %K DAG based runtime %K gpu %K Multicore %K Sparse linear solver %X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces the application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers – native PaStiX, StarPU and PaRSEC schedulers – on different execution contexts is performed. The analysis highlights the similarities from a performance point of view between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments, and are, therefore, a sustainable solution for hybrid environments. 
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Computing %D 2013 %T LU Factorization with Partial Pivoting for a Multicore System with Accelerators %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K accelerator %K Gaussian elimination %K gpu %K lu factorization %K manycore %K Multicore %K partial pivoting %K plasma %X LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs. 
%B IEEE Transactions on Parallel and Distributed Computing %V 24 %P 1613-1621 %8 2013-08 %G eng %N 8 %& 1613 %R http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242 %0 Journal Article %J Journal of Computational Science %D 2013 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K gpgpu %K gpu %K magma %X The general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. What followed is the fact that fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs. %B Journal of Computational Science %V 4 %P 457–464 %8 2013-11 %G eng %N 6 %R http://dx.doi.org/10.1016/j.jocs.2013.01.004