%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains they provide on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design sits at the intersection of several fault-tolerance techniques and employs algorithm-based fault tolerance, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results validated our design decisions, as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%K plasma
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open-source, high-performance library that incorporates the developments presented and, in general, provides the DLA functionality of the popular LAPACK library to heterogeneous architectures of multicore CPUs with coprocessors. The LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology in which we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%V 23
%8 2015-01
%G eng
%N 1
%R 10.3233/SPR-140404

%0 Conference Proceedings
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%D 2015
%T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
%A Azzam Haidar
%A Yulu Jia
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%A Jack Dongarra
%K dataflow scheduling
%K hardware accelerators
%K multi-grain parallelism
%X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one or more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore CPU. The domain scientist would have to design and schedule an application with multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling.
The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
%I ACM
%C Austin, TX
%V No. 5
%8 2015-11
%G eng

%0 Conference Proceedings
%B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2013
%T CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience
%A Yulu Jia
%A Piotr Luszczek
%A George Bosilca
%A Jack Dongarra
%X Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur.
%C Montpellier, France
%8 2013-11
%G eng

%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013
%D 2013
%T Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance
%A Yulu Jia
%A George Bosilca
%A Piotr Luszczek
%A Jack Dongarra
%X This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
%C Denver, CO
%8 2013-11
%G eng

%0 Conference Paper
%B PPAM 2013
%D 2013
%T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K mic
%K xeon phi
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open-source, high-performance library that incorporates the developments presented and, in general, provides the DLA functionality of the popular LAPACK library to heterogeneous architectures of multicore CPUs with coprocessors. The LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology in which we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%C Warsaw, Poland
%8 2013-09
%G eng

%0 Generic
%D 2013
%T Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) are gaining widespread usage in the field of scientific computing, owing to the performance boost GPUs bring to computation-intensive applications. The typical configuration is to integrate GPUs and CPUs in the same system, where the CPUs handle the control flow and part of the computational workload, and the GPUs serve as accelerators that carry out the bulk of the data-parallel workload. In this paper, we design and implement a soft-error-resilient Hessenberg reduction algorithm on GPU-based hybrid platforms. Our design employs an algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation. We detect and correct soft errors online, without delaying the detection and correction to the end of the factorization. By utilizing the idle time of the CPUs and overlapping both host-side and GPU-side workloads, we minimize the observed overhead. Experimental results validated our design philosophy. Our algorithm introduces less than 2% performance overhead compared to the non-fault-tolerant hybrid Hessenberg reduction algorithm.
%B UT-CS-13-712
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng

%0 Generic
%D 2012
%T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
%A Jack Dongarra
%A Mark Gates
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12)
%C Salt Lake City, UT
%8 2012-11
%G eng