%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) have seen widespread adoption in scientific computing, owing to the performance gains they provide on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design sits at the intersection of several fault-tolerance techniques: it employs algorithm-based fault tolerance, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results validate our design decisions: our algorithm introduces less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2016
%T High Performance Conjugate Gradient Benchmark: A New Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 30
%P 3-10
%8 2016-02
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342015593158
%N 1
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342015593158
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2015
%T High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B The International Journal of High Performance Computing Applications
%G eng
%R 10.1177/1094342015593158
%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B Scientific Programming
%V 23
%8 2015-01
%G eng
%N 1
%R 10.3233/SPR-140404
%0 Generic
%D 2015
%T HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael A. Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2015-01
%G eng
%U http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf
%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K factorization
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given for the algorithms that underlie the solution of linear systems: the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. The execution of these tasks is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components.
%C Eugene, OR
%8 2014-06
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K algorithms
%K bidiagonal reduction
%K bulge chasing
%K data translation layer
%K dynamic scheduling
%K high performance kernels
%K performance
%K tile algorithms
%K two-stage approach
%X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows most of the computation during the first stage (reduction to band form) to be cast into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%G eng
%N 3
%R 10.1145/2450153.2450154
%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%K exascale
%K hpc challenge
%K hpcc
%I Taylor and Francis
%C Boca Raton, FL
%@ 978-1-4665-6834-1
%G eng
%& 2
%0 Journal Article
%J ICCS 2012
%D 2012
%T High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng
%0 Journal Article
%J On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%D 2012
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%E Jeffrey Vetter
%B On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%I Chapman & Hall/CRC Press
%8 2012-00
%G eng
%0 Generic
%D 2011
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247)
%8 2011-05
%G eng
%0 Journal Article
%J IEEE Cluster 2011
%D 2011
%T High Performance Dense Linear System Solver with Soft Error Resilience
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%K ft-la
%B IEEE Cluster 2011
%C Austin, TX
%8 2011-09
%G eng
%0 Conference Proceedings
%B Proceedings of MTAGS11
%D 2011
%T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%C Seattle, WA
%8 2011-11
%G eng
%0 Journal Article
%J in Beautiful Code Leading Programmers Explain How They Think (Chapter 14)
%D 2008
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E Greg Wilson
%B in Beautiful Code Leading Programmers Explain How They Think (Chapter 14)
%P 243-282
%8 2008-01
%G eng
%0 Generic
%D 2008
%T HPCS Library Study Effort
%A Jack Dongarra
%A James Demmel
%A Parry Husbands
%A Piotr Luszczek
%B University of Tennessee Computer Science Technical Report, UT-CS-08-617
%8 2008-01
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2007
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Jack Dongarra
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 21
%P 360-369
%8 2007-00
%G eng
%0 Journal Article
%J in Beautiful Code Leading Programmers Explain How They Think
%D 2007
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E Greg Wilson
%B in Beautiful Code Leading Programmers Explain How They Think
%I O'Reilly Media, Inc.
%8 2007-06
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2006
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Piotr Luszczek
%K hpcc
%K lfc
%B International Journal of High Performance Computing Applications (to appear)
%8 2006-00
%G eng
%0 Conference Proceedings
%B SC06 Conference Tutorial
%D 2006
%T The HPC Challenge (HPCC) Benchmark Suite
%A Piotr Luszczek
%A David Bailey
%A Jack Dongarra
%A Jeremy Kepner
%A Robert Lucas
%A Rolf Rabenseifner
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%I IEEE
%C Tampa, Florida
%8 2006-11
%G eng
%0 Journal Article
%J SC|05 Tutorial - S13
%D 2005
%T HPC Challenge v1.x Benchmark Suite
%A Piotr Luszczek
%A David Koester
%K hpcc
%B SC|05 Tutorial - S13
%C Seattle, Washington
%8 2005-01
%G eng