%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Combining checkpointing and replication for reliable execution of linear workflows with fail-stop and silent errors
%A Anne Benoit
%A Aurelien Cavelan
%A Florina M. Ciorba
%A Valentin Le Fèvre
%A Yves Robert
%B International Journal of Networking and Computing
%V 9
%P 2-27
%8 2019
%G eng
%0 Journal Article
%J Parallel Computing
%D 2019
%T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms
%A Valentin Le Fèvre
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Atsushi Hori
%A George Bosilca
%A Jack Dongarra
%B Parallel Computing
%V 85
%P 1–12
%8 07-2019
%G eng
%R https://doi.org/10.1016/j.parco.2019.02.002
%0 Journal Article
%J International Journal of High Performance Computing and Networking (to appear)
%D 2019
%T Evaluation of Directive-Based Performance Portable Programming Models
%A M. Graham Lopez
%A Wayne Joubert
%A Verónica Larrea
%A Oscar Hernandez
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X We present an extended exploration of the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architecture with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, and we document how much tuning might be required and what lessons we can learn from these experiences. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. To better understand fundamental compute vs. bandwidth bound characteristics, we add the compute-bound Level 3 BLAS GEMM kernel to our linear algebra evaluation. We implement the kernels of interest using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both x86_64 and Power8 with attached NVIDIA GPUs, x86_64 multicores, self-hosted Intel Xeon Phi KNL, as well as an x86_64 host system with Intel Xeon Phi coprocessors. We update these evaluations with the newest version of the NVIDIA Pascal architecture (P100), Intel KNL 7230, Power8+, and the newest supporting compiler implementations. Furthermore, we present in detail what factors affected the performance portability, including how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimise and target multiple platforms.
%B International Journal of High Performance Computing and Networking (to appear)
%G eng
%R 10.1504/IJHPCN.2017.10009064
%0 Journal Article
%J Int. Journal of High Performance Computing Applications
%D 2019
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Han, Li
%A Le Fèvre, Valentin
%A Canon, Louis-Claude
%A Robert, Yves
%A Vivien, Frédéric
%B Int. Journal of High Performance Computing Applications
%8 To appear
%G eng
%0 Conference Proceedings
%B RTSS'2019, the 40th IEEE Real-Time Systems Symposium
%D 2019
%T Improved energy-aware strategies for periodic real-time tasks under reliability constraints
%A Li Han
%A Louis-Claude Canon
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B RTSS'2019, the 40th IEEE Real-Time Systems Symposium
%I IEEE Press
%G eng
%0 Conference Paper
%B Future of Information and Communication Conference (FICC)
%D 2019
%T Interoperable Convergence of Storage, Networking, and Computation
%A Micah Beck
%A Terry Moore
%A Piotr Luszczek
%K active networks
%K distributed cloud
%K distributed processing
%K distributed storage
%K edge computing
%K network convergence
%K network layering
%K scalability
%X In every form of digital store-and-forward communication, intermediate forwarding nodes are computers, with attendant memory and processing resources. This has inevitably given rise to efforts to create a wide-area infrastructure that goes beyond simple store-and-forward, a facility that makes more general and varied use of the potential of this collection of increasingly powerful nodes. Historically, these efforts predate the advent of globally routed packet networking. The desire for a converged infrastructure of this kind has only intensified over the last 30 years, as memory, storage, and processing resources have increased in both density and speed while simultaneously decreasing in cost. Although there is a general consensus that it should be possible to define and deploy such a dramatically more capable wide-area facility, a great deal of investment in research prototypes has yet to produce a credible candidate architecture. Drawing on technical analysis, historical examples, and case studies, we present an argument for the hypothesis that in order to realize a distributed system with the kind of convergent generality and deployment scalability that might qualify as "future-defining," we must build it from a small set of simple, generic, and limited abstractions of the low-level resources (processing, storage and network) of its intermediate nodes.
%B Future of Information and Communication Conference (FICC)
%I Science and Information (SAI)
%C San Francisco
%8 03-2019
%G eng
%U https://arxiv.org/abs/1706.07519
%0 Conference Proceedings
%B FICC 2019
%D 2019
%T Interoperable Convergence of Storage, Networking, and Computation
%A Micah Beck
%A Terry Moore
%A Piotr Luszczek
%A Anthony Danalis
%X In every form of digital store-and-forward communication, intermediate forwarding nodes are computers, with attendant memory and processing resources. This has inevitably stimulated efforts to create a wide-area infrastructure that goes beyond simple store-and-forward to create a platform that makes more general and varied use of the potential of this collection of increasingly powerful nodes. Historically, these efforts predate the advent of globally routed packet networking. The desire for a converged infrastructure of this kind has only intensified over the last 30 years, as memory, storage, and processing resources have increased in both density and speed while simultaneously decreasing in cost. Although there is a general consensus that it should be possible to define and deploy such a dramatically more capable wide-area platform, a great deal of investment in research prototypes has yet to produce a credible candidate architecture. Drawing on technical analysis, historical examples, and case studies, we present an argument for the hypothesis that in order to realize a distributed system with the kind of convergent generality and deployment scalability that might qualify as "future-defining," we must build it from a small set of simple, generic, and limited abstractions of the low level resources (processing, storage and network) of its intermediate nodes.
%B FICC 2019
%I Springer
%C San Francisco, CA
%8 March 14-15
%G eng
%0 Journal Article
%J Future Generation Computer Systems
%D 2019
%T Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging
%A Nuria Losada
%A George Bosilca
%A Aurelien Bouteiller
%A Patricia González
%A María J. Martín
%K Application-level checkpointing
%K Local rollback
%K Message logging
%K MPI
%K resilience
%X The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface – the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard – enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and overall the resilience impact on the applications.
%B Future Generation Computer Systems
%V 91
%P 450-464
%8 02-2019
%G eng
%R https://doi.org/10.1016/j.future.2018.09.041
%0 Conference Paper
%B International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
%A Zhaojun Bai
%A Jack Dongarra
%A Ding Lu
%A Ichitaro Yamazaki
%X Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that reduces both the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. Speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.
%B International Parallel and Distributed Processing Symposium (IPDPS)
%8 05-2019
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software (to appear)
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software (to appear)
%G eng
%0 Conference Proceedings
%B SC'2019, the IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis
%D 2019
%T Replication is More Efficient Than You Think
%A Anne Benoit
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%B SC'2019, the IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis
%I ACM Press
%8 11-2019
%G eng
%0 Conference Paper
%B The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18)
%D 2018
%T ADAPT: An Event-Based Adaptive Collective Communication Framework
%A Xi Luo
%A Wei Wu
%A George Bosilca
%A Thananon Patinyasakdikul
%A Linnan Wang
%A Jack Dongarra
%X The increase in scale and heterogeneity of high-performance computing (HPC) systems predisposes the performance of Message Passing Interface (MPI) collective communications to be susceptible to noise, and to adapt to a complex mix of hardware capabilities. The designs of state-of-the-art MPI collectives heavily rely on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. Therefore, such design philosophy must be reconsidered to efficiently and robustly run on the large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI, using event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations, while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries of different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance against noise compared to other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using ADAPT event-based broadcast and reduce operations.
%B The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18)
%I ACM Press
%C Tempe, Arizona
%8 06-2018
%@ 9781450357852
%G eng
%R 10.1145/3208040.3208054
%0 Journal Article
%J Proceedings of the IEEE
%D 2018
%T Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yaohung Tsai
%K Dense numerical linear algebra
%K performance autotuning
%X Computational problems in engineering and scientific disciplines often rely on the solution of many instances of small systems of linear equations, which are called batched solves. In this paper, we focus on the important variants of both batch Cholesky factorization and subsequent substitution. The former requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of these kernels that implement the factorization and the two substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for solvers of linear systems of equations. Due to their throughput-oriented design, GPUs exhibit the highest processing rates among the available processors. However, without careful design and coding, this speed is mostly restricted to large matrix sizes. We show an automated exploration of the implementation space as well as a new data layout for the batched class of SPD solvers. Our tests involve the solution of many thousands of linear SPD systems of exactly the same size. The primary focus of our techniques is on the individual matrices in the batch that have dimensions ranging from 5-by-5 up to 100-by-100. We compare our autotuned solvers against the state-of-the-art solvers such as those provided through NVIDIA channels and publicly available in the optimized MAGMA library. The observed performance is competitive and many times superior for many practical cases. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architecture platforms.
%B Proceedings of the IEEE
%V 106
%P 2040–2055
%8 11-2018
%G eng
%N 11
%R 10.1109/JPROC.2018.2868961
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2018
%T Autotuning Techniques for Performance-Portable Point Set Registration in 3D
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Vasileios Maroulas
%A Jack Dongarra
%X We present an autotuning approach applied to exhaustive performance engineering of the EM-ICP algorithm for the point set registration problem with a known reference. We were able to achieve progressively higher performance levels through a variety of code transformations and an automated procedure of generating a large number of implementation variants. Furthermore, we managed to exploit code patterns that are not common when only attempting manual optimization but which yielded in our tests better performance for the chosen registration algorithm. Finally, we also show how we maintained high levels of the performance rate in a portable fashion across a wide range of hardware platforms including multicore, manycore coprocessors, and accelerators. Each of these hardware classes is much different from the others and, consequently, cannot reliably be mastered by a single developer in a short time required to deliver a close-to-optimal implementation. We assert in our concluding remarks that our methodology as well as the presented tools provide a valid automation system for software optimization tasks on modern HPC hardware.
%B Supercomputing Frontiers and Innovations
%V 5
%8 12-2018
%G eng
%& 42
%R 10.14529/jsfi180404
%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The extensions beyond the original BLAS standard are considered that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 07-2018
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 07-2018
%G eng
%N 4
%R https://doi.org/10.1177/1094342018778123
%0 Generic
%D 2018
%T Distributed Termination Detection for HPC Task-Based Environments
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%A Jack Dongarra
%X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 06-2018
%G eng
%0 Conference Paper
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%D 2018
%T Do moldable applications perform better on failure-prone HPC platforms?
%A Valentin Le Fèvre
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Atsushi Hori
%A Yves Robert
%A Jack Dongarra
%X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage.
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%S LNCS
%I Springer Verlag
%C Turin, Italy
%8 08-2018
%G eng
%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%X This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, contrary to fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gain over both CkptAll and CkptNone, for a wide variety of workflows.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 08-2018
%G eng
%0 Generic
%D 2018
%T Implementation of the C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 06-2018
%G eng
%0 Generic
%D 2018
%T Initial Integration and Evaluation of SLATE and STRUMPACK
%A Pieter Ghysels
%A Sherry Li
%A Asim YarKhan
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 12-2018
%G eng
%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 09-2018
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2018
%T Parallel BLAS Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I University of Tennessee
%8 04-2018
%G eng
%0 Generic
%D 2018
%T Parallel Norms Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 06-2018
%G eng
%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 11-2018
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2017
%T Argobots: A Lightweight Low-Level Threading and Tasking Framework
%A Sangmin Seo
%A Abdelhalim Amer
%A Pavan Balaji
%A Cyril Bordage
%A George Bosilca
%A Alex Brooks
%A Philip Carns
%A Adrian Castello
%A Damien Genet
%A Thomas Herault
%A Shintaro Iwasaki
%A Prateek Jindal
%A Sanjay Kale
%A Sriram Krishnamoorthy
%A Jonathan Lifflander
%A Huiwei Lu
%A Esteban Meneses
%A Marc Snir
%A Yanhua Sun
%A Kenjiro Taura
%A Pete Beckman
%K Argobots
%K context switch
%K I/O
%K interoperability
%K lightweight
%K MPI
%K OpenMP
%K stackable scheduler
%K tasklet
%K user-level thread
%X In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures or are not as powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by the user or high-level programming model. We describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.
%B IEEE Transactions on Parallel and Distributed Systems
%8 10-2017
%G eng
%U http://ieeexplore.ieee.org/document/8082139/
%R 10.1109/TPDS.2017.2766062
%0 Conference Paper
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2017
%T Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yu Pei
%A Jack Dongarra
%K batch computation
%K Cholesky Factorization
%K data layout
%K GPU computing
%K numerical linear algebra
%X Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Orlando, FL
%8 06-2017
%G eng
%R 10.1109/IPDPSW.2017.18
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2017
%T Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
%A Mathieu Faverge
%A Julien Langou
%A Yves Robert
%A Jack Dongarra
%K Algorithm design and analysis
%K Approximation algorithms
%K Kernel
%K Multicore processing
%K Shape
%K Software algorithms
%K Transforms
%X We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Orlando, FL
%8 05-2017
%G eng
%R 10.1109/IPDPS.2017.46
%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4
%0 Generic
%D 2017
%T C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Konstantin Arturov
%A Cris Cecka
%A Jack Dongarra
%A Chip Freitag
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Panruo Wu
%B SLATE Working Notes
%I University of Tennessee
%8 12-2017
%G eng
%0 Generic
%D 2017
%T C++ API for BLAS and LAPACK
%A Mark Gates
%A Piotr Luszczek
%A Ahmad Abdelfattah
%A Jakub Kurzak
%A Jack Dongarra
%A Konstantin Arturov
%A Cris Cecka
%A Chip Freitag
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 06-2017
%G eng
%0 Generic
%D 2017
%T The Case for Directive Programming for Accelerator Autotuner Optimization
%A Diana Fayad
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Jack Dongarra
%X In this work, we present the use of compiler pragma directives for parallelizing autotuning of specialized compute kernels for hardware accelerators. We use a set of constructs to parallelize source code that prunes a generated search space with a large number of constraints for an autotuning infrastructure. For better performance, we studied optimizations aimed at minimizing the run time. We also studied the behavior of the parallel load balance and the speedup on four different machines: x86, Xeon Phi, ARMv8, and POWER8.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 10-2017
%G eng
%0 Generic
%D 2017
%T Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%I SIAM Annual Meeting
%C Pittsburgh, Pennsylvania
%8 07-2017
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101
%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 10-2017
%G eng
%9 SLATE Working Notes
%0 Journal Article
%J International Journal of High Performance Computing and Networking (IJHPCN)
%D 2017
%T Evaluation of Directive-based Performance Portable Programming Models
%A M. Graham Lopez
%A Verónica Larrea
%A Wayne Joubert
%A Oscar Hernandez
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%B International Journal of High Performance Computing and Networking (IJHPCN)
%V (In Press)
%8 2017
%G eng
%0 Conference Proceedings
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%D 2017
%T Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%C Orlando, FL
%8 06-2017
%G eng
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Michael Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as a third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as a static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 09-2017
%G eng
%9 Technical Report
%0 Conference Paper
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%D 2017
%T Optimal Checkpointing Period with replicated execution on heterogeneous platforms
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%X In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try and execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
%B 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
%I IEEE Computer Society Press
%C Washington, DC
%8 06-2017
%G eng
%R 10.1145/3086157.3086165
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 06-2017
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 06-2017
%G eng
%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 06-2017
%G eng
%9 SLATE Working Notes
%0 Journal Article
%J IEEE Transactions on Computers
%D 2017
%T Towards Optimal Multi-Level Checkpointing
%A Anne Benoit
%A Aurelien Cavelan
%A Valentin Le Fèvre
%A Yves Robert
%A Hongyang Sun
%K checkpointing
%K Dynamic programming
%K Error analysis
%K Heuristic algorithms
%K Optimized production technology
%K protocols
%K Shape
%B IEEE Transactions on Computers
%V 66
%P 1212–1226
%8 07-2017
%G eng
%N 7
%R 10.1109/TC.2016.2643660
%0 Generic
%D 2016
%T 2016 Dense Linear Algebra Software Packages Survey
%A Jack Dongarra
%A Jim Demmel
%A Julien Langou
%A Julie Langou
%X The 2016 Dense Linear Algebra Software Packages Survey was administered from January 1st 2016 to April 12 2016. 234 respondents answered the survey. The survey was advertised directly to the Linear Algebra community via our LAPACK/ScaLAPACK forum, NA Digest and we also directly contacted vendors and linear algebra experts. The breakdown of respondents was: 74% researchers or scientists, 25% were Principal Investigators and 25% Software maintainers or System administrators. The goal of the survey was to get the Linear Algebra community opinion and provide input on dense linear algebra software packages, in particular LAPACK, ScaLAPACK, PLASMA and MAGMA. The ultimate purpose of the survey was to improve these libraries to benefit our user community. The survey would allow the team to prioritize the many possible improvements that could be done. We also asked input from users accessing these libraries via 3rd party interfaces, for example MATLAB, Intel’s MKL, Python’s NumPy, AMD's ACML, and many others.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 09-2016
%G eng
%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 05-2016
%G eng
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 05-2016
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2016
%T High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael Heroux
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 30
%P 3 - 10
%8 02-2016
%G eng
%U http://hpc.sagepub.com/cgi/doi/10.1177/1094342015593158
%N 1
%! International Journal of High Performance Computing Applications
%R 10.1177/1094342015593158
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
%A Ahmad Abdelfattah
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications.
%B Concurrency and Computation: Practice and Experience
%V 28
%P 3447 - 3465
%8 05-2016
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full
%N 12
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3874
%0 Journal Article
%J International Journal of Parallel Programming
%D 2016
%T Porting the PLASMA Numerical Library to the OpenMP Standard
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which allows PLASMA to transition from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such a transition.
%B International Journal of Parallel Programming
%8 06-2016
%G eng
%U http://link.springer.com/10.1007/s10766-016-0441-6
%! Int J Parallel Prog
%R 10.1007/s10766-016-0441-6
%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Search Space Generation and Pruning System for Autotuners
%A Piotr Luszczek
%A Mark Gates
%A Jakub Kurzak
%A Anthony Danalis
%A Jack Dongarra
%X This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space, while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to a standard C code for fast and multithreaded, as necessary, evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by the user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix multiplication kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves application of 10 complex pruning constraints. The speed of evaluation is compared against generators created using imperative programming style in various scripting and compiled languages.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 05-2016
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)
%D 2016
%T Towards Achieving Performance Portability Using Directives for Accelerators
%A M. Graham Lopez
%A Verónica Larrea
%A Wayne Joubert
%A Oscar Hernandez
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both x86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, as well as an x86_64 host system with Intel Xeon Phi coprocessors. In this paper, we explain what factors affected the performance portability such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target to multiple platforms.
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)
%I Innovative Computing Laboratory, University of Tennessee
%C Salt Lake City, Utah
%8 11-2016
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction
%A Hartwig Anzt
%A William Sawyer
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%B International Journal of High Performance Computing Applications
%G eng
%0 Conference Paper
%B EuroMPI/Asia 2015 Workshop
%D 2015
%T Batched Matrix Computations on Hardware Accelerators
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared to a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU.
%B EuroMPI/Asia 2015 Workshop
%C Bordeaux, France
%8 09-2015
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Batched matrix computations on hardware accelerators based on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared with a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU.
%B International Journal of High Performance Computing Applications
%8 02-2015
%G eng
%R 10.1177/1094342014567546
%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%D 2015
%T Cholesky Across Accelerators
%A Asim YarKhan
%A Azzam Haidar
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%I IEEE
%C Elizabeth, NJ
%8 08-2015
%G eng
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Efficient Eigensolver Algorithms on Accelerator Based Architectures
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges, from our algorithm design through kernel optimization and tuning to our programming model, in the development of a scalable high-performance symmetric eigenvalue and singular value solver.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 10-2015
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K Autotuning
%K energy efficiency
%K hardware accelerators
%K matrix multiplication
%K power
%X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 5096 - 5113
%8 Oct-12-2015
%G eng
%U http://doi.wiley.com/10.1002/cpe.3516
%N 17
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3516
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T Experiences in autotuning matrix multiplication for energy minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 27
%P 5096-5113
%8 12-2015
%G eng
%N 17
%R 10.1002/cpe.3516
%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications
%D 2015
%T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization
%A Azzam Haidar
%A Asim YarKhan
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with a degree of parallelism between those of GPUs and multicore-CPUs. Additionally, effectively using distributed memory nodes brings another level of complexity, where the workload must be carefully partitioned over the nodes. In this work we use a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets an appropriately sized workload grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.
%B 17th IEEE International Conference on High Performance Computing and Communications
%C Newark, NJ
%8 08-2015
%G eng
%0 Conference Paper
%B ISC High Performance
%D 2015
%T Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations
%A Azzam Haidar
%A Tingxing Dong
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%B ISC High Performance
%I Springer
%C Frankfurt, Germany
%8 07-2015
%G eng
%0 Conference Proceedings
%B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies
%D 2015
%T From MPI to OpenSHMEM: Porting LAMMPS
%A Tang, Chunyan
%A Bouteiller, Aurelien
%A Herault, Thomas
%A Gorentla Venkata, Manjunath
%A Bosilca, George
%E Gorentla Venkata, Manjunath
%E Shamis, Pavel
%E Imam, Neena
%E Lopez, M. Graham
%X This work details the opportunities and challenges of porting a Petascale, MPI-based application, LAMMPS, to OpenSHMEM. We investigate the major programming challenges stemming from the differences in communication semantics, address space organization, and synchronization operations between the two programming models. This work provides several approaches to solve those challenges for representative communication patterns in LAMMPS, e.g., by considering group synchronization, tracking of peers' buffer status, and unpacked direct transfer of scattered data. The performance of LAMMPS is evaluated on the Titan HPC system at ORNL. The OpenSHMEM implementations are compared with MPI versions in terms of both strong and weak scaling. The results show that OpenSHMEM provides rich semantics for implementing scalable scientific applications. In addition, the experiments demonstrate that OpenSHMEM can compete with, and often improve on, the optimized MPI implementation.
%B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies
%I Springer International Publishing
%C Annapolis, MD, USA
%P 121–137
%@ 978-3-319-26428-8
%G eng
%R 10.1007/978-3-319-26428-8_8
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2015
%T High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems
%A Jack Dongarra
%A Michael Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B The International Journal of High Performance Computing Applications
%G eng
%R 10.1177/1094342015593158
%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B Scientific Programming
%V 23
%8 01-2015
%G eng
%N 1
%R 10.3233/SPR-140404
%0 Generic
%D 2015
%T HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems
%A Jack Dongarra
%A Michael Heroux
%A Piotr Luszczek
%K Additive Schwarz
%K HPC Benchmarking
%K Multigrid smoothing
%K Preconditioned Conjugate Gradient
%K Validation and Verification
%X We describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 01-2015
%G eng
%U http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf
%0 Conference Paper
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award)
%D 2015
%T MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
%A Azzam Haidar
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%X Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constraints of low power consumption and small size, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded-system-aware algorithms that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, e.g., featuring 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries.
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award)
%I IEEE
%C Waltham, MA
%8 09-2015
%G eng
%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 06-2015
%G eng
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2015
%T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%K lu factorization
%K Numerical algorithms
%K QR factorization
%K Stability
%K Performance
%X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap as QR steps in terms of floating-point operations. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performance.
%B Journal of Parallel and Distributed Computing
%V 85
%P 32-46
%8 11-2015
%G eng
%R 10.1016/j.jpdc.2015.06.007
%0 Conference Paper
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8)
%D 2015
%T Optimization for Performance and Energy for Batched Matrix Computations on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5-fold speedup on the K40 GPU.
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8)
%I ACM
%C San Francisco, CA
%8 02-2015
%G eng
%R 10.1145/2716282.2716288
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 10-2015
%G eng
%R 10.14529/jsfi1504
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
%A Theo Mary
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 11-2015
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 11-2015
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 04-2015
%G eng
%N 5
%R 10.1002/cpe.3306
%0 Conference Paper
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015
%D 2015
%T Towards Batched Linear Solvers on Accelerated Hardware Platforms
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU’s symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to a batched LU factorization featured in NVIDIA’s CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU.
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015
%I ACM
%C San Francisco, CA
%8 02-2015
%G eng
%0 Conference Proceedings
%B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
%D 2015
%T UCX: An Open Source Framework for HPC Network APIs and Beyond
%A P. Shamis
%A M. G. Venkata
%A M. G. Lopez
%A M. B. Baker
%A O. Hernandez
%A Y. Itigin
%A M. Dubman
%A G. Shainer
%A R. L. Graham
%A L. Liss
%A Y. Shahar
%A S. Potluri
%A D. Rossetti
%A D. Becker
%A D. Poole
%A C. Lamb
%A S. Kumar
%A C. Stunkel
%A G. Bosilca
%A A. Bouteiller
%K application program interfaces
%K Bandwidth
%K Electronics packaging
%K Hardware
%K high throughput computing
%K highly-scalable network stack
%K HPC
%K HPC network APIs
%K I/O bound applications
%K Infiniband
%K input-output programs
%K Libraries
%K Memory management
%K message passing
%K message passing interface
%K Middleware
%K MPI
%K open source framework
%K OpenSHMEM
%K parallel programming
%K parallel programming models
%K partitioned global address space languages
%K PGAS
%K PGAS languages
%K Programming
%K protocols
%K public domain software
%K RDMA
%K system libraries
%K task-based paradigms
%K UCX
%K Unified Communication X
%X This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 us, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any network stack (publicly known) on this hardware.
%B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
%I IEEE
%C Santa Clara, CA, USA
%P 40-43
%8 08-2015
%@ 978-1-4673-9160-3
%G eng
%M 15573048
%R 10.1109/HOTI.2015.13
%0 Conference Proceedings
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%D 2015
%T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
%A Azzam Haidar
%A Yulu Jia
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%A Jack Dongarra
%K dataflow scheduling
%K hardware accelerators
%K multi-grain parallelism
%X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%I ACM
%C Austin, TX
%V No. 5
%8 11-2015
%G eng
%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%K factorization
%K parallel linear algebra
%K recursion
%K shared memory synchronization
%K threaded parallelism
%X The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS.
%B Concurrency and Computation: Practice and Experience
%V 26
%P 1408-1431
%8 05-2014
%G eng
%U http://doi.wiley.com/10.1002/cpe.3110
%N 7
%! Concurrency Computat.: Pract. Exper.
%& 1408
%R 10.1002/cpe.3110
%0 Conference Paper
%B International Workshop on OpenCL
%D 2014
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B International Workshop on OpenCL
%C Bristol University, England
%8 05-2014
%G eng
%0 Conference Paper
%B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS)
%D 2014
%T Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems
%A Marc Baboulin
%A Jack Dongarra
%A Remi Lacroix
%X This paper presents an efficient computation for least squares conditioning or estimates of it. We present performance results using new routines on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix for which, to our knowledge, there is no implementation in the current public domain libraries LAPACK and ScaLAPACK.
%B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS)
%C Waterloo, Ontario, CA
%8 08-2014
%G eng
%0 Conference Paper
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%D 2014
%T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K dataflow
%K message-passing
%K multithreading
%K QR decomposition
%K runtime
%K systolic array
%X A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Designing LU-QR Hybrid Solvers for Performance and Stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap as QR steps in terms of operations. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%@ 978-1-4799-3800-1
%G eng
%R 10.1109/IPDPS.2014.108
%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K factorization
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given as the basis for solving linear systems’ algorithms – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. Task execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components.
%B VECPAR 2014
%C Eugene, OR
%8 06-2014
%G eng
%0 Conference Paper
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%D 2014
%T Hybrid Multi-Elimination ILU Preconditioners on GPUs
%A Dimitar Lukarski
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%X Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems.
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2014
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%K decompositional approach
%K dense linear algebra
%K parallel algorithms
%X Over the years, computational physics and chemistry served as an ongoing source of problems that demanded the ever increasing performance from hardware as well as the software that ran on top of it. Most of these problems could be translated into solutions for systems of linear equations: the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We argue that dramatic changes in hardware designs precipitated by the shifting nature of the marketplace of computer hardware had a continuous effect on the software for numerical linear algebra. The extraction of high percentages of peak performance continues to require adaptation of software. If the past history of this adaptive nature of linear algebra software is any guide then the future theme will feature changes as well–changes aimed at harnessing the incredible advances of the evolving hardware infrastructure.
%B Journal of Parallel and Distributed Computing
%V 74
%P 2548–2560
%8 07-2014
%G eng
%N 7
%& 2548
%R 10.1016/j.jpdc.2013.10.005
%0 Conference Paper
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC)
%D 2014
%T LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU
%A Tingxing Dong
%A Azzam Haidar
%A Piotr Luszczek
%A James Harris
%A Stanimire Tomov
%A Jack Dongarra
%X Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of a few large linear systems. The size of each of these small linear systems depends, for example, on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve the numerical stability of the Gaussian Elimination, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it brings in thread divergence and non-coalesced memory accesses. The state-of-the-art libraries for linear algebra that target GPUs, such as MAGMA, focus on large matrix sizes. They change the data layout by transposing the matrix to avoid these divergence and non-coalescing penalties. However, the data movement associated with transposition is very expensive for small matrices. In this paper, we propose a batched LU factorization for GPUs by using a multi-level blocked right looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction network simulation.
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC)
%I IEEE
%C Paris, France
%8 08-2014
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2014
%T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems
%A Jack Dongarra
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%K dense linear algebra
%K hardware accelerators
%K task superscalar scheduling
%X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead has become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly, in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware.
%B Supercomputing Frontiers and Innovations
%V 1
%G eng
%N 1
%R http://dx.doi.org/10.14529/jsfi1401
%0 Conference Paper
%B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper)
%D 2014
%T New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem
%A Azzam Haidar
%A Piotr Luszczek
%A Jack Dongarra
%X We describe a design and implementation of a multi-stage algorithm for computing eigenvectors of a dense symmetric matrix. We show that reformulating the existing algorithms is beneficial in terms of performance even if that doubles the computational complexity. Through detailed analysis, we show that the effect of the increase in the asymptotic operation count may be compensated by a much improved performance rate. Our performance results indicate that using our approach achieves very good speedup and scalability even when directly compared with the existing state-of-the-art software.
%B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper)
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%R 10.1109/IPDPSW.2014.130
%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Optimizing Krylov Subspace Solvers on Graphics Processing Units
%A Stanimire Tomov
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Hartwig Anzt
%A William Sawyer
%X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPU-host communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of high-level sparse linear algebra libraries.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%D 2014
%T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors
%A Azzam Haidar
%A Chongxiao Cao
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%I IEEE
%C New Orleans, LA
%8 11-2014
%G eng
%R 10.1109/ScalA.2014.8
%0 Generic
%D 2014
%T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson.
%B University of Tennessee EECS Technical Report
%I University of Tennessee
%8 11-2014
%G eng
%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures
%A Hartwig Anzt
%A Dimitar Lukarski
%A Stanimire Tomov
%A Jack Dongarra
%X Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this twofold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-colored Gauss-Seidel, ILU(0) and multi-colored ILU(0), shows that the twofold goal of a numerically stable cross-platform performant algorithm is achieved.
%B VECPAR 2014
%C Eugene, OR
%8 06-2014
%G eng
%0 Conference Paper
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%D 2014
%T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes
%A Xavier Lacoste
%A Mathieu Faverge
%A Pierre Ramet
%A Samuel Thibault
%A George Bosilca
%K DAG based runtime
%K gpu
%K Multicore
%K Sparse linear solver
%X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces the application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers (native PaStiX, StarPU, and PaRSEC) on different execution contexts is performed. The analysis highlights the similarities from a performance point of view between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments, and are, therefore, a sustainable solution for hybrid environments.
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment
%A Azzam Haidar
%A Chongxiao Cao
%A Jack Dongarra
%A Piotr Luszczek
%A Stanimire Tomov
%K algorithms
%K Computer science
%K CUDA
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%K nVidia
%K Tesla K20
%K Tesla M2090
%X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors
%A José I. Aliaga
%A Hartwig Anzt
%A Maribel Castillo
%A Juan C. Fernández
%A Germán León
%A Joaquín Pérez
%A Enrique S. Quintana-Ortí
%K CG
%K CPUs
%K energy efficiency
%K GPUs
%K low-power architectures
%X In this paper, we analyze the interactions occurring in the triangle performance-power-energy for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to building faster processors. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights about the benefits of hands-on optimizations, we include runtime and energy efficiency results for both out-of-the-box usage relying exclusively on compiler optimizations, and implementations manually optimized for target architectures that range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 885-904
%8 09-2014
%G eng
%U http://dx.doi.org/10.1002/cpe.3341
%N 4
%& 885
%R 10.1002/cpe.3341
%0 Journal Article
%J The Computer Journal
%D 2013
%T BlackjackBench: Portable Hardware Characterization with Automated Results Analysis
%A Anthony Danalis
%A Piotr Luszczek
%A Gabriel Marin
%A Jeffrey Vetter
%A Jack Dongarra
%K hardware characterization
%K micro-benchmarks
%K statistical analysis
%X DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes the application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, plus statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We aim at hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and non-uniform memory access characteristics of the system, properties of the processing cores affecting the instruction execution speed and the length of the operating system scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features could potentially interfere with each other resulting in incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation of results. We show how effective hardware metrics from our probes allow guided tuning of computational kernels that outperform an autotuning library further tuned by the hardware vendor.
%B The Computer Journal
%8 03-2013
%G eng
%R 10.1093/comjnl/bxt057
%0 Generic
%D 2013
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B University of Tennessee Technical Report (Lawn 275)
%I University of Tennessee
%8 03-2013
%G eng
%0 Conference Proceedings
%B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2013
%T CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience
%A Yulu Jia
%A Piotr Luszczek
%A George Bosilca
%A Jack Dongarra
%X Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur.
%B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C Montpellier, France
%8 11-2013
%G eng
%0 Journal Article
%J Scalable Computing and Communications: Theory and Practice
%D 2013
%T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Piotr Luszczek
%A Jack Dongarra
%E Samee Khan
%E Lin-Wang Wang
%E Albert Zomaya
%B Scalable Computing and Communications: Theory and Practice
%I John Wiley & Sons
%P 699-735
%8 03-2013
%G eng
%0 Generic
%D 2013
%T Designing LU-QR hybrid solvers for performance and stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 282)
%I University of Tennessee
%8 10-2013
%G eng
%0 Journal Article
%J Journal of Supercomputing
%D 2013
%T Enabling Workflows in GridSolve: Request Sequencing and Service Trading
%A Yinan Li
%A Asim YarKhan
%A Jack Dongarra
%A Keith Seymour
%A Aurélie Hurault
%K grid computing
%K gridpac
%K netsolve
%K service trading
%K workflow applications
%X GridSolve employs a RPC-based client-agent-server model for solving computational problems. There are two deficiencies associated with GridSolve when a computational problem essentially forms a workflow consisting of a sequence of tasks with data dependencies between them. First, intermediate results are always passed through the client, resulting in unnecessary data transport. Second, since the execution of each individual task is a separate RPC session, it is difficult to enable any potential parallelism among tasks. This paper presents a request sequencing technique that addresses these deficiencies and enables workflow executions. Building on the request sequencing work, one way to generate workflows is by taking higher level service requests and decomposing them into a sequence of simpler service requests using a technique called service trading. A service trading component is added to GridSolve to take advantage of the new dynamic request sequencing. The features described here include automatic DAG construction and data dependency analysis, direct interserver data transfer, parallel task execution capabilities, and a service trading component.
%B Journal of Supercomputing
%V 64
%P 1133-1152
%8 06-2013
%G eng
%N 3
%& 1133
%R 10.1007/s11227-010-0549-1
%0 Journal Article
%J Parallel Computing
%D 2013
%T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Mathias Jacquelin
%A Julien Langou
%A Yves Robert
%K Cluster
%K Distributed memory
%K Hierarchical architecture
%K multi-core
%K numerical linear algebra
%K QR factorization
%X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, “communication-avoiding”), it is natural to consider hierarchical trees composed of an “inter-node” tree which acts on top of “intra-node” trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) “TS level” for cache-friendliness, (1) “low-level” for decoupled highly parallel inter-node reductions, (2) “domino level” to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
%B Parallel Computing
%V 39
%P 212-232
%8 05-2013
%G eng
%N 4-5
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%K algorithms
%K bidiagonal reduction
%K bulge chasing
%K data translation layer
%K dynamic scheduling
%K high performance kernels
%K performance
%K tile algorithms
%K two-stage approach
%X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%G eng
%N 3
%R 10.1145/2450153.2450154
%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%K exascale
%K hpc challenge
%K hpcc
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%I Taylor and Francis
%C Boca Raton, FL
%@ 978-1-4665-6834-1
%G eng
%& 2
%0 Generic
%D 2013
%T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC
%A Guillaume Aupy
%A Mathieu Faverge
%A Yves Robert
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at a high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
%B Lawn 277
%8 05-2013
%G eng
%0 Generic
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 283)
%I University of Tennessee
%8 10-2013
%G eng
%0 Conference Paper
%B Supercomputing 2013
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%B Supercomputing 2013
%C Denver, CO
%8 11-2013
%G eng
%0 Book Section
%B Handbook of Linear Algebra
%D 2013
%T LAPACK
%A Zhaojun Bai
%A James Demmel
%A Jack Dongarra
%A Julien Langou
%A Jenny Wang
%X With a substantial amount of new material, the Handbook of Linear Algebra, Second Edition provides comprehensive coverage of linear algebra concepts, applications, and computational software packages in an easy-to-use format. It guides you from the very elementary aspects of the subject to the frontiers of current research. Along with revisions and updates throughout, the second edition of this bestseller includes 20 new chapters.
%B Handbook of Linear Algebra
%7 Second
%I CRC Press
%C Boca Raton, FL
%@ 9781466507289
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A José Herrero
%A Julien Langou
%X Four routines called DPOTF3i, i = a,b,c,d, are presented. DPOTF3i are a novel type of level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and LAPACK routine DPOTRF. Performance of routines DPOTF3i is still increasing when the performance of Level-2 routine DPOTF2 of LAPACK starts decreasing. This is our main result and it implies, due to the use of larger block size nb, that DGEMM, DSYRK, and DTRSM performance also increases! The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers. Thus, our four routines have different register blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of LAPACK POTRF source codes. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is “identical” to Square Block Packed Format (SBPF). “LAPACK” implementations on multicore processors use SBPF. Lower BPF is less efficient than upper BPF. Vector inplace transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb as well as results for large n comparing DBPTRF versus DPOTRF.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%8 02-2013
%G eng
%N 2
%R 10.1145/2427023.2427026
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2013
%T LU Factorization with Partial Pivoting for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K accelerator
%K Gaussian elimination
%K gpu
%K lu factorization
%K manycore
%K Multicore
%K partial pivoting
%X LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
%B IEEE Transactions on Parallel and Distributed Systems
%V 24
%P 1613-1621
%8 08-2013
%G eng
%N 8
%& 1613
%R http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242
%0 Journal Article
%J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%D 2013
%T Multithreading in the PLASMA Library
%A Jakub Kurzak
%A Piotr Luszczek
%A Asim YarKhan
%A Mathieu Faverge
%A Julien Langou
%A Henricus Bouwmeester
%A Jack Dongarra
%E Mohamed Ahmed
%E Reda Ammar
%E Sanguthevar Rajasekaran
%B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%I Taylor & Francis
%8 00-2013
%G eng
%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013
%D 2013
%T Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance
%A Yulu Jia
%A George Bosilca
%A Piotr Luszczek
%A Jack Dongarra
%X This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
%B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013
%C Denver, CO
%8 11-2013
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2013)
%D 2013
%T A Parallel Solver for Incompressible Fluid Flows
%A Yushan Wang
%A Marc Baboulin
%A Joël Falcou
%A Yann Fraigneau
%A Olivier Le Maître
%K ADI
%K Navier-Stokes equations
%K Parallel computing
%K Partial diagonalization
%K Prediction-projection
%K SIMD
%X The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlinearity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply the prediction-projection method which transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we apply the Alternating Direction Implicit (ADI) method resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization which transforms the Laplacian operator into a tridiagonal one. We describe an implementation based on MPI where the computations are performed on each subdomain and information is exchanged on the interfaces, and where the tridiagonal system solutions are accelerated using vectorization techniques. We present performance results on a current multicore system.
%B International Conference on Computational Science (ICCS 2013)
%I Elsevier B.V.
%C Barcelona, Spain
%8 06-2013
%G eng
%R 10.1016/j.procs.2013.05.207
%0 Conference Paper
%B PPAM 2013
%D 2013
%T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K mic
%K xeon phi
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B PPAM 2013
%C Warsaw, Poland
%8 09-2013
%G eng
%0 Book Section
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing
%D 2013
%T Scalable Dense Linear Algebra on Heterogeneous Hardware
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. It is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing
%G eng
%0 Journal Article
%J Journal of Computational Science
%D 2013
%T Soft Error Resilient QR Factorization for Hybrid System with GPGPU
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K gpgpu
%K gpu
%K magma
%X The general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. What followed is the fact that fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
%B Journal of Computational Science
%V 4
%P 457–464
%8 11-2013
%G eng
%N 6
%R http://dx.doi.org/10.1016/j.jocs.2013.01.004
%0 Conference Paper
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%D 2013
%T Standards for Graph Algorithm Primitives
%A Tim Mattson
%A David Bader
%A Jon Berry
%A Aydin Buluc
%A Jack Dongarra
%A Christos Faloutsos
%A John Feo
%A John Gilbert
%A Joseph Gonzalez
%A Bruce Hendrickson
%A Jeremy Kepner
%A Charles Leiserson
%A Andrew Lumsdaine
%A David Padua
%A Steve W. Poole
%A Steve Reinhardt
%A Mike Stonebraker
%A Steve Wallach
%A Andrew Yoo
%K algorithms
%K graphs
%K linear algebra
%K software standards
%X It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%I IEEE
%C Waltham, MA
%8 09-2013
%G eng
%R 10.1109/HPEC.2013.6670338
%0 Generic
%D 2013
%T Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures
%A Yulu Jia
%A Piotr Luszczek
%A Jack Dongarra
%X Graphics Processing Units (GPUs) are gaining widespread usage in the field of scientific computing owing to the performance boost GPUs bring to computation-intensive applications. The typical configuration is to integrate GPUs and CPUs in the same system, where the CPUs handle the control flow and part of the computation workload, and the GPUs serve as accelerators that carry out the bulk of the data-parallel compute workload. In this paper we design and implement a soft error resilient Hessenberg reduction algorithm on GPU-based hybrid platforms. Our design employs an algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation. We detect and correct soft errors on-line without delaying the detection and correction to the end of the factorization. By utilizing idle time of the CPUs and overlapping both host-side and GPU-side workloads we minimize the observed overhead. Experimental results validated our design philosophy. Our algorithm introduces less than 2% performance overhead compared to the non-fault-tolerant hybrid Hessenberg reduction algorithm.
%B UT-CS-13-712
%I University of Tennessee Computer Science Technical Report
%8 06-2013
%G eng
%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. A systolic array for the QR decomposition is developed, and a virtualization layer is used for mapping the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 05-2013
%G eng
%R 10.1109/IPDPS.2013.119
%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B University of Tennessee Computer Science Technical Report
%8 07-2013
%G eng
%0 Conference Proceedings
%B 2012 IEEE High Performance Extreme Computing Conference
%D 2012
%T Anatomy of a Globally Recursive Embedded LINPACK Benchmark
%A Piotr Luszczek
%A Jack Dongarra
%X We present a complete bottom-up implementation of an embedded LINPACK benchmark on iPad 2. We use a novel formulation of a recursive LU factorization that is recursive and parallel at the global scope. We believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show an assembly API that allows us a much higher level of abstraction and provides rapid code development within the confines of the mobile device SDK. We use performance modeling to help with the limitations of the device and the limited access to the device from a development environment not geared for HPC application tuning.
%B 2012 IEEE High Performance Extreme Computing Conference
%C Waltham, MA
%P 1-6
%8 09-2012
%@ 978-1-4673-1577-7
%G eng
%R 10.1109/HPEC.2012.6408679
%0 Journal Article
%J IPDPS 2012
%D 2012
%T A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
%A Azzam Haidar
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%B IPDPS 2012
%C Shanghai, China
%8 05-2012
%G eng
%0 Journal Article
%J Parallel Computing
%D 2012
%T DAGuE: A generic distributed DAG Engine for High Performance Computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%E Torsten Hoefler
%K dague
%K parsec
%B Parallel Computing
%I Elsevier
%V 38
%P 27-51
%8 00-2012
%G eng
%0 Journal Article
%J High Performance Scientific Computing: Algorithms and Applications
%D 2012
%T Dense Linear Algebra on Accelerated Multicore Hardware
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%E Michael Berry
%E et al.
%B High Performance Scientific Computing: Algorithms and Applications
%I Springer-Verlag
%C London, UK
%8 00-2012
%G eng
%0 Conference Proceedings
%B The 2nd International Conference on Cloud and Green Computing (submitted)
%D 2012
%T Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture
%A Jack Dongarra
%A Hatem Ltaeif
%A Piotr Luszczek
%A Vincent M Weaver
%B The 2nd International Conference on Cloud and Green Computing (submitted)
%C Xiangtan, Hunan, China
%8 11-2012
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2012
%T Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%B Lecture Notes in Computer Science
%V 7203
%P 661-670
%8 09-2012
%G eng
%0 Journal Article
%J Parallel Computing
%D 2012
%T From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming
%A Peng Du
%A Rick Weber
%A Piotr Luszczek
%A Stanimire Tomov
%A Gregory D. Peterson
%A Jack Dongarra
%B Parallel Computing
%V 38
%P 391-407
%8 08-2012
%G eng
%0 Journal Article
%J EuroPar 2012 (also LAWN 260)
%D 2012
%T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%A Vincent Heuveline
%B EuroPar 2012 (also LAWN 260)
%C Rhodes Island, Greece
%8 08-2012
%G eng
%0 Conference Proceedings
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%D 2012
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%I IEEE Computer Society Press
%C Shanghai, China
%8 05-2012
%G eng
%0 Journal Article
%J ICCS 2012
%D 2012
%T High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%B ICCS 2012
%C Omaha, NE
%8 06-2012
%G eng
%0 Generic
%D 2012
%T How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE
%A Julien Langou
%A Bill Hoffman
%A Brad King
%B University of Tennessee Computer Science Technical Report (also LAWN 270)
%8 07-2012
%G eng
%0 Journal Article
%J On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%D 2012
%T HPC Challenge: Design, History, and Implementation Highlights
%A Jack Dongarra
%A Piotr Luszczek
%E Jeffrey Vetter
%B On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear)
%I Chapman & Hall/CRC Press
%8 00-2012
%G eng
%0 Journal Article
%J Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%D 2012
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E Viktor K. Prasanna
%E Yves Robert
%E Per Stenström
%B Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%8 00-2012
%G eng
%0 Generic
%D 2012
%T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
%A Jack Dongarra
%A Mark Gates
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12)
%C Salt Lake City, UT
%8 11-2012
%G eng
%0 Journal Article
%J Supercomputing '12 (poster)
%D 2012
%T Matrices Over Runtime Systems at Exascale
%A Emmanuel Agullo
%A George Bosilca
%A Cedric Castagnède
%A Jack Dongarra
%A Hatem Ltaeif
%A Stanimire Tomov
%B Supercomputing '12 (poster)
%C Salt Lake City, Utah
%8 11-2012
%G eng
%0 Conference Proceedings
%B International Workshop on Power-Aware Systems and Architectures
%D 2012
%T Measuring Energy and Power with PAPI
%A Vincent M Weaver
%A Matt Johnson
%A Kiran Kasichayanula
%A James Ralph
%A Piotr Luszczek
%A Dan Terpstra
%A Shirley Moore
%K papi
%B International Workshop on Power-Aware Systems and Architectures
%C Pittsburgh, PA
%8 09-2012
%G eng
%0 Journal Article
%J VECPAR 2012
%D 2012
%T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
%A Ahmad Abdelfattah
%A Jack Dongarra
%A David Keyes
%A Hatem Ltaeif
%B VECPAR 2012
%C Kobe, Japan
%8 07-2012
%G eng
%0 Journal Article
%J SAAHPC '12 (Best Paper Award)
%D 2012
%T Power Aware Computing on GPUs
%A Kiran Kasichayanula
%A Dan Terpstra
%A Piotr Luszczek
%A Stanimire Tomov
%A Shirley Moore
%A Gregory D. Peterson
%K magma
%B SAAHPC '12 (Best Paper Award)
%C Argonne, IL
%8 07-2012
%G eng
%0 Conference Proceedings
%B Third International Conference on Energy-Aware High Performance Computing
%D 2012
%T Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
%A George Bosilca
%A Jack Dongarra
%A Hatem Ltaeif
%B Third International Conference on Energy-Aware High Performance Computing
%C Hamburg, Germany
%8 09-2012
%G eng
%0 Journal Article
%J LAWN 267
%D 2012
%T Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B LAWN 267
%8 00-2012
%G eng
%0 Conference Proceedings
%B Proceedings of VECPAR’12
%D 2012
%T Programming the LU Factorization for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Mathieu Faverge
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of VECPAR’12
%C Kobe, Japan
%8 04-2012
%G eng
%0 Journal Article
%J SIAM Journal on Scientific Computing (Accepted)
%D 2012
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%B SIAM Journal on Scientific Computing (Accepted)
%8 07-2012
%G eng
%0 Generic
%D 2011
%T Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report (also as a LAWN)
%8 09-2011
%G eng
%0 Conference Proceedings
%B The Twentieth International Conference on Domain Decomposition Methods
%D 2011
%T Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Stephane Lanteri
%A Jean Roman
%B The Twentieth International Conference on Domain Decomposition Methods
%C La Jolla, California
%8 02-2011
%G eng
%U http://hal.inria.fr/inria-00577639
%0 Generic
%D 2011
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaeif
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243)
%8 00-2011
%G eng
%0 Journal Article
%J TeraGrid'11
%D 2011
%T Autotuned Parallel I/O for Highly Scalable Biosequence Analysis
%A Haihang You
%A Bhanu Rekapalli
%A Qing Liu
%A Shirley Moore
%B TeraGrid'11
%C Salt Lake City, Utah
%8 07-2011
%G eng
%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results
%A Anthony Danalis
%A Piotr Luszczek
%A Gabriel Marin
%A Jeffrey Vetter
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 05-2011
%G eng
%0 Journal Article
%J in Solving the Schrodinger Equation: Has everything been tried? (to appear)
%D 2011
%T Changes in Dense Linear Algebra Kernels - Decades Long Perspective
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E P. Popelier
%B in Solving the Schrodinger Equation: Has everything been tried? (to appear)
%I Imperial College Press
%8 00-2011
%G eng
%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T DAGuE: A Generic Distributed DAG Engine for High Performance Computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%K dague
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1151-1158
%8 00-2011
%G eng
%0 Conference Proceedings
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%D 2011
%T The Design of an Auto-tuning I/O Framework on Cray XT5 System
%A Haihang You
%A Qing Liu
%A Zhiqiang Li
%A Shirley Moore
%K gco
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%C Fairbanks, Alaska
%8 05-2011
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2011
%T Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems
%A Charles Lively
%A Xingfu Wu
%A Valerie Taylor
%A Shirley Moore
%A Hung-Ching Chang
%A Kirk Cameron
%K mumi
%B International Journal of High Performance Computing Applications
%V 25
%P 342-350
%8 00-2011
%G eng
%0 Conference Proceedings
%B 6th Workshop on Virtualization in High-Performance Cloud Computing
%D 2011
%T Evaluation of the HPC Challenge Benchmarks in Virtualized Environments
%A Piotr Luszczek
%A Eric Meek
%A Shirley Moore
%A Dan Terpstra
%A Vincent M Weaver
%A Jack Dongarra
%K hpcc
%B 6th Workshop on Virtualization in High-Performance Cloud Computing
%C Bordeaux, France
%8 08-2011
%G eng
%0 Conference Proceedings
%B Proceedings of PARCO'11
%D 2011
%T Exploiting Fine-Grain Parallelism in Recursive LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%K plasma
%B Proceedings of PARCO'11
%C Gent, Belgium
%8 04-2011
%G eng
%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 05-2011
%G eng
%0 Generic
%D 2011
%T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%B University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260)
%8 12-2011
%G eng
%0 Generic
%D 2011
%T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report (also Lawn 257)
%8 10-2011
%G eng
%0 Generic
%D 2011
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247)
%8 05-2011
%G eng
%0 Journal Article
%J IEEE Cluster 2011
%D 2011
%T High Performance Dense Linear System Solver with Soft Error Resilience
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%K ft-la
%B IEEE Cluster 2011
%C Austin, TX
%8 09-2011
%G eng
%0 Conference Proceedings
%B Proceedings of MTAGS11
%D 2011
%T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%B Proceedings of MTAGS11
%C Seattle, WA
%8 11-2011
%G eng
%0 Journal Article
%J in GPU Computing Gems, Jade Edition
%D 2011
%T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaeif
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%E Wen-mei W. Hwu
%K magma
%K morse
%B in GPU Computing Gems, Jade Edition
%I Elsevier
%V 2
%P 473-484
%8 00-2011
%G eng
%0 Journal Article
%J IEEE/ACS AICCSA 2011
%D 2011
%T LU Factorization for Accelerator-Based Systems
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Julien Langou
%A Hatem Ltaeif
%A Stanimire Tomov
%K magma
%K morse
%B IEEE/ACS AICCSA 2011
%C Sharm-El-Sheikh, Egypt
%8 12-2011
%G eng
%0 Conference Proceedings
%B International Conference on Parallel Processing (ICPP'11)
%D 2011
%T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
%A Allen Maloney
%A Scott Biersdorff
%A Sameer Shende
%A Heike Jagode
%A Stanimire Tomov
%A Guido Juckeland
%A Robert Dietrich
%A Duncan Poole
%A Christopher Lamb
%K magma
%K mumi
%K papi
%B International Conference on Parallel Processing (ICPP'11)
%C Taipei, Taiwan
%8 09-2011
%G eng
%0 Conference Proceedings
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%C Seattle, WA
%8 11-2011
%G eng
%0 Generic
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-11-677, (also Lawn 254)
%8 08-2011
%G eng
%0 Journal Article
%J IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC)
%D 2011
%T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemariner
%A Narapat Ohm Saengpatsa
%A Stanimire Tomov
%A Jack Dongarra
%K dague
%K magma
%K parsec
%B IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC)
%8 06-2011
%G eng
%0 Conference Proceedings
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%D 2011
%T Power-Aware Prediction Models of Hybrid (MPI/OpenMP) Scientific Applications
%A Charles Lively
%A Xingfu Wu
%A Valerie Taylor
%A Shirley Moore
%A Hung-Ching Chang
%A Chun-Yi Su
%A Kirk Cameron
%K mumi
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%C Hamburg, Germany
%8 09-2011
%G eng
%0 Conference Proceedings
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%D 2011
%T Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%K mumi
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%C Hamburg, Germany
%8 09-2011
%G eng
%0 Journal Article
%J Future Generation Computer Systems
%D 2011
%T QCG-OMPI: MPI Applications on Grids
%A Emmanuel Agullo
%A Camille Coti
%A Thomas Herault
%A Julien Langou
%A Sylvain Peyronnet
%A A. Rezmerita
%A Franck Cappello
%A Jack Dongarra
%B Future Generation Computer Systems
%V 27
%P 357-369
%8 01-2011
%G eng
%0 Conference Proceedings
%B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011
%D 2011
%T Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure
%A George Bosilca
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%A A. Rezmerita
%E Yiannis Cotronis
%E Anthony Danalis
%E Dimitrios S. Nikolopoulos
%E Jack Dongarra
%K ftmpi
%B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011
%I Springer
%C Santorini, Greece
%V 6960
%P 342-344
%8 09-2011
%G eng
%0 Journal Article
%J UT-CS-11-675 (also LAPACK Working Note #252)
%D 2011
%T Soft Error Resilient QR Factorization for Hybrid System
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B UT-CS-11-675 (also LAPACK Working Note #252)
%8 07-2011
%G eng
%0 Generic
%D 2011
%T Soft Error Resilient QR Factorization for Hybrid System
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K ft-la
%B University of Tennessee Computer Science Technical Report
%C Knoxville, TN
%8 07-2011
%G eng
%0 Journal Article
%J Journal of Computational Science
%D 2011
%T Soft Error Resilient QR Factorization for Hybrid System with GPGPU
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K ft-la
%B Journal of Computational Science
%I Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11
%C Seattle, WA
%8 11-2011
%G eng
%0 Journal Article
%J Submitted to SIAM Journal on Scientific Computing (SISC)
%D 2011
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%B Submitted to SIAM Journal on Scientific Computing (SISC)
%8 00-2011
%G eng
%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures
%A Piotr Luszczek
%A Hatem Ltaeif
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 05-2011
%G eng
%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemariner
%A Narapat Ohm Saengpatsa
%A Stanimire Tomov
%A Jack Dongarra
%K dague
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 05-2011
%G eng
%0 Journal Article
%J Submitted to Concurrency and Computation: Practice and Experience
%D 2010
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaeif
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B Submitted to Concurrency and Computation: Practice and Experience
%8 11-2010
%G eng
%0 Generic
%D 2010
%T Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess
%A Piotr Luszczek
%A Jack Dongarra
%K hpcc
%B Innovative Computing Laboratory (ICL) Technical Report
%8 06-2010
%G eng
%0 Journal Article
%J Parallel Computing (to appear)
%D 2010
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%B Parallel Computing (to appear)
%8 00-2010
%G eng
%0 Journal Article
%J Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale
%D 2010
%T Constructing Resilient Communication Infrastructure for Runtime Environments in Advances in Parallel Computing
%A George Bosilca
%A Camille Coti
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%E Barbara Chapman
%E Frederic Desprez
%E Gerhard R. Joubert
%E Alain Lichnewsky
%E Frans Peters
%E T. Priol
%B Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale
%V 19
%P 441-451
%G eng
%R 10.3233/978-1-60750-530-3-441
%0 Generic
%D 2010
%T DAGuE: A generic distributed DAG engine for high performance computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%K dague
%B Innovative Computing Laboratory Technical Report
%8 04-2010
%G eng
%0 Conference Proceedings
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%A Rajib Nath
%A Hatem Ltaeif
%A Jack Dongarra
%X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%C Atlanta, GA
%P 1-8
%G eng
%R 10.1109/IPDPSW.2010.5470941
%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 09-2010
%G eng
%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 00-2010
%G eng
%0 Conference Proceedings
%B Proceedings of EuroMPI 2010
%D 2010
%T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%E Jack Dongarra
%E Michael Resch
%E Rainer Keller
%E Edgar Gabriel
%K ftmpi
%B Proceedings of EuroMPI 2010
%I Springer
%C Stuttgart, Germany
%8 09-2010
%G eng
%0 Journal Article
%J in Performance Tuning of Scientific Applications (to appear)
%D 2010
%T Empirical Performance Tuning of Dense Linear Algebra Software
%A Jack Dongarra
%A Shirley Moore
%E David Bailey
%E Robert Lucas
%E Sam Williams
%B in Performance Tuning of Scientific Applications (to appear)
%8 00-2010
%G eng
%0 Generic
%D 2010
%T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaeif
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%B LAPACK Working Note
%8 00-2010
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (submitted)
%D 2010
%T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators
%A Hatem Ltaeif
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%K plasma
%B IEEE Transactions on Parallel and Distributed Systems (submitted)
%8 03-2010
%G eng
%0 Journal Article
%J Sparse Days 2010 Meeting at CERFACS
%D 2010
%T MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%A Yohan Lee-Tin-Yien
%B Sparse Days 2010 Meeting at CERFACS
%C Toulouse, France
%8 06-2010
%G eng
%0 Conference Proceedings
%B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010)
%D 2010
%T Mixed-Tool Performance Analysis on Hybrid Multicore Architectures
%A Peng Du
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010)
%C San Diego, CA
%8 09-2010
%G eng
%0 Conference Proceedings
%B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10)
%D 2010
%T OpenCL Evaluation for Numerical Linear Algebra Library Development
%A Peng Du
%A Piotr Luszczek
%A Jack Dongarra
%K magma
%B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10)
%C Knoxville, TN
%8 07-2010
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2010
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%P 417-423
%8 04-2010
%G eng
%0 Conference Proceedings
%B Proceedings of the Cray Users' Group Meeting
%D 2010
%T Performance Evaluation for Petascale Quantum Simulation Tools
%A Stanimire Tomov
%A Wenchang Lu
%A Jerzy Bernholc
%A Shirley Moore
%A Jack Dongarra
%B Proceedings of the Cray Users' Group Meeting
%C Atlanta, GA
%8 05-2010
%G eng
%0 Journal Article
%J Future Generation Computer Systems
%D 2010
%T QCG-OMPI: MPI Applications on Grids
%A Emmanuel Agullo
%A Camille Coti
%A Thomas Herault
%A Julien Langou
%A Sylvain Peyronnet
%A A. Rezmerita
%A Franck Cappello
%A Jack Dongarra
%B Future Generation Computer Systems
%V 27
%P 357-369
%8 03-2010
%G eng
%0 Conference Proceedings
%B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224)
%D 2010
%T QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
%A Emmanuel Agullo
%A Camille Coti
%A Jack Dongarra
%A Thomas Herault
%A Julien Langou
%B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224)
%C Atlanta, GA
%8 04-2010
%G eng
%0 Conference Proceedings
%B Proceedings of IPDPS 2011
%D 2010
%T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%K plasma
%B Proceedings of IPDPS 2011
%C Anchorage, AK
%8 10-2010
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2010
%T Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution, and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A Julien Langou
%B ACM Transactions on Mathematical Software (TOMS)
%C Atlanta, GA
%V 37
%8 04-2010
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2010
%T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A Julien Langou
%B ACM Transactions on Mathematical Software (TOMS)
%V 37
%8 04-2010
%G eng
%0 Generic
%D 2010
%T Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling
%A Jack Dongarra
%A Piotr Luszczek
%K hpcc
%B University of Tennessee Computer Science Technical Report
%8 10-2010
%G eng
%0 Journal Article
%J PARA 2010
%D 2010
%T Scalability Study of a Quantum Simulation Code
%A Jerzy Bernholc
%A Miroslav Hodak
%A Wenchang Lu
%A Shirley Moore
%A Stanimire Tomov
%B PARA 2010
%C Reykjavik, Iceland
%8 06-2010
%G eng
%0 Journal Article
%J Proc. of VECPAR'10 (to appear)
%D 2010
%T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
%A Hatem Ltaeif
%A Stanimire Tomov
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%K magma
%K plasma
%B Proc. of VECPAR'10 (to appear)
%C Berkeley, CA
%8 06-2010
%G eng
%0 Journal Article
%J SC'10
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaeif
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B SC'10
%I ACM SIGARCH/ IEEE Computer Society
%C New Orleans, LA
%8 11-2010
%G eng
%0 Generic
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaeif
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%V UT-CS-10-653
%8 04-2010
%G eng
%0 Generic
%D 2010
%T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaeif
%A Raymond Namyst
%A Rajib Nath
%A Jean Roman
%A Samuel Thibault
%A Stanimire Tomov
%I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster
%C Knoxville, TN
%8 07-2010
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2010
%T Scheduling Dense Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%K gridpac
%K plasma
%B Concurrency and Computation: Practice and Experience
%V 22
%P 15-44
%8 01-2010
%G eng
%0 Journal Article
%J Journal of Scientific Computing
%D 2010
%T Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Journal of Scientific Computing
%V 18
%P 33-50
%8 00-2010
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience (to appear)
%D 2010
%T SmartGridRPC: The new RPC model for high performance Grid Computing and Its Implementation in SmartGridSolve
%A Thomas Brady
%A Alexey Lastovetsky
%A Keith Seymour
%A Michele Guidolin
%A Jack Dongarra
%K netsolve
%B Concurrency and Computation: Practice and Experience (to appear)
%8 01-2010
%G eng
%0 Journal Article
%J PGI Insider
%D 2010
%T Using MAGMA with PGI Fortran
%A Stanimire Tomov
%A Mathieu Faverge
%A Piotr Luszczek
%A Jack Dongarra
%K magma
%B PGI Insider
%8 11-2010
%G eng
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2009
%T Algorithmic Based Fault Tolerance Applied to High Performance Computing
%A Jack Dongarra
%A George Bosilca
%A Remi Delmas
%A Julien Langou
%B Journal of Parallel and Distributed Computing
%V 69
%P 410-416
%8 00-2009
%G eng
%0 Journal Article
%J Parallel Computing
%D 2009
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B Parallel Computing
%V 35
%P 38-53
%8 00-2009
%G eng
%0 Conference Proceedings
%B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear)
%D 2009
%T Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware
%A Emmanuel Agullo
%A Bilel Hadri
%A Hatem Ltaeif
%A Jack Dongarra
%B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear)
%8 00-2009
%G eng
%0 Journal Article
%J Numerical Linear Algebra with Applications
%D 2009
%T Computing the Conditioning of the Components of a Linear Least-squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B Numerical Linear Algebra with Applications
%V 16
%P 517-533
%8 00-2009
%G eng
%0 Generic
%D 2009
%T Constructing resilient communication infrastructure for runtime environments
%A George Bosilca
%A Camille Coti
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%8 07-2009
%G eng
%0 Journal Article
%J ParCo 2009
%D 2009
%T Constructing Resilient Communication Infrastructure for Runtime Environments
%A Pierre Lemariner
%A George Bosilca
%A Camille Coti
%A Thomas Herault
%A Jack Dongarra
%B ParCo 2009
%C Lyon, France
%8 09-2009
%G eng
%0 Journal Article
%J PPAM 2009
%D 2009
%T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B PPAM 2009
%C Poland
%8 09-2009
%G eng
%0 Journal Article
%J Submitted to Transactions on Parallel and Distributed Systems
%D 2009
%T Enhancing Parallelism of Tile QR Factorization for Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaeif
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B Submitted to Transactions on Parallel and Distributed Systems
%8 12-2009
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2009
%T The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community
%A Jack Dongarra
%A Pete Beckman
%A Patrick Aerts
%A Franck Cappello
%A Thomas Lippert
%A Satoshi Matsuoka
%A Paul Messina
%A Terry Moore
%A Rick Stevens
%A Anne Trefethen
%A Mateo Valero
%B International Journal of High Performance Computing Applications (to appear)
%8 07-2009
%G eng
%0 Conference Proceedings
%B 9th International Conference on Computational Science (ICCS 2009)
%D 2009
%T A Note on Auto-tuning GEMM for GPUs
%A Yinan Li
%A Jack Dongarra
%A Stanimire Tomov
%E Gabrielle Allen
%E Jarosław Nabrzyski
%E E. Seidel
%E Geert Dick van Albada
%E Jack Dongarra
%E Peter M. Sloot
%B 9th International Conference on Computational Science (ICCS 2009)
%C Baton Rouge, LA
%P 884-892
%8 05-2009
%G eng
%R 10.1007/978-3-642-01970-8_89
%0 Conference Proceedings
%B Journal of Physics: Conference Series
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K plasma
%B Journal of Physics: Conference Series
%V 180
%8 00-2009
%G eng
%0 Generic
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Rajib Nath
%A Stanimire Tomov
%A Asim YarKhan
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, OR
%8 11-2009
%G eng
%0 Generic
%D 2009
%T Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project
%A Rajib Nath
%A Jack Dongarra
%A Stanimire Tomov
%A Hatem Ltaeif
%A Peng Du
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, Oregon
%8 11-2009
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (to appear)
%D 2009
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems (to appear)
%8 05-2009
%G eng
%0 Journal Article
%J in Cyberinfrastructure Technologies and Applications
%D 2009
%T Parallel Dense Linear Algebra Software in the Multicore Era
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%E Junwei Cao
%K plasma
%B in Cyberinfrastructure Technologies and Applications
%I Nova Science Publishers, Inc.
%P 9-24
%8 00-2009
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2009
%T Parallel Programming in MATLAB
%A Piotr Luszczek
%K lfc
%K plasma
%B The International Journal of High Performance Computing Applications
%V 23
%P 277-283
%8 07-2009
%G eng
%0 Conference Proceedings
%B Proceedings of CUG09
%D 2009
%T Performance evaluation for petascale quantum simulation tools
%A Stanimire Tomov
%A Wenchang Lu
%A Jerzy Bernholc
%A Shirley Moore
%A Jack Dongarra
%K doe-nano
%B Proceedings of CUG09
%C Atlanta, GA
%8 05-2009
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2009
%T The Problem with the Linpack Benchmark Matrix Generator
%A Julien Langou
%A Jack Dongarra
%K hpl
%B International Journal of High Performance Computing Applications
%V 23
%P 5-14
%8 00-2009
%G eng
%0 Journal Article
%J ACM TOMS (to appear)
%D 2009
%T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
%A Fred G. Gustavson
%A Jerzy Wasniewski
%A Jack Dongarra
%A Julien Langou
%B ACM TOMS (to appear)
%8 00-2009
%G eng
%0 Journal Article
%J in Handbook of Research on Scalable Computing Technologies (to appear)
%D 2009
%T Reliability and Performance Modeling and Analysis for Grid Computing
%A Yuan-Shun Dai
%A Jack Dongarra
%E Kuan-Ching Li
%E Ching-Hsien Hsu
%E Laurence Yang
%E Jack Dongarra
%E Hans Zima
%B in Handbook of Research on Scalable Computing Technologies (to appear)
%I IGI Global
%P 219-245
%8 00-2009
%G eng
%0 Generic
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213)
%8 00-2009
%G eng
%0 Journal Article
%J Concurrency Practice and Experience (to appear)
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Concurrency Practice and Experience (to appear)
%8 00-2009
%G eng
%0 Generic
%D 2009
%T Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaeif
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B Innovative Computing Laboratory Technical Report (also LAPACK Working Note 222 and CS Tech Report UT-CS-09-645)
%8 09-2009
%G eng
%0 Conference Proceedings
%B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
%D 2009
%T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaeif
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
%C Atlanta, GA
%8 12-2009
%G eng
%0 Journal Article
%J 15th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science
%D 2008
%E Alexey Lastovetsky
%E Tahar Kechadi
%E Jack Dongarra
%B 15th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science
%I Springer Berlin
%C Dublin, Ireland
%V 5205
%8 01-2008
%G eng
%0 Generic
%D 2008
%T Algorithmic Based Fault Tolerance Applied to High Performance Computing
%A George Bosilca
%A Remi Delmas
%A Jack Dongarra
%A Julien Langou
%B University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205)
%8 01-2008
%G eng
%0 Journal Article
%J VECPAR '08, High Performance Computing for Computational Science
%D 2008
%T Computing the Conditioning of the Components of a Linear Least Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B VECPAR '08, High Performance Computing for Computational Science
%C Toulouse, France
%8 01-2008
%G eng
%0 Journal Article
%J in Advances in Computers
%D 2008
%T DARPA's HPCS Program: History, Models, Tools, Languages
%A Jack Dongarra
%A Robert Graybill
%A William Harrod
%A Robert Lucas
%A Ewing Lusk
%A Piotr Luszczek
%A Janice McMahon
%A Allan Snavely
%A Jeffrey Vetter
%A Katherine Yelick
%A Sadaf Alam
%A Roy Campbell
%A Laura Carrington
%A Tzu-Yi Chen
%A Omid Khalili
%A Jeremy Meredith
%A Mustafa Tikir
%E M. Zelkowitz
%B in Advances in Computers
%I Elsevier
%V 72
%8 01-2008
%G eng
%0 Journal Article
%J in High Performance Computing and Grids in Action
%D 2008
%T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%E Lucio Grandinetti
%B in High Performance Computing and Grids in Action
%I IOS Press
%C Amsterdam
%8 01-2008
%G eng
%0 Journal Article
%J in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14)
%D 2008
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E G. Wilson
%B in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14)
%P 243-282
%8 01-2008
%G eng
%0 Generic
%D 2008
%T HPCS Library Study Effort
%A Jack Dongarra
%A James Demmel
%A Parry Husbands
%A Piotr Luszczek
%B University of Tennessee Computer Science Technical Report, UT-CS-08-617
%8 01-2008
%G eng
%0 Conference Proceedings
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%D 2008
%T Interior State Computation of Nano Structures
%A Andrew Canning
%A Jack Dongarra
%A Julien Langou
%A Osni Marques
%A Stanimire Tomov
%A Christof Voemel
%A Lin-Wang Wang
%B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing
%C Trondheim, Norway
%8 05-2008
%G eng
%0 Journal Article
%J Concurrency: Practice and Experience
%D 2008
%T The LINPACK Benchmark: Past, Present, and Future
%A Jack Dongarra
%A Piotr Luszczek
%A Antoine Petitet
%K hpl
%B Concurrency: Practice and Experience
%V 15
%P 803-820
%8 00-2008
%G eng
%0 Generic
%D 2008
%T Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208)
%8 08-2008
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2008
%T Parallel Tiled QR Factorization for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 20
%P 1573-1590
%8 01-2008
%G eng
%0 Generic
%D 2008
%T The PlayStation 3 for High Performance Scientific Computing
%A Jakub Kurzak
%A Alfredo Buttari
%A Piotr Luszczek
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 01-2008
%G eng
%0 Journal Article
%J Computing in Science and Engineering
%D 2008
%T The PlayStation 3 for High Performance Scientific Computing
%A Jakub Kurzak
%A Alfredo Buttari
%A Piotr Luszczek
%A Jack Dongarra
%B Computing in Science and Engineering
%P 80-83
%8 01-2008
%G eng
%0 Generic
%D 2008
%T The Problem with the Linpack Benchmark Matrix Generator
%A Jack Dongarra
%A Julien Langou
%B University of Tennessee Computer Science Technical Report, UT-CS-08-621 (also LAPACK Working Note 206)
%8 06-2008
%G eng
%0 Generic
%D 2008
%T Request Sequencing: Enabling Workflow for Efficient Parallel Problem Solving in GridSolve
%A Yinan Li
%A Jack Dongarra
%K netsolve
%B ICL Technical Report
%8 04-2008
%G eng
%0 Conference Proceedings
%B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted)
%D 2008
%T Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve
%A Yinan Li
%A Jack Dongarra
%A Keith Seymour
%A Asim YarKhan
%B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted)
%C Shenzhen, China
%8 10-2008
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2008
%T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%K plasma
%B ACM Transactions on Mathematical Software
%V 34
%P 17-22
%8 00-2008
%G eng
%0 Generic
%D 2007
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%8 01-2007
%G eng
%0 Generic
%D 2007
%T Computing the Conditioning of the Components of a Linear Least Squares Solution
%A Marc Baboulin
%A Jack Dongarra
%A Serge Gratton
%A Julien Langou
%B University of Tennessee Computer Science Technical Report
%8 01-2007
%G eng
%0 Journal Article
%J in Petascale Computing: Algorithms and Applications (to appear)
%D 2007
%T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach
%A Jack Dongarra
%A Zizhong Chen
%A George Bosilca
%A Julien Langou
%B in Petascale Computing: Algorithms and Applications (to appear)
%I Chapman & Hall - CRC Press
%8 00-2007
%G eng
%0 Journal Article
%J In High Performance Computing and Grids in Action (to appear)
%D 2007
%T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Julie Langou
%A Piotr Luszczek
%A Stanimire Tomov
%E Lucio Grandinetti
%B In High Performance Computing and Grids in Action (to appear)
%I IOS Press
%C Amsterdam
%8 00-2007
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2007
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Jack Dongarra
%A Piotr Luszczek
%B International Journal of High Performance Computing Applications
%V 21
%P 360-369
%8 00-2007
%G eng
%0 Journal Article
%J in Beautiful Code: Leading Programmers Explain How They Think
%D 2007
%T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination
%A Jack Dongarra
%A Piotr Luszczek
%E Andy Oram
%E Greg Wilson
%B in Beautiful Code: Leading Programmers Explain How They Think
%I O'Reilly Media, Inc.
%8 06-2007
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2007
%T Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
%A Alfredo Buttari
%A Jack Dongarra
%A Julien Langou
%A Julie Langou
%A Piotr Luszczek
%A Jakub Kurzak
%B International Journal of High Performance Computing Applications (to appear)
%8 08-2007
%G eng
%0 Book Section
%B Distributed and Parallel Systems
%D 2007
%T A New Approach to MPI Collective Communication Implementations
%A Torsten Hoefler
%A Jeffrey M. Squyres
%A Graham Fagg
%A George Bosilca
%A Wolfgang Rehm
%A Andrew Lumsdaine
%K Automatic Selection
%K Collective Operation
%K Framework
%K Message Passing (MPI)
%K Open MPI
%X Recent research into the optimization of collective MPI operations has resulted in a wide variety of algorithms and corresponding implementations, each typically only applicable in a relatively narrow scope: on a specific architecture, on a specific network, with a specific number of processes, with a specific data size and/or data-type – or any combination of these (or other) factors. This situation presents an enormous challenge to portable MPI implementations which are expected to provide optimized collective operation performance on all platforms. Many portable implementations have attempted to provide a token number of algorithms that are intended to realize good performance on most systems. However, many platform configurations are still left without well-tuned collective operations. This paper presents a proposal for a framework that will allow a wide variety of collective algorithm implementations and a flexible, multi-tiered selection process for choosing which implementation to use when an application invokes an MPI collective function.
%B Distributed and Parallel Systems
%I Springer US
%P 45-54
%@ 978-0-387-69857-1
%G eng
%R 10.1007/978-0-387-69858-8_5
%0 Generic
%D 2007
%T Parallel Tiled QR Factorization for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-598 (also LAPACK Working Note 190)
%8 00-2007
%G eng
%0 Journal Article
%J SIAM SISC (to appear)
%D 2007
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A Julien Langou
%A Zizhong Chen
%A George Bosilca
%A Jack Dongarra
%B SIAM SISC (to appear)
%8 05-2007
%G eng
%0 Generic
%D 2007
%T SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3
%A Alfredo Buttari
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%A George Bosilca
%K multi-core
%B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595
%8 00-2007
%G eng
%0 Journal Article
%J International Journal of Computational Science and Engineering
%D 2006
%T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Jack Dongarra
%A Andrew Canning
%A Lin-Wang Wang
%B International Journal of Computational Science and Engineering
%V 2
%P 205-212
%8 00-2006
%G eng
%0 Journal Article
%J University of Tennessee Computer Science Tech Report
%D 2006
%T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%K iter-ref
%B University of Tennessee Computer Science Tech Report
%8 04-2006
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (to appear)
%D 2006
%T High Performance Development for High End Computing with Python Language Wrapper (PLW)
%A Piotr Luszczek
%K hpcc
%K lfc
%B International Journal of High Performance Computing Applications (to appear)
%8 00-2006
%G eng
%0 Journal Article
%J HeteroPar 2006
%D 2006
%T A High-Performance, Heterogeneous MPI
%A Richard L. Graham
%A Galen M. Shipman
%A Brian Barrett
%A Ralph Castain
%A George Bosilca
%A Andrew Lumsdaine
%B HeteroPar 2006
%C Barcelona, Spain
%8 09-2006
%G eng
%0 Conference Proceedings
%B SC06 Conference Tutorial
%D 2006
%T The HPC Challenge (HPCC) Benchmark Suite
%A Piotr Luszczek
%A David Bailey
%A Jack Dongarra
%A Jeremy Kepner
%A Robert Lucas
%A Rolf Rabenseifner
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%B SC06 Conference Tutorial
%I IEEE
%C Tampa, Florida
%8 11-2006
%G eng
%0 Journal Article
%J PARA 2006
%D 2006
%T The Impact of Multicore on Math Software
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%K plasma
%B PARA 2006
%C Umea, Sweden
%8 06-2006
%G eng
%0 Conference Proceedings
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%D 2006
%T Performance evaluation of eigensolvers in nano-structure computations
%A Andrew Canning
%A Jack Dongarra
%A Julien Langou
%A Osni Marques
%A Stanimire Tomov
%A Christof Voemel
%A Lin-Wang Wang
%K doe-nano
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%8 01-2006
%G eng
%0 Journal Article
%J J. Phys.: Conf. Ser. 46
%D 2006
%T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures
%A Alex Zunger
%A Alberto Franceschetti
%A Gabriel Bester
%A Wesley B. Jones
%A Kwiseon Kim
%A Peter A. Graf
%A Lin-Wang Wang
%A Andrew Canning
%A Osni Marques
%A Christof Voemel
%A Jack Dongarra
%A Julien Langou
%A Stanimire Tomov
%K DOE_NANO
%B J. Phys.: Conf. Ser. 46
%V 46
%P 292-298
%8 01-2006
%G eng
%R 10.1088/1742-6596/46/1/040
%0 Journal Article
%J PARA 2006
%D 2006
%T Prospectus for the Next LAPACK and ScaLAPACK Libraries
%A James Demmel
%A Jack Dongarra
%A B. Parlett
%A William Kahan
%A Ming Gu
%A David Bindel
%A Yozo Hida
%A Xiaoye Li
%A Osni Marques
%A Jason E. Riedy
%A Christof Voemel
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Julie Langou
%A Stanimire Tomov
%B PARA 2006
%C Umea, Sweden
%8 06-2006
%G eng
%0 Journal Article
%J IBM Journal of Research and Development
%D 2006
%T Self Adapting Numerical Software SANS Effort
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Victor Eijkhout
%A Graham Fagg
%A Erika Fuentes
%A Julien Langou
%A Piotr Luszczek
%A Jelena Pjesivac-Grbovic
%A Keith Seymour
%A Haihang You
%A Sathish Vadhiyar
%K gco
%B IBM Journal of Research and Development
%V 50
%P 223-238
%8 01-2006
%G eng
%0 Conference Proceedings
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%D 2006
%T Towards bulk based preconditioning for quantum dot computations
%A Andrew Canning
%A Jack Dongarra
%A Julien Langou
%A Osni Marques
%A Stanimire Tomov
%A Christof Voemel
%A Lin-Wang Wang
%K doe-nano
%B IEEE/ACM Proceedings of HPCNano SC06 (to appear)
%8 01-2006
%G eng
%0 Conference Proceedings
%B Proceedings of 5th International Conference on Computational Science (ICCS)
%D 2005
%T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%E V. S. Sunderam
%E Geert Dick van Albada
%E Peter M. Sloot
%E Jack Dongarra
%K doe-nano
%B Proceedings of 5th International Conference on Computational Science (ICCS)
%I Springer's Lecture Notes in Computer Science
%C Atlanta, GA, USA
%P 317-325
%8 01-2005
%G eng
%0 Journal Article
%J International Journal of Computational Science and Engineering (to appear)
%D 2005
%T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures
%A Stanimire Tomov
%A Julien Langou
%A Andrew Canning
%A Lin-Wang Wang
%A Jack Dongarra
%B International Journal of Computational Science and Engineering (to appear)
%8 01-2005
%G eng
%0 Conference Proceedings
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%D 2005
%T Fault Tolerant High Performance Computing by a Coding Approach
%A Zizhong Chen
%A Graham Fagg
%A Edgar Gabriel
%A Julien Langou
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%C Chicago, Illinois
%8 01-2005
%G eng
%0 Conference Proceedings
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%D 2005
%T Hash Functions for Datatype Signatures in MPI
%A George Bosilca
%A Jack Dongarra
%A Graham Fagg
%A Julien Langou
%E Beniamino Di Martino
%K ftmpi
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%I Springer-Verlag Berlin
%C Sorrento (Naples), Italy
%V 3666
%P 76-83
%8 09-2005
%G eng
%0 Journal Article
%J SC|05 Tutorial - S13
%D 2005
%T HPC Challenge v1.x Benchmark Suite
%A Piotr Luszczek
%A David Koester
%K hpcc
%B SC|05 Tutorial - S13
%C Seattle, Washington
%8 01-2005
%G eng
%0 Journal Article
%D 2005
%T Introduction to the HPC Challenge Benchmark Suite
%A Piotr Luszczek
%A Jack Dongarra
%A David Koester
%A Rolf Rabenseifner
%A Robert Lucas
%A Jeremy Kepner
%A John McCalpin
%A David Bailey
%A Daisuke Takahashi
%K hpcc
%K hpcchallenge
%8 03-2005
%G eng
%0 Generic
%D 2005
%T Introduction to the HPCChallenge Benchmark Suite
%A Jack Dongarra
%A Piotr Luszczek
%K hpcc
%K hpcchallenge
%B ICL Technical Report
%8 01-2005
%G eng
%0 Journal Article
%J International Journal of Parallel Programming
%D 2005
%T New Grid Scheduling and Rescheduling Methods in the GrADS Project
%A Francine Berman
%A Henri Casanova
%A Andrew Chien
%A Keith Cooper
%A Holly Dail
%A Anshuman Dasgupta
%A Wei Deng
%A Jack Dongarra
%A Lennart Johnsson
%A Ken Kennedy
%A Charles Koelbel
%A Bo Liu
%A Xu Liu
%A Anirban Mandal
%A Gabriel Marin
%A Mark Mazina
%A John Mellor-Crummey
%A Celso Mendes
%A A. Olugbile
%A Jignesh M. Patel
%A Dan Reed
%A Zhiao Shi
%A Otto Sievert
%A H. Xia
%A Asim YarKhan
%K grads
%B International Journal of Parallel Programming
%I Springer
%V 33
%P 209-229
%8 06-2005
%G eng
%0 Journal Article
%J Journal of Computational Acoustics (to appear)
%D 2005
%T On the Parallel Solution of Large Industrial Wave Propagation Problems
%A Luc Giraud
%A Julien Langou
%A G. Sylvand
%B Journal of Computational Acoustics (to appear)
%8 01-2005
%G eng
%0 Generic
%D 2005
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Julien Langou
%K ft-la
%B University of Tennessee Computer Science Department Technical Report, UT-CS-04-538
%8 00-2005
%G eng
%0 Generic
%D 2005
%T Remote Software Toolkit Installer
%A Eric Meek
%A Jeff Larkin
%A Jack Dongarra
%K rest
%B ICL Technical Report
%8 06-2005
%G eng
%0 Journal Article
%J Numerische Mathematik
%D 2005
%T Rounding Error Analysis of the Classical Gram-Schmidt Orthogonalization Process
%A Luc Giraud
%A Julien Langou
%A Miroslav Rozložník
%A Jasper van den Eshof
%B Numerische Mathematik
%V 101
%P 87-100
%8 01-2005
%G eng
%0 Journal Article
%J Oak Ridge National Laboratory Report
%D 2004
%T Cray X1 Evaluation Status Report
%A Pratul Agarwal
%A R. A. Alexander
%A E. Apra
%A Satish Balay
%A Arthur S. Bland
%A James Colgan
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Tom Dunigan
%A Mark Fahey
%A Al Geist
%A M. Gordon
%A Robert Harrison
%A Dinesh Kaushik
%A M. Krishnakumar
%A Piotr Luszczek
%A Tony Mezzacappa
%A Jeff Nichols
%A Jarek Nieplocha
%A Leonid Oliker
%A T. Packwood
%A M. Pindzola
%A Thomas C. Schulthess
%A Jeffrey Vetter
%A James B. White
%A T. Windus
%A Patrick H. Worley
%A Thomas Zacharia
%B Oak Ridge National Laboratory Report
%V ORNL/TM-2004/13
%8 01-2004
%G eng
%0 Conference Proceedings
%B International Conference on Computational Science
%D 2004
%T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations
%A Piotr Luszczek
%A Jack Dongarra
%E Marian Bubak
%E Geert Dick van Albada
%E Peter M. Sloot
%E Jack Dongarra
%K lacsi
%K lfc
%B International Conference on Computational Science
%I Springer Verlag
%C Poland
%8 06-2004
%G eng
%R 10.1007/978-3-540-25944-2_35
%0 Conference Proceedings
%B Proceedings of ISC2004 (to appear)
%D 2004
%T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems
%A Graham Fagg
%A Edgar Gabriel
%A George Bosilca
%A Thara Angskun
%A Zizhong Chen
%A Jelena Pjesivac-Grbovic
%A Kevin London
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Proceedings of ISC2004 (to appear)
%C Heidelberg, Germany
%8 06-2004
%G eng
%0 Conference Proceedings
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04)
%D 2004
%T LAPACK for Clusters Project: An Example of Self Adapting Numerical Software
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04)
%C Big Island, Hawaii
%V 9
%P 90282
%8 01-2004
%G eng
%0 Journal Article
%J Engineering the Grid (to appear)
%D 2004
%T An Overview of Heterogeneous High Performance and Grid Computing
%A Jack Dongarra
%A Alexey Lastovetsky
%E Beniamino Di Martino
%E Jack Dongarra
%E Adolfy Hoisie
%E Laurence Yang
%E Hans Zima
%B Engineering the Grid (to appear)
%I Nova Science Publishers, Inc.
%8 00-2004
%G eng
%0 Generic
%D 2004
%T Performance Optimization and Modeling of Blocked Sparse Kernels
%A Alfredo Buttari
%A Victor Eijkhout
%A Julien Langou
%A Salvatore Filippone
%K sans
%B ICL Technical Report
%8 00-2004
%G eng
%0 Generic
%D 2004
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Julien Langou
%B ICL Technical Report
%8 01-2004
%G eng
%0 Conference Proceedings
%B PADTAD Workshop, IPDPS 2003
%D 2003
%T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%A Haihang You
%A Min Zhou
%K lacsi
%K papi
%B PADTAD Workshop, IPDPS 2003
%C Nice, France
%8 04-2003
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Recent Advances in Parallel Virtual Machine and Message Passing Interface
%A Jack Dongarra
%A Domenico Laforenza
%A S. Orlando
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%V 2840
%8 01-2003
%G eng
%0 Conference Proceedings
%B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles
%D 2003
%T Scalable, Trustworthy Network Computing Using Untrusted Intermediaries: A Position Paper
%A Micah Beck
%A Jack Dongarra
%A Victor Eijkhout
%A Mike Langston
%A Terry Moore
%A James Plank
%K netsolve
%B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles
%C National Conference Center - Lansdowne, Virginia
%8 03-2003
%G eng
%0 Journal Article
%J Resource Management in the Grid
%D 2003
%T Scheduling in the Grid Application Development Software Project
%A Holly Dail
%A Otto Sievert
%A Francine Berman
%A Henri Casanova
%A Asim YarKhan
%A Sathish Vadhiyar
%A Jack Dongarra
%A Chuang Liu
%A Lingyun Yang
%A Dave Angulo
%A Ian Foster
%K grads
%B Resource Management in the Grid
%I Kluwer Publishers
%8 03-2003
%G eng
%0 Generic
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160)
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%B University of Tennessee Computer Science Technical Report, UT-CS-03-499
%8 01-2003
%G eng
%0 Journal Article
%J Parallel Computing
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%K sans
%B Parallel Computing
%V 29
%P 1723-1743
%8 11-2003
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T VisPerf: Monitoring Tool for Grid Computing
%A DongWoo Lee
%A Jack Dongarra
%E R. S. Ramakrishna
%K netsolve
%B Lecture Notes in Computer Science
%I Springer Verlag, Heidelberg
%V 2659
%P 233-243
%8 00-2003
%G eng
%0 Generic
%D 2002
%T GridRPC: A Remote Procedure Call API for Grid Computing
%A Keith Seymour
%A Hidemoto Nakada
%A Satoshi Matsuoka
%A Jack Dongarra
%A Craig Lee
%A Henri Casanova
%B ICL Technical Report
%8 11-2002
%G eng
%0 Conference Proceedings
%B Proceedings of the Third International Workshop on Grid Computing
%D 2002
%T Overview of GridRPC: A Remote Procedure Call API for Grid Computing
%A Keith Seymour
%A Hidemoto Nakada
%A Satoshi Matsuoka
%A Jack Dongarra
%A Craig Lee
%A Henri Casanova
%E Manish Parashar
%B Proceedings of the Third International Workshop on Grid Computing
%P 274-278
%8 01-2002
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2002
%T An Updated Set of Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B ACM Transactions on Mathematical Software
%V 28
%P 135-151
%8 12-2002
%G eng
%R 10.1145/567806.567807
%0 Journal Article
%J (an update), submitted to ACM TOMS
%D 2001
%T Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B (an update), submitted to ACM TOMS
%8 02-2001
%G eng
%0 Conference Proceedings
%B International Conference on Parallel and Distributed Computing Systems
%D 2001
%T End-user Tools for Application Performance Analysis, Using Hardware Counters
%A Kevin London
%A Jack Dongarra
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A T. Spencer
%K papi
%B International Conference on Parallel and Distributed Computing Systems
%C Dallas, TX
%8 08-2001
%G eng
%0 Generic
%D 2001
%T Internet Backplane Protocol - Test Language v. 1.0
%A Alessandro Bassi
%A Xiaoye Li
%B University of Tennessee Computer Science Technical Report
%8 01-2001
%G eng
%0 Conference Proceedings
%B Department of Defense Users' Group Conference Proceedings
%D 2001
%T The PAPI Cross-Platform Interface to Hardware Performance Counters
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A Richard Luczak
%K papi
%B Department of Defense Users' Group Conference Proceedings
%C Biloxi, Mississippi
%8 06-2001
%G eng
%0 Journal Article
%J Scientific Programming
%D 2001
%T Recursive Approach in Sparse Matrix LU Factorization
%A Jack Dongarra
%A Victor Eijkhout
%A Piotr Luszczek
%B Scientific Programming
%V 9
%P 51-60
%8 00-2001
%G eng
%0 Journal Article
%J 8th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science 2131
%D 2001
%T Review of Performance Analysis Tools for MPI Parallel Programs
%A Shirley Moore
%A David Cronk
%A Kevin London
%A Jack Dongarra
%K papi
%B 8th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science 2131
%I Springer Verlag, Berlin
%C Greece
%P 241-248
%8 09-2001
%G eng
%0 Conference Proceedings
%B Conference on Linux Clusters: The HPC Revolution
%D 2001
%T Using PAPI for Hardware Performance Monitoring on Linux Systems
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%K papi
%B Conference on Linux Clusters: The HPC Revolution
%I Linux Clusters Institute
%C Urbana, Illinois
%8 06-2001
%G eng
%0 Generic
%D 2000
%T A Portable Programming Interface for Performance Evaluation on Modern Processors
%A Shirley Browne
%A Jack Dongarra
%A Nathan Garner
%A Kevin London
%A Phil Mucci
%B University of Tennessee Computer Science Technical Report, UT-CS-00-444
%8 07-2000
%G eng
%0 Journal Article
%J ASTC-HPC 2000
%D 2000
%T Providing Infrastructure and Interface to High Performance Applications in a Distributed Setting
%A Dorian Arnold
%A Wonsuck Lee
%A Jack Dongarra
%A Mary Wheeler
%B ASTC-HPC 2000
%C Washington, DC
%8 04-2000
%G eng
%0 Conference Proceedings
%B Proceedings of 1st SGI Users Conference
%D 2000
%T Recursive approach in sparse matrix LU factorization
%A Jack Dongarra
%A Victor Eijkhout
%A Piotr Luszczek
%B Proceedings of 1st SGI Users Conference
%C Cracow, Poland (ACC Cyfronet UMM, 2000)
%P 409-418
%8 01-2000
%G eng
%0 Conference Proceedings
%B Proceedings of SuperComputing 2000 (SC'00)
%D 2000
%T A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters
%A Shirley Browne
%A Jack Dongarra
%A Nathan Garner
%A Kevin London
%A Phil Mucci
%K papi
%B Proceedings of SuperComputing 2000 (SC'00)
%C Dallas, TX
%8 11-2000
%G eng