%0 Journal Article
%J Journal of Chemical Theory and Computation
%D 2023
%T Direct Determination of Optimal Real-Space Orbitals for Correlated Electronic Structure of Molecules
%A Valeev, Edward F.
%A Harrison, Robert J.
%A Holmes, Adam A.
%A Peterson, Charles C.
%A Penchoff, Deborah A.
%X We demonstrate how to determine numerically nearly exact orthonormal orbitals that are optimal for the evaluation of the energy of arbitrary (correlated) states of atoms and molecules by minimization of the energy Lagrangian. Orbitals are expressed in real space using a multiresolution spectral element basis that is refined adaptively to achieve the user-specified target precision while avoiding the ill-conditioning issues that plague AO basis set expansions traditionally used for correlated models of molecular electronic structure. For light atoms, the orbital solver, in conjunction with a variational electronic structure model [selected Configuration Interaction (CI)] provides energies of comparable precision to a state-of-the-art atomic CI solver. The computed electronic energies of atoms and molecules are significantly more accurate than the counterparts obtained with the orbital sets of the same rank expanded in Gaussian AO bases, and can be determined even when linear dependence issues preclude the use of the AO bases. It is feasible to optimize more than 100 fully correlated numerical orbitals on a single computer node, and significant room exists for additional improvement. These findings suggest that real-space orbital representations might be the preferred alternative to AO representations for high-end models of correlated electronic states of molecules and materials.
%B Journal of Chemical Theory and Computation
%V 19
%P 7230 - 7241
%8 2023-10
%G eng
%U https://pubs.acs.org/doi/10.1021/acs.jctc.3c00732
%N 20
%! J. Chem. Theory Comput.
%R 10.1021/acs.jctc.3c00732

%0 Conference Paper
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%D 2023
%T O(N) distributed direct factorization of structured dense matrices using runtime systems
%A Sameer Deshmukh
%A Rio Yokota
%A George Bosilca
%A Qinxiang Ma
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%I ACM
%C Salt Lake City, Utah
%8 2023-08
%@ 9798400708435
%G eng
%U https://dl.acm.org/doi/proceedings/10.1145/3605573
%R 10.1145/3605573.3605606

%0 Conference Proceedings
%B 2022 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2022
%T Deep Gaussian process with multitask and transfer learning for performance optimization
%A Sid-Lakhdar, Wissam M.
%A Aznaveh, Mohsen
%A Luszczek, Piotr
%A Dongarra, Jack
%X We combine Deep Gaussian Processes with multitask and transfer learning for the performance modeling and optimization of HPC applications. Deep Gaussian processes merge the uncertainty quantification advantage of Gaussian Processes with the predictive power of deep learning. Multitask and transfer learning allow for improved learning efficiency when several similar tasks are to be learned simultaneously and when previous learned models are sought to help in the learning of new tasks, respectively. A comparison with state-of-the-art autotuners shows the advantage of our approach on two application problems.
%B 2022 IEEE High Performance Extreme Computing Conference (HPEC)
%P 1-7
%8 2022-09
%G eng
%R 10.1109/HPEC55821.2022.9926396

%0 Conference Paper
%B 35th IEEE International Parallel &  Distributed Processing Symposium (IPDPS 2021)
%D 2021
%T Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure
%A Thomas Herault
%A Yves Robert
%A George Bosilca
%A Robert Harrison
%A Cannada Lewis
%A Edward Valeev
%A Jack Dongarra
%K block-sparse matrix multiplication
%K distributed-memory
%K Electronic structure
%K multi-GPU node
%K parsec
%K tensor contraction
%X Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-tosolution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
%B 35th IEEE International Parallel &  Distributed Processing Symposium (IPDPS 2021)
%I IEEE
%C Portland, OR
%8 2021-05
%G eng
%U https://hal.inria.fr/hal-02970659/document

%0 Generic
%D 2021
%T DTE: PaRSEC Enabled Libraries and Applications
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%I 2021 Exascale Computing Project Annual Meeting
%8 2021-04
%G eng

%0 Journal Article
%J Int. J. of Networking and Computing
%D 2021
%T Dynamic DAG scheduling under memory constraints for shared-memory platforms
%A Gabriel Bathie
%A Loris Marchal
%A Yves Robert
%A Samuel Thibault
%B Int. J. of Networking and Computing
%V 11
%P 27-49
%G eng

%0 Conference Paper
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
%D 2020
%T DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
%A Bogdan Nicolae
%A Jiali Li
%A Justin M. Wozniak
%A George Bosilca
%A Matthieu Dorier
%A Franck Cappello
%X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) shows significant improvement over state-of-art, both in terms of checkpointing duration and runtime overhead.
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
%I IEEE
%C Melbourne, VIC, Australia
%8 2020-05
%G eng
%R https://doi.org/10.1109/CCGrid49817.2020.00-76

%0 Conference Paper
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%D 2020
%T Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs
%A Anne Benoit
%A Valentin Le Fèvre
%A Padma Raghavan
%A Yves Robert
%A Hongyang Sun
%B 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020)
%I IEEE Computer Society Press
%C New Orleans, LA
%8 2020-05
%G eng

%0 Generic
%D 2020
%T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%K AMD GPUs
%K GPU computing
%K HIP Runtime
%K HPC
%K numerical linear algebra
%K Portability
%X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the HeterogeneousComputing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-08
%G eng

%0 Conference Paper
%B 2020 IEEE High Performance Extreme Computing Virtual Conference
%D 2020
%T Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
%A Cade Brown
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the HeterogeneousComputing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.
%B 2020 IEEE High Performance Extreme Computing Virtual Conference
%I IEEE
%8 2020-09
%G eng

%0 Conference Paper
%B Computer Modeling and Intelligent Systems CMIS-2020
%D 2020
%T Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK
%A Dmitry Zaitsev
%A Piotr Luszczek
%K docker containers
%K software containers
%X Platform as a Service (PaaS) cloud computing model becomes wide- spread implemented within Docker Containers. Docker uses operating system level virtualization to deliver software in packages called containers. Containers are isolated from one another and comprise all the required software, including operating system API, libraries and configuration files. With such advantageous integrity one can doubt on Docker performance. The present paper applies packet LAPACK, which is widely used for performance benchmarks of super- computers, to collect and compare benchmarks of Docker on Linux Ubuntu and MS Windows platforms. After a brief overview of Docker and LAPACK, a se- ries of Docker images containing LAPACK is created and run, abundant benchmarks obtained and represented in tabular and graphical form. From the final discussion, we conclude that Docker runs with nearly the same perfor- mance on both Linux and Windows platforms, the slowdown does not exceed some ten percent. Though Docker performance in Windows is essentially lim- ited by the amount of RAM allocated to Docker Engine.
%B Computer Modeling and Intelligent Systems CMIS-2020
%C Zaporizhzhoa
%8 2020-03
%G eng

%0 Generic
%D 2020
%T DTE: PaRSEC Enabled Libraries and Applications (Poster)
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T DTE: PaRSEC Systems and Interfaces (Poster)
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Conference Paper
%B 5th EAI International Conference on Smart Objects and Technologies for Social Good
%D 2019
%T Data Logistics: Toolkit and Applications
%A Micah Beck
%A Terry Moore
%A Nancy French
%A Erza Kissel
%A Martin Swany
%B 5th EAI International Conference on Smart Objects and Technologies for Social Good
%C Valencia, Spain
%8 2019-09
%G eng

%0 Generic
%D 2019
%T Design and Implementation for FFT-ECP on Distributed Accelerated Systems
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Daniel Schultz
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng
%9 ECP WBS 2.3.3.09 Milestone Report

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Distributed-Memory Lattice H-Matrix Factorization
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Rio Yokota
%A Jack Dongarra
%X We parallelize the LU factorization of a hierarchical low-rank matrix (ℋ-matrix) on a distributed-memory computer. This is much more difficult than the ℋ-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the ℋ-matrix factorization. In this work, we propose to factorize the matrix using a “lattice ℋ-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the ℋ-matrix format. These blocks stored in the ℋ-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of ℋ-matrix factorization. We first compare factorization performances using the ℋ-matrix, BLR, and lattice ℋ-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the ℋ-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice ℋ-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1046–1063
%8 2019-08
%G eng
%N 5
%R https://doi.org/10.1177/1094342019861139

%0 Generic
%D 2019
%T Does your tool support PAPI SDEs yet?
%A Anthony Danalis
%A Heike Jagode
%A Jack Dongarra
%I 13th Scalable Tools Workshop
%C Tahoe City, CA
%8 2019-07
%G eng

%0 Generic
%D 2018
%T Data Movement Interfaces to Support Dataflow Runtimes
%A Aurelien Bouteiller
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%X This document presents the design study and reports on the implementation of a portable hosted accelerator device support in the PaRSEC Dataflow Tasking at Exascale runtime, undertaken as part of the ECP contract 17-SC-20-SC. The document discusses different technological approaches to transfer data to/from hosted accelerators, issues recommendations for technology providers, and presents the design of an OpenMP-based accelerator support in PaRSEC.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-05
%G eng

%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2018)
%D 2018
%T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Panruo Wu
%A Srikara Pranesh
%A Stanimire Tomov
%A Jack Dongarra
%X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both the hardware features and algorithms is an effective solution to achieve power efficiency, and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most of the energy efficiency approaches aim to reduce the consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency. Compared to highly-optimized linear system solvers, our kernels deliver the same accuracy solution up to   2×  faster and reduce the energy consumption up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to   4× , with more than 80% reduction in the energy consumption.
%B International Conference on Computational Science (ICCS 2018)
%I Springer
%C Wuxi, China
%V 10860
%P 586–600
%8 2018-06
%G eng
%U https://rdcu.be/bcKSC
%R https://doi.org/10.1007/978-3-319-93698-7_45

%0 Generic
%D 2018
%T Distributed Termination Detection for HPC Task-Based Environments
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Valentin Le Fèvre
%A Yves Robert
%A Jack Dongarra
%X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng

%0 Conference Paper
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%D 2018
%T Do moldable applications perform better on failure-prone HPC platforms?
%A Valentin Le Fèvre
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Atsushi Hori
%A Yves Robert
%A Jack Dongarra
%X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage.
%B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids
%S LNCS
%I Springer Verlag
%C Turin, Italy
%8 2018-08
%G eng

%0 Generic
%D 2017
%T Dataflow Programming Paradigms for Computational Chemistry Methods
%A Heike Jagode
%X The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability.      This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster methods (CC), which are the "gold standard" to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains.      Two programming paradigms for expressing CC methods into a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, is the programming model where the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, where a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%C Knoxville, TN
%8 2017-05
%U http://trace.tennessee.edu/utk_graddiss/4469/
%9 PhD Dissertation (Computer Science)

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Pedro Valero-Lara
%A Mawussi Zounon
%K Batched BLAS
%K BLAS
%K High-performance computing
%K Memory management
%K Parallel processing
%K Scientific computing
%X A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems, that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their affect on the performance. In particular we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications our code outperforms both MKL1 CuBLAS2 by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
%B International Conference on Computational Science (ICCS 2017)
%I Elsevier
%C Zürich, Switzerland
%8 2017-06
%G eng
%R DOI:10.1016/j.procs.2017.05.138

%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%1 03

%0 Conference Proceedings
%B ScalA17
%D 2017
%T Dynamic Task Discovery in PaRSEC- A data-flow task-based Runtime
%A Reazul Hoque
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%K data-flow
%K dynamic task-graph
%K parsec
%K task-based runtime
%X Successfully exploiting distributed collections of heterogeneous many-cores architectures with complex memory hierarchy through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intranode features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute to both local and distributed data dependencies management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD) in terms of capabilities, overhead and potential benefits.
%B ScalA17
%I ACM
%C Denver
%8 2017-09
%@ 978-1-4503-5125-6
%G eng
%U https://dl.acm.org/citation.cfm?doid=3148226.3148233
%R 10.1145/3148226.3148233

%0 Book Section
%B Lecture Notes in Computer Science
%D 2016
%T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%E Roman Wyrzykowski
%E Ewa Deelman
%E Konrad Karczewski
%E Jacek Kitowski
%E Kazimierz Wiatr
%K Communication-avoiding
%K Dense symmetric indefinite factorization
%K gpu computation
%K randomization
%X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen’s algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with the preprocessing technique based on Random Butterfly Transformations. Though such transformations only have probabilistic results on the numerical stability, they avoid the pivoting and obtain a great performance on the GPU.
%B Lecture Notes in Computer Science
%S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I
%I Springer International Publishing
%V 9573
%P 86-95
%8 2015-09
%@ 978-3-319-32149-3
%G eng
%& Parallel Processing and Applied Mathematics
%R 10.1007/978-3-319-32149-3_9

%0 Conference Paper
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%D 2016
%T On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K GPUs
%K variable small sizes
%X <p>  Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such assumption. Consequently, an efficient solution for variable-size batched computations is required.  </p>  <p>  This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problems is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. According to the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2:5 against two Sandy Bridge CPUs (8-core each) running Intel MKL library.  </p>
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Proceedings
%B Software for Exascale Computing - SPPEXA
%D 2016
%T Domain Overlap for Iterative Sparse Triangular Solves on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Daniel Szyld
%A Jack Dongarra
%E Hans-Joachim Bungartz
%E Philipp Neumann
%E Wolfgang E. Nagel
%X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
%B Software for Exascale Computing - SPPEXA
%S Lecture Notes in Computer Science and Engineering
%I Springer International Publishing
%V 113
%P 527–545
%8 2016-09
%G eng
%R 10.1007/978-3-319-40528-5_24

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T A Data Flow Divide and Conquer Algorithm for Multicore Architecture
%A Azzam Haidar
%A Jakub Kurzak
%A Gregoire Pichon
%A Mathieu Faverge
%K Eigensolver
%K lapack
%K Multicore
%K plasma
%K task-based programming
%X Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-elements computation for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to play with tasks granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Generic
%D 2015
%T On the Design, Autotuning, and Optimization of GPU Kernels for Kinetic Network Simulations Using Fast Explicit Integration and GPU Batched Computation
%A Michael Guidry
%A Azzam Haidar
%I Joint Institute for Computational Sciences Seminar Series, Presentation
%C Oak Ridge, TN
%8 2015-09
%G eng

%0 Conference Paper
%B ISC High Performance 2015
%D 2015
%T On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X The dramatic change in computer architecture due to the manycore paradigm shift, made the development of numerical routines that are optimal extremely challenging. In this work, we target the development of numerical algorithms and implementations for Xeon Phi coprocessor architecture designs. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are some of the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach on how to address the challenges for this problem, starting from our algorithm design, performance analysis and programing model, to kernel optimization. Our goal, by targeting low-level, easy to understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere for the use on manycore coprocessors, and to show significant performance improvements compared to the existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we finally present the significance of using these routines/strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD) that by themselves are foundational for many important applications.
%B ISC High Performance 2015
%C Frankfurt, Germany
%8 2015-07
%G eng

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T Design for a Soft Error Resilient Dynamic Task-based Runtime
%A Chongxiao Cao
%A George Bosilca
%A Thomas Herault
%A Jack Dongarra
%X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2014
%T Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C New Orleans, LA
%8 2014-11
%G eng

%0 Conference Paper
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%D 2014
%T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K dataflow
%K message-passing
%K multithreading
%K QR decomposition
%K runtime
%K systolic array
%X A systolic array provides an alternative computing paradigm to the von Neuman architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Generic
%D 2014
%T Design for a Soft Error Resilient Dynamic Task-based Runtime
%A Chongxiao Cao
%A Thomas Herault
%A George Bosilca
%A Jack Dongarra
%X Abstract—As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
%B ICL Technical Report
%I University of Tennessee
%8 2014-11
%G eng

%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Designing LU-QR Hybrid Solvers for Performance and Stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%K plasma
%X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap in terms of operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the Parsec software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%@ 978-1-4799-3800-1
%G eng
%R 10.1109/IPDPS.2014.108

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%D 2014
%T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Sivasankaran Rajamanickam
%A Eric G. Boman
%A Mark Hoemmen
%A Michael A. Heroux
%A Stanimire Tomov
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng

%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer’s characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on a 24 cores AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU up to 2.4x compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%8 2014-05
%G eng

%0 Journal Article
%J Scalable Computing and Communications: Theory and Practice
%D 2013
%T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Piotr Luszczek
%A Jack Dongarra
%E Samee Khan
%E Lin-Wang Wang
%E Albert Zomaya
%B Scalable Computing and Communications: Theory and Practice
%I John Wiley & Sons
%P 699-735
%8 2013-03
%G eng

%0 Generic
%D 2013
%T Designing LU-QR hybrid solvers for performance and stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 282)
%I University of Tennessee
%8 2013-10
%G eng

%0 Conference Paper
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%D 2013
%T Diagnosis and Optimization of Application Prefetching Performance
%A Gabriel Marin
%A Colin McCurdy
%A Jeffrey Vetter
%E Allen D. Malony
%E Nemirovsky, Mario
%E Midkiff, Sam
%X Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, hardware prefetchers can track only a limited number of data streams due to finite hardware resources. In this paper, we introduce the term <em>streaming concurrency</em> to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for understanding the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we try to understand the causes behind poor prefetching performance. We identified four prefetch unfriendly conditions and we show how to classify an application's memory references based on these conditions. We evaluated our analysis using the SPEC CPU2006 benchmark suite. We selected two benchmarks with unfavorable access patterns and transformed them to improve their prefetching effectiveness. Results show that making applications more prefetcher friendly can yield meaningful performance gains.
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%I ACM Press
%C Eugene, Oregon, USA
%8 2013-06
%@ 9781450321303
%G eng
%U http://dl.acm.org/citation.cfm?doid=2464996.2465014
%R 10.1145/2464996.2465014

%0 Generic
%D 2013
%T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU computing approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on high-end hybrid CPU/GPU systems show that our dynamically balanced synchronization-avoiding LU is both multicore and GPU scalable. Comparisons with state-of-the-art libraries like MKL (for multicore) and MAGMA (for hybrid systems) are provided, demonstrating significant performance improvements. The approach is applicable to other linear algebra algorithms. The scheduling mechanisms and tuning models can be incorporated into respectively dynamic runtime systems/schedulers and autotuning frameworks for hybrid CPU/MIC/GPU architectures.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Journal Article
%J Parallel Computing
%D 2012
%T DAGuE: A generic distributed DAG Engine for High Performance Computing.
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%K dague
%K parsec
%B Parallel Computing
%I Elsevier
%V 38
%P 27-51
%8 2012-00
%G eng

%0 Journal Article
%J High Performance Scientific Computing: Algorithms and Applications
%D 2012
%T Dense Linear Algebra on Accelerated Multicore Hardware
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%E Michael Berry
%E et al.,
%B High Performance Scientific Computing: Algorithms and Applications
%I Springer-Verlag
%C London, UK
%8 2012-00
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2012
%T Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems
%A Christof Voemel
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B SIAM Journal on Scientific Computing
%V 34(2)
%P C70-C82
%8 2012-04
%G eng

%0 Generic
%D 2012
%T Dynamic Task Execution on Shared and Distributed Memory Architectures
%A Asim YarKhan
%X Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. But, standard programming techniques are not well adapted to this change in computer architecture design.      In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines.      This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution.      We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-ofthe-art commercial and research libraries.      We develop a runtime environment for distributed memory multicore architectures extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment.
%9 Dissertation

%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T DAGuE: A Generic Distributed DAG Engine for High Performance Computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%K dague
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1151-1158
%8 2011-00
%G eng

%0 Conference Proceedings
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%D 2011
%T The Design of an Auto-tuning I/O Framework on Cray XT5 System
%A Haihang You
%A Qing Liu
%A Zhiqiang Li
%A Shirley Moore
%K gco
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%C Fairbanks, Alaska
%8 2011-05
%G eng

%0 Generic
%D 2010
%T DAGuE: A generic distributed DAG engine for high performance computing
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%K dague
%B Innovative Computing Laboratory Technical Report
%8 2010-04
%G eng

%0 Book Section
%B Scientific Computing with Multicore and Accelerators
%D 2010
%T Dense Linear Algebra for Hybrid GPU-based Systems
%A Stanimire Tomov
%A Jack Dongarra
%B Scientific Computing with Multicore and Accelerators
%S Chapman & Hall/CRC Computational Science
%I CRC Press
%C Boca Raton, Florida
%@ 9781439825365
%G eng
%& 3

%0 Generic
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%I International Parallel and Distributed Processing Symposium (IPDPS 2010)
%C Atlanta, GA
%8 2010-04
%G eng

%0 Conference Proceedings
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%A Rajib Nath
%A Hatem Ltaeif
%A Jack Dongarra
%X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%C Atlanta, GA
%P 1-8
%G eng
%R 10.1109/IPDPSW.2010.5470941

%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 2010-09
%G eng

%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 2010-00
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing (submitted)
%D 2010
%T Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems
%A Christof Voemel
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B SIAM Journal on Scientific Computing (submitted)
%8 2010-08
%G eng

%0 Conference Proceedings
%B Proceedings of EuroMPI 2010
%D 2010
%T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%A Pierre Lemariner
%A Jack Dongarra
%E Jack Dongarra
%E Michael Resch
%E Rainer Keller
%E Edgar Gabriel
%K ftmpi
%B Proceedings of EuroMPI 2010
%I Springer
%C Stuttgart, Germany
%8 2010-09
%G eng

%0 Journal Article
%J PPAM 2009
%D 2009
%T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B PPAM 2009
%C Poland
%8 2009-09
%G eng

%0 Conference Proceedings
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%D 2009
%T Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems
%A Fengguang Song
%A Asim YarKhan
%A Jack Dongarra
%K mumi
%K plasma
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%C Portland, OR
%8 2009-11
%G eng

%0 Journal Article
%J in Advances in Computers
%D 2008
%T DARPA's HPCS Program: History, Models, Tools, Languages
%A Jack Dongarra
%A Robert Graybill
%A William Harrod
%A Robert Lucas
%A Ewing Lusk
%A Piotr Luszczek
%A Janice McMahon
%A Allan Snavely
%A Jeffrey Vetter
%A Katherine Yelick
%A Sadaf Alam
%A Roy Campbell
%A Laura Carrington
%A Tzu-Yi Chen
%A Omid Khalili
%A Jeremy Meredith
%A Mustafa Tikir
%E M. Zelkowitz
%B in Advances in Computers
%I Elsevier
%V 72
%8 2008-01
%G eng

%0 Conference Proceedings
%B Proceedings of the 2008 International Conference on Computational Science (ICCS 2008)
%D 2008
%T Detection and Analysis of Iterative Behavior in Parallel Applications
%A Karl Fürlinger
%A Shirley Moore
%K point
%B Proceedings of the 2008 International Conference on Computational Science (ICCS 2008)
%C Krakow, Poland
%V 5103
%P 261-267
%8 2008-01
%G eng

%0 Journal Article
%J Euro-Par 2007
%D 2007
%T Decision Trees and MPI Collective Algorithm Selection Problem
%A Jelena Pjesivac–Grbovic
%A George Bosilca
%A Graham Fagg
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B Euro-Par 2007
%I Springer
%C Rennes, France
%P 105–115
%8 2007-08
%G eng

%0 Journal Article
%J in Petascale Computing: Algorithms and Applications (to appear)
%D 2007
%T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach
%A Jack Dongarra
%A Zizhong Chen
%A George Bosilca
%A Julien Langou
%B in Petascale Computing: Algorithms and Applications (to appear)
%I Chapman & Hall - CRC Press
%8 2007-00
%G eng

%0 Conference Proceedings
%B Proceedings of DoD HPCMP UGC 2005 (to appear)
%D 2005
%T Dynamic Process Management for Pipelined Applications
%A David Cronk
%A Graham Fagg
%A Susan Emeny
%A Scott Tucker
%B Proceedings of DoD HPCMP UGC 2005 (to appear)
%I IEEE
%C Nashville, TN
%8 2005-01
%G eng

%0 Conference Proceedings
%B International Conference on Computational Science
%D 2004
%T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations
%A Piotr Luszczek
%A Jack Dongarra
%E Marian Bubak
%E Geert Dick van Albada
%E Peter M. Sloot
%E Jack Dongarra
%K lacsi
%K lfc
%B International Conference on Computational Science
%I Springer Verlag
%C Poland
%8 2004-06
%G eng
%R 10.1007/978-3-540-25944-2_35

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Distributed Probablistic Model-Building Genetic Algorithm
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Masaki Sano
%A Hisashi Shimosaka
%A Shigeyoshi Tsutsui
%A Jack Dongarra
%B Lecture Notes in Computer Science
%I Springer-Verlag, Heidelberg
%V 2723
%P 1015-1028
%8 2003-01
%G eng

%0 Journal Article
%J ICL Tech Report
%D 2003
%T Distributed Storage in RIB
%A Thomas B. Boehmann
%K rib
%B ICL Tech Report
%8 2003-03
%G eng

%0 Conference Proceedings
%B Parallel Computing: Advances and Current Issues:Proceedings of the International Conference ParCo2001
%D 2002
%T Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion
%A Kenneth Roche
%A Jack Dongarra
%E Gerhard R. Joubert
%E Almerica Murli
%E Frans Peters
%E Marco Vanneschi
%K lfc
%K sans
%B Parallel Computing: Advances and Current Issues:Proceedings of the International Conference ParCo2001
%I Imperial College Press
%C London, England
%8 2002-01
%G eng

%0 Generic
%D 2002
%T Development of the PICMSS NetSolve Service
%A Matthew Kelleher Jr.
%K netsolve
%B ICL Technical Report
%8 2002-04
%G eng

%0 Generic
%D 2000
%T Design and Implementation of NetSolve using DCOM as the Remoting Layer
%A Ganapathy Raman
%A Jack Dongarra
%K netsolve
%B University of Tennessee Computer Science Department Technical Report
%8 2000-05
%G eng

%0 Journal Article
%J Concurrency: Practice and Experience
%D 2000
%T The Design and Implementation of the Parallel Out of Core ScaLAPACK LU, QR, and Cholesky Factorization Routines
%A Eduardo D'Azevedo
%A Jack Dongarra
%B Concurrency: Practice and Experience
%V 12
%P 1481-1493
%8 2000-01
%G eng

%0 Conference Proceedings
%B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications
%D 2000
%T Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications
%A Dorian Arnold
%A Jack Dongarra
%K netsolve
%B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications
%C Ottawa, Canada
%8 2000-10
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 1999
%T Deploying Fault-tolerance and Task Migration with NetSolve
%A Henri Casanova
%A James Plank
%A Micah Beck
%A Jack Dongarra
%K netsolve
%B Future Generation Computer Systems
%I Elsevier
%V 15
%P 745-755
%8 1999-10
%G eng