%0 Journal Article
%J Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms
%D 2018
%T Evaluation of Dataflow Programming Models for Electronic Structure Theory
%A Heike Jagode
%A Anthony Danalis
%A Reazul Hoque
%A Mathieu Faverge
%A Jack Dongarra
%K CCSD
%K coupled cluster methods
%K dataflow
%K NWChem
%K OpenMP
%K parsec
%K StarPU
%K task-based runtime
%X Dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. In this paper, we evaluate different dataflow programming models for electronic structure methods and compare them in terms of programmability, resource utilization, and scalability. In particular, we evaluate two programming paradigms for expressing scientific applications in a dataflow form: (1) explicit dataflow, where the dataflow is specified explicitly by the developer, and (2) implicit dataflow, where a task scheduling runtime derives the dataflow using per-task data-access information embedded in a serial program. We discuss our findings and present a thorough experimental analysis using methods from the NWChem quantum chemistry application as our case study, and OpenMP, StarPU, and PaRSEC as the task-based runtimes that enable the different forms of dataflow execution. Furthermore, we derive an abstract model to explore the limits of the different dataflow programming paradigms.
%B Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms
%V 2018
%P 1–20
%8 05-2018
%G eng
%N e4490
%R 10.1002/cpe.4490
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2017
%T Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
%A Mathieu Faverge
%A Julien Langou
%A Yves Robert
%A Jack Dongarra
%K Algorithm design and analysis
%K Approximation algorithms
%K Kernel
%K Multicore processing
%K Shape
%K Software algorithms
%K Transforms
%X We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm that consists of first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY-based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes, and core counts.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Orlando, FL
%8 05-2017
%G eng
%R 10.1109/IPDPS.2017.46
%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T A Data Flow Divide and Conquer Algorithm for Multicore Architecture
%A Azzam Haidar
%A Jakub Kurzak
%A Gregoire Pichon
%A Mathieu Faverge
%K Eigensolver
%K lapack
%K Multicore
%K plasma
%K task-based programming
%X Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-element computations for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime that allows the programmer to tune the task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 05-2015
%G eng
%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T Hierarchical DAG scheduling for Hybrid Distributed Systems
%A Wei Wu
%A Aurelien Bouteiller
%A George Bosilca
%A Mathieu Faverge
%A Jack Dongarra
%K dense linear algebra
%K gpu
%K heterogeneous architecture
%K PaRSEC runtime
%X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the expression of algorithm parallelism onto the resulting multi-dimensional heterogeneity, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper, we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results, we show that, with one-sided factorizations, i.e., Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 05-2015
%G eng
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2015
%T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%K lu factorization
%K Numerical algorithms
%K QR factorization
%K Stability
%K Performance
%X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap as QR steps in terms of floating-point operations. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performance.
%B Journal of Parallel and Distributed Computing
%V 85
%P 32-46
%8 11-2015
%G eng
%R 10.1016/j.jpdc.2015.06.007
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared-memory architectures. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 04-2015
%G eng
%N 5
%R 10.1002/cpe.3306
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%K factorization
%K parallel linear algebra
%K recursion
%K shared memory synchronization
%K threaded parallelism
%X The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS.
%B Concurrency and Computation: Practice and Experience
%V 26
%P 1408-1431
%8 05-2014
%G eng
%U http://doi.wiley.com/10.1002/cpe.3110
%N 7
%! Concurrency Computat.: Pract. Exper.
%& 1408
%R 10.1002/cpe.3110
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Designing LU-QR Hybrid Solvers for Performance and Stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap as QR steps in terms of operations. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%@ 978-1-4799-3800-1
%G eng
%R 10.1109/IPDPS.2014.108
%0 Conference Paper
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%D 2014
%T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes
%A Xavier Lacoste
%A Mathieu Faverge
%A Pierre Ramet
%A Samuel Thibault
%A George Bosilca
%K DAG based runtime
%K gpu
%K Multicore
%K Sparse linear solver
%X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm's efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers (native PaStiX, StarPU, and PaRSEC) on different execution contexts is performed. The analysis highlights the similarities, from a performance point of view, between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments and are, therefore, a sustainable solution for hybrid environments.
%B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 05-2014
%G eng
%0 Generic
%D 2013
%T Designing LU-QR hybrid solvers for performance and stability
%A Mathieu Faverge
%A Julien Herrmann
%A Julien Langou
%A Bradley Lowery
%A Yves Robert
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report (also LAWN 282)
%I University of Tennessee
%8 10-2013
%G eng
%0 Journal Article
%J Parallel Computing
%D 2013
%T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Mathias Jacquelin
%A Julien Langou
%A Yves Robert
%K Cluster
%K Distributed memory
%K Hierarchical architecture
%K multi-core
%K numerical linear algebra
%K QR factorization
%X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low-level" for decoupled highly parallel inter-node reductions, (2) "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) provide insights into how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
%B Parallel Computing
%V 39
%P 212-232
%8 05-2013
%G eng
%N 4-5
%0 Generic
%D 2013
%T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC
%A Guillaume Aupy
%A Mathieu Faverge
%A Yves Robert
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph derived from the algorithm, and only requires the user to decide, at the high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, demonstrating that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
%B Lawn 277
%8 05-2013
%G eng
%0 Journal Article
%J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%D 2013
%T Multithreading in the PLASMA Library
%A Jakub Kurzak
%A Piotr Luszczek
%A Asim YarKhan
%A Mathieu Faverge
%A Julien Langou
%A Henricus Bouwmeester
%A Jack Dongarra
%E Mohamed Ahmed
%E Reda Ammar
%E Sanguthevar Rajasekaran
%B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%I Taylor & Francis
%8 00-2013
%G eng
%0 Journal Article
%J IEEE Computing in Science and Engineering
%D 2013
%T PaRSEC: Exploiting Heterogeneity to Enhance Scalability
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Thomas Herault
%A Jack Dongarra
%X New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution environment with sufficient intelligence and flexibility to rearrange the execution to improve resource utilization.
%B IEEE Computing in Science and Engineering
%V 15
%P 36-45
%8 11-2013
%G eng
%N 6
%R 10.1109/MCSE.2013.98
%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B University of Tennessee Computer Science Technical Report
%8 07-2013
%G eng
%0 Conference Proceedings
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%D 2012
%T Hierarchical QR factorization algorithms for multi-core cluster systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium
%I IEEE Computer Society Press
%C Shanghai, China
%8 05-2012
%G eng
%0 Conference Proceedings
%B Proceedings of VECPAR’12
%D 2012
%T Programming the LU Factorization for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Mathieu Faverge
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of VECPAR’12
%C Kobe, Japan
%8 04-2012
%G eng
%0 Generic
%D 2011
%T Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report (also as a LAWN)
%8 09-2011
%G eng
%0 Conference Proceedings
%B Proceedings of PARCO'11
%D 2011
%T Exploiting Fine-Grain Parallelism in Recursive LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%K plasma
%B Proceedings of PARCO'11
%C Gent, Belgium
%8 04-2011
%G eng
%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 05-2011
%G eng
%0 Generic
%D 2011
%T Hierarchical QR factorization algorithms for multi-core cluster systems
%A Jack Dongarra
%A Mathieu Faverge
%A Thomas Herault
%A Julien Langou
%A Yves Robert
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report (also Lawn 257)
%8 10-2011
%G eng
%0 Conference Proceedings
%B Proceedings of MTAGS11
%D 2011
%T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Piotr Luszczek
%B Proceedings of MTAGS11
%C Seattle, WA
%8 11-2011
%G eng
%0 Journal Article
%J IEEE/ACS AICCSA 2011
%D 2011
%T LU Factorization for Accelerator-based Systems
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Julien Langou
%A Hatem Ltaief
%A Stanimire Tomov
%K magma
%K morse
%B IEEE/ACS AICCSA 2011
%C Sharm-El-Sheikh, Egypt
%8 12-2011
%G eng
%0 Conference Proceedings
%B Parallel Tools Workshop
%D 2011
%T An open-source tool-chain for performance analysis
%A Kevin Coulomb
%A Augustin Degomme
%A Mathieu Faverge
%A Francois Trahay
%B Parallel Tools Workshop
%C Dresden, Germany
%8 09-2011
%G eng
%0 Generic
%D 2011
%T Towards a Parallel Tile LDL Factorization for Multicore Architectures
%A Dulceneia Becker
%A Mathieu Faverge
%A Jack Dongarra
%K plasma
%K quark
%B ICL Technical Report
%C Seattle, WA
%8 04-2011
%G eng
%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 09-2010
%G eng
%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 00-2010
%G eng
%0 Generic
%D 2010
%T EZTrace: a generic framework for performance analysis
%A Jack Dongarra
%A Mathieu Faverge
%A Yutaka Ishikawa
%A Raymond Namyst
%A François Rue
%A Francois Trahay
%B ICL Technical Report
%8 12-2010
%G eng
%0 Conference Proceedings
%B Proceedings of IPDPS 2011
%D 2010
%T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaief
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%K plasma
%B Proceedings of IPDPS 2011
%C Anchorage, AK
%8 10-2010
%G eng
%0 Journal Article
%J PGI Insider
%D 2010
%T Using MAGMA with PGI Fortran
%A Stanimire Tomov
%A Mathieu Faverge
%A Piotr Luszczek
%A Jack Dongarra
%K magma
%B PGI Insider
%8 11-2010
%G eng