Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required.

This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. To the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8-core each) running the Intel MKL library.

%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Proceedings %B Software for Exascale Computing - SPPEXA %D 2016 %T Domain Overlap for Iterative Sparse Triangular Solves on GPUs %A Hartwig Anzt %A Edmond Chow %A Daniel Szyld %A Jack Dongarra %E Hans-Joachim Bungartz %E Philipp Neumann %E Wolfgang E. Nagel %X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution. 
%B Software for Exascale Computing - SPPEXA %S Lecture Notes in Computer Science and Engineering %I Springer International Publishing %V 113 %P 527–545 %8 2016-09 %G eng %R 10.1007/978-3-319-40528-5_24 %0 Conference Proceedings %B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2016 %T Efficiency of General Krylov Methods on GPUs – An Experimental Study %A Hartwig Anzt %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K algorithmic bombardment %K BiCGSTAB %K CGS %K Convergence %K Electric breakdown %K gpu %K graphics processing units %K Hardware %K IDR(s) %K Krylov solver %K Libraries %K linear systems %K QMR %K Sparse matrices %X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods' performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are "orthogonal" in terms of problem suitability. We propose best practices for choosing methods in a "black box" scenario, where no information about the optimal solver is available. 
%B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %P 683-691 %8 2016-05 %G eng %R 10.1109/IPDPSW.2016.45 %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %D 2016 %T Efficiency of General Krylov Methods on GPUs – An Experimental Study %A Hartwig Anzt %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K algorithmic bombardment %K BiCGSTAB %K CGS %K gpu %K IDR(s) %K Krylov solver %K QMR %X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods’ performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are “orthogonal” in terms of problem suitability. We propose best practices for choosing methods in a “black box” scenario, where no information about the optimal solver is available. 
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %I IEEE %C Chicago, IL %8 2016-05 %G eng %R 10.1109/IPDPSW.2016.45 %0 Conference Proceedings %B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16) %D 2016 %T Failure Detection and Propagation in HPC Systems %A George Bosilca %A Aurelien Bouteiller %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Pierre Sens %A Jack Dongarra %K failure detection %K fault-tolerance %K MPI %B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16) %I IEEE Press %C Salt Lake City, Utah %P 27:1-27:11 %8 2016-11 %@ 978-1-4673-8815-3 %G eng %U http://dl.acm.org/citation.cfm?id=3014904.3014941 %0 Journal Article %J Journal of Computational Science %D 2016 %T Fine-grained Bit-Flip Protection for Relaxation Methods %A Hartwig Anzt %A Jack Dongarra %A Enrique S. Quintana-Orti %K Bit flips %K Fault tolerance %K High Performance Computing %K iterative solvers %K Jacobi method %K sparse linear systems %X Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. 
Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance. %B Journal of Computational Science %8 2016-11 %G eng %R https://doi.org/10.1016/j.jocs.2016.11.013 %0 Conference Paper %B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16) %D 2016 %T GPU-Aware Non-contiguous Data Movement In Open MPI %A Wei Wu %A George Bosilca %A Rolf vandeVaart %A Sylvain Jeaugey %A Jack Dongarra %K datatype %K gpu %K hybrid architecture %K MPI %K non-contiguous data %X Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance.

To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.

%B 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16) %I ACM %C Kyoto, Japan %8 2016-06 %G eng %R http://dx.doi.org/10.1145/2907294.2907317 %0 Conference Paper %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2016 %T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures %A Yulu Jia %A Piotr Luszczek %A Jack Dongarra %X Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction. %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %D 2016 %T Heterogeneous Streaming %A Chris J. 
Newburn %A Gaurav Bansal %A Michael Wood %A Luis Crivelli %A Judit Planas %A Alejandro Duran %A Paulo Souza %A Leonardo Borges %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %A Hartwig Anzt %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Ichitaro Yamazaki %A Jesus Labarta %X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application. %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2016 %T High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %B International Journal of High Performance Computing Applications %V 30 %P 3 - 10 %8 2016-02 %G eng %U http://hpc.sagepub.com/cgi/doi/10.1177/1094342015593158 %N 1 %! 
International Journal of High Performance Computing Applications %R 10.1177/1094342015593158 %0 Generic %D 2016 %T High Performance Realtime Convex Solver for Embedded Systems %A Ichitaro Yamazaki %A Saeid Nooshabadi %A Stanimire Tomov %A Jack Dongarra %K KKT %K Realtime embedded convex optimization solver %X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique to reduce the run-time of decomposition of KKT matrix for the convex optimization solver for an embedded system, by two orders of magnitude. We use the property that although the KKT matrix changes, some of its block sub-matrices are fixed during the solution iterations and the associated solving instances. %B University of Tennessee Computer Science Technical Report %8 2016-10 %G eng %0 Conference Paper %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %D 2016 %T High-performance Matrix-matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Joël Falcou %A Jack Dongarra %X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries. 
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %I Springer International Publishing %C Grenoble, France %8 2016-08 %G eng %0 Generic %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-01 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Generic %D 2016 %T The HPL Benchmark: Past, Present & Future %A Jack Dongarra %C ISC High Performance, Frankfurt, Germany %8 2016-07 %G eng %9 Conference Presentation %0 Journal Article %J Acta Numerica %D 2016 %T Linear Algebra Software for Large-Scale Accelerated Multicore Computing %A Ahmad Abdelfattah %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A undefined %A Asim YarKhan %X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. 
This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems. %B Acta Numerica %V 25 %P 1-160 %8 2016-05 %G eng %R 10.1017/S0962492916000015 %0 Conference Paper %B IEEE High Performance Extreme Computing Conference (HPEC'16) %D 2016 %T LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi %A Azzam Haidar %A Stanimire Tomov %A Konstantin Arturov %A Murat Guney %A Shane Story %A Jack Dongarra %X A wide variety of heterogeneous compute resources, ranging from multicore CPUs to GPUs and coprocessors, are available to modern computers, making it challenging to design unified numerical libraries that efficiently and productively use all these varied resources. For example, in order to efficiently use Intel’s Knights Landing (KNL) processor, the next-generation of Xeon Phi architectures, one must design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance. We propose a productive and portable programming model that allows us to write a serial-looking code, which, however, achieves parallelism and scalability by using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and the parallel execution. This is done through multiple techniques ranging from multi-level data partitioning to adaptive task grain sizes, and dynamic task scheduling. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. 
Finally, we outline the strengths and the effectiveness of this approach – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate current work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B IEEE High Performance Extreme Computing Conference (HPEC'16) %I IEEE %C Waltham, MA %8 2016-09 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2016 %T Non-GPU-resident Dense Symmetric Indefinite Factorization %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of the data movement into the memory: one needed for selecting and applying pivots and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance of such an algorithm when the pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry, which ensures the stability, may need to be selected as a pivot among the remaining diagonals, and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, the data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called panel, and then uses the factorized panel to update the trailing submatrix. 
This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would require updating each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization compared with an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires swapping the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization by the symmetric factorization). To reduce the amount of the data transfer, in this paper we use the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require loading the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization. Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. 
Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms. %B Concurrency and Computation: Practice and Experience %8 2016-11 %G eng %R 10.1002/cpe.4012 %0 Conference Paper %B 2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) %D 2016 %T Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations %A Azzam Haidar %A Benjamin Brock %A Stanimire Tomov %A Michael Guidry %A Jay Jay Billings %A Daniel Shyles %A Jack Dongarra %X We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms that solve efficiently N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small-size. 
As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, that we call batched routine, in order to saturate the hardware with enough work. %B 2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) %I IEEE %C Waltham, MA %8 2016-09 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2016 %T On the performance and energy efficiency of sparse linear algebra on GPUs %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers. 
%B International Journal of High Performance Computing Applications %8 2016-10 %G eng %U http://hpc.sagepub.com/content/early/2016/10/05/1094342016672081.abstract %R 10.1177/1094342016672081 %0 Book Section %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %E Julian M. Kunkel %E Pavan Balaji %E Jack Dongarra %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. 
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %I Springer International Publishing %P 21–38 %@ 978-3-319-41321-1 %G eng %U http://dx.doi.org/10.1007/978-3-319-41321-1_2 %R 10.1007/978-3-319-41321-1_2 %0 Conference Paper %B The International Supercomputing Conference (ISC High Performance 2016) %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B The International Supercomputing Conference (ISC High Performance 2016) %C Frankfurt, Germany %8 2016-06 %G eng %0 Generic %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X Abstract. 
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-02 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2016 %T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs %A Ahmad Abdelfattah %A Hatem Ltaief %A David Keyes %A Jack Dongarra %X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. 
Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications. %B Concurrency and Computation: Practice and Experience %V 28 %P 3447-3465 %8 2016-05 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full %N 12 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.3874 %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batched computation %K Cholesky Factorization %K GPUs %K Tuning %X Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. 
In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually.

This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.

%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Journal Article %J International Journal of Parallel Programming %D 2016 %T Porting the PLASMA Numerical Library to the OpenMP Standard %A Asim YarKhan %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which allows PLASMA to transition from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such a transition. %B International Journal of Parallel Programming %8 2016-06 %G eng %U http://link.springer.com/10.1007/s10766-016-0441-6 %U http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6 %U http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6.pdf %U http://link.springer.com/article/10.1007/s10766-016-0441-6/fulltext.html %! 
Int J Parallel Prog %R 10.1007/s10766-016-0441-6 %0 Conference Proceedings %B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany %D 2016 %T Power Management and Event Verification in PAPI %A Heike Jagode %A Asim YarKhan %A Anthony Danalis %A Jack Dongarra %X For more than a decade, the PAPI performance monitoring library has helped to implement the familiar maxim attributed to Lord Kelvin: “If you cannot measure it, you cannot improve it.” Widely deployed and widely used, PAPI provides a generic, portable interface for the hardware performance counters available on all modern CPUs and some other components of interest that are scattered across the chip and system. Recent and radical changes in processor and system design (systems that combine multicore CPUs and accelerators, shared and distributed memory, PCI-Express and other interconnects), as well as the emergence of power efficiency as a primary design constraint, and reduced data movement as a primary programming goal, pose new challenges and bring new opportunities to PAPI. We discuss new developments of PAPI that allow for multiple sources of performance data to be measured simultaneously via a common software interface. Specifically, a new PAPI component that controls power is discussed. We explore the challenges of shared hardware counters that include system-wide measurements in existing multicore architectures. We conclude with an exploration of future directions for the PAPI interface. %B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany %I Springer International Publishing %C Dresden, Germany %P pp. 
41-51 %@ 978-3-319-39589-0 %G eng %R https://doi.org/10.1007/978-3-319-39589-0_4 %0 Generic %D 2016 %T Report on the Sunway TaihuLight System %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-06 %G eng %U http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf %0 Conference Paper %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2016 %T Search Space Generation and Pruning System for Autotuners %A Piotr Luszczek %A Mark Gates %A Jakub Kurzak %A Anthony Danalis %A Jack Dongarra %X This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space, while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to standard C code for fast and, as necessary, multithreaded evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by the user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix multiplication kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves application of 10 complex pruning constraints. The speed of evaluation is compared against generators created using imperative programming style in various scripting and compiled languages. 
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2016 %T Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X To orthonormalize a set of dense vectors, Singular Value QR (SVQR) requires only one global reduction between the parallel processing units, and uses BLAS-3 kernels to perform most of its local computation. As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many of the current computers. In this paper, we study the stability and performance of various SVQR implementations on multicore CPUs with a GPU, focusing on the dense triangular solve, which performs half of the total floating-point operations in SVQR. As a part of this study, we examine its adaptive mixed-precision variant that decides if a lower-precision arithmetic can be used for the triangular solution at runtime without increasing the order of its orthogonality error. Since the backward error of this adaptive mixed-precision variant is significantly greater than that of the standard SVQR, we study its effects on the solution convergence of several subspace projection methods for solving a linear system of equations and for computing singular values or eigenvalues of a sparse matrix. Our experimental results indicate that in some cases, the convergence rate of the solver may not be affected by the larger backward errors, while reducing the time to solution. %B ACM Transactions on Mathematical Software (TOMS) %V 43 %8 2016-10 %G eng %N 2 %0 Generic %D 2016 %T A Standard for Batched BLAS Routines %A Pedro Valero-Lara %A Jack Dongarra %A Azzam Haidar %A Samuel D. 
Relton %A Stanimire Tomov %A Mawussi Zounon %I 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16) %C Paris, France %8 2016-04 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD) %D 2016 %T Towards Achieving Performance Portability Using Directives for Accelerators %A M. Graham Lopez %A Larrea, V %A Joubert, W %A Hernandez, O %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both x86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, as well as an x86_64 host system with Intel Xeon Phi coprocessors. In this paper, we explain what factors affected the performance portability such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target to multiple platforms. 
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD) %I Innovative Computing Laboratory, University of Tennessee %C Salt Lake City, Utah %8 2016-11 %G eng %0 Journal Article %J Numerical Algorithms %D 2016 %T Updating Incomplete Factorization Preconditioners for Model Order Reduction %A Hartwig Anzt %A Edmond Chow %A Jens Saak %A Jack Dongarra %K key publication %X When solving a sequence of related linear systems by iterative methods, it is common to reuse the preconditioner for several systems, and then to recompute the preconditioner when the matrix has changed significantly. Rather than recomputing the preconditioner from scratch, it is potentially more efficient to update the previous preconditioner. Unfortunately, it is not always known how to update a preconditioner, for example, when the preconditioner is an incomplete factorization. A recently proposed iterative algorithm for computing incomplete factorizations, however, is able to exploit an initial guess, unlike existing algorithms for incomplete factorizations. By treating a previous factorization as an initial guess to this algorithm, an incomplete factorization may thus be updated. We use a sequence of problems from model order reduction. Experimental results using an optimized GPU implementation show that updating a previous factorization can be inexpensive and effective, making solving sequences of linear systems a potential niche problem for the iterative incomplete factorization algorithm. 
%B Numerical Algorithms %V 73 %P 611–630 %8 2016-02 %G eng %N 3 %R 10.1007/s11075-016-0110-2 %0 Conference Paper %B 2015 IEEE International Conference on Big Data (IEEE BigData 2015) %D 2015 %T Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs %A Mark Gates %A Hartwig Anzt %A Jakub Kurzak %A Jack Dongarra %X In this paper we accelerate the Alternating Least Squares (ALS) algorithm used for generating product recommendations on the basis of implicit feedback datasets. We approach the algorithm with concepts proven to be successful in High Performance Computing. This includes the formulation of the algorithm as a mix of cache-optimized algorithm-specific kernels and standard BLAS routines, acceleration via graphics processing units (GPUs), use of parallel batched kernels, and autotuning to identify performance winners. For benchmark datasets, the multi-threaded CPU implementation we propose achieves more than a 10 times speedup over the implementations available in the GraphLab and Spark MLlib software packages. For the GPU implementation, the parameters of an algorithm-specific kernel were optimized using a comprehensive autotuning sweep. This results in an additional 2 times speedup over our CPU implementation. %B 2015 IEEE International Conference on Big Data (IEEE BigData 2015) %I IEEE %C Santa Clara, CA %8 2015-11 %G eng %0 Conference Paper %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %D 2015 %T Accelerating NWChem Coupled Cluster through dataflow-based Execution %A Heike Jagode %A Anthony Danalis %A George Bosilca %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. 
In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelised in coarse chunks. In this paper, we present our effort of converting NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller) – a software package designed to enable high performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer grained tasks (compared to the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %I Springer International Publishing %C Krakow, Poland %8 2015-09 %G eng %0 Conference Paper %B Spring Simulation Multi-Conference 2015 (SpringSim'15) %D 2015 %T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver, the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modified sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously accelerates the computation of a set of consecutive SpMVs significantly. 
Comparing the performance against similar routines from Intel's MKL and NVIDIA's cuSPARSE library we identify appealing performance improvements. We integrate it into the highly optimized LOBPCG implementation. Compared to the BLOBEX CPU implementation running on two eight-core Intel Xeon E5-2690s, we accelerate the computation of a small set of eigenvectors using NVIDIA's K40 GPU by typically more than an order of magnitude. %B Spring Simulation Multi-Conference 2015 (SpringSim'15) %I SCS %C Alexandria, VA %8 2015-04 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction %A Hartwig Anzt %A William Sawyer %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B International Journal of High Performance Computing Applications %G eng %0 Conference Paper %B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15) %D 2015 %T Adaptive Precision Solvers for Sparse Linear Systems %A Hartwig Anzt %A Jack Dongarra %A Enrique S. Quintana-Orti %B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Journal Article %J ACM Transactions on Parallel Computing %D 2015 %T Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Peng Du %A Jack Dongarra %E Phillip B. Gibbons %K ABFT %K algorithms %K fault-tolerance %K High Performance Computing %K linear algebra %X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). 
This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We will present a generic solution for protecting the right factor, where the updates are applied, of all above mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage leftover from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely to the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. Applicability to tolerate multiple failures and accuracy after multiple recovery is also considered. 
%B ACM Transactions on Parallel Computing %V 1 %P 10:1-10:28 %8 2015-01 %G eng %N 2 %R 10.1145/2686892 %0 Conference Paper %B International Supercomputing Conference (ISC 2015) %D 2015 %T Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs %A Edmond Chow %A Hartwig Anzt %A Jack Dongarra %B International Supercomputing Conference (ISC 2015) %C Frankfurt, Germany %8 2015-07 %G eng %0 Conference Paper %B EuroMPI/Asia 2015 Workshop %D 2015 %T Batched Matrix Computations on Hardware Accelerators %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. 
Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereupon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU. %B EuroMPI/Asia 2015 Workshop %C Bordeaux, France %8 2015-09 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Batched matrix computations on hardware accelerators based on GPUs %A Azzam Haidar %A Tingxing Dong %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. 
This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereupon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU. %B International Journal of High Performance Computing Applications %8 2015-02 %G eng %R 10.1177/1094342014567546 %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Batched Matrix Computations on Hardware Accelerators Based on GPUs %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. 
Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general. %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %D 2015 %T Cholesky Across Accelerators %A Asim YarKhan %A Azzam Haidar %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %I IEEE %C Elizabeth, NJ %8 2015-08 %G eng %0 Journal Article %J International Journal of Networking and Computing %D 2015 %T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K model %K performance evaluation %K resilience %X Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. 
However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint. %B International Journal of Networking and Computing %V 5 %P 2-15 %8 2015-01 %G eng %0 Journal Article %J Scientific Programming %D 2015 %T Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X Low-rank matrices arise in many scientific and engineering computations. Both computational and storage costs of manipulating such matrices may be reduced by taking advantage of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. 
We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently developed software package, StruMF, which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%-50% using the GPU. %B Scientific Programming %G eng %0 Conference Paper %B ISC High Performance 2015 %D 2015 %T On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X The dramatic change in computer architecture due to the manycore paradigm shift has made the development of optimal numerical routines extremely challenging. In this work, we target the development of numerical algorithms and implementations for Xeon Phi coprocessor architecture designs. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are some of the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach on how to address the challenges for this problem, starting from our algorithm design, performance analysis and programming model, to kernel optimization. 
Our goal, by targeting low-level, easy to understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere for the use on manycore coprocessors, and to show significant performance improvements compared to the existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we finally present the significance of using these routines/strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD) that by themselves are foundational for many important applications. %B ISC High Performance 2015 %C Frankfurt, Germany %8 2015-07 %G eng %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A George Bosilca %A Thomas Herault %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. 
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Efficient Eigensolver Algorithms on Accelerator Based Architectures %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges -starting from our algorithm design, kernel optimization and tuning, to our programming model- in the development of a scalable high-performance symmetric eigenvalue and singular value solver. %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems %A Raffaele Solcà %A Anton Kozhevnikov %A Azzam Haidar %A Stanimire Tomov %A Thomas C. Schulthess %A Jack Dongarra %X We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multicore CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. 
Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multicore CPU only systems for such complex applications. %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15) %D 2015 %T Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a SpM-matrix product (SpMM), that achieves significantly higher performance than a sequence of SpMVs. 
We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers. %B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15) %I ACM %C San Francisco, CA %8 2015-02 %@ 978-1-4503-3404-4 %G eng %R 10.1145/2712386.2712387 %0 Generic %D 2015 %T Exascale Computing and Big Data %A Dan Reed %A Jack Dongarra %X Scientific discovery and engineering innovation requires unifying traditionally separated high-performance computing and big data analytics. 
%B Communications of the ACM %I ACM %V 58 %P 56-68 %8 2015-07 %G eng %9 Magazine Article %R 10.1145/2699414 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T Experiences in autotuning matrix multiplication for energy minimization on GPUs %A Hartwig Anzt %A Blake Haugen %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B Concurrency and Computation: Practice and Experience %V 27 %P 5096-5113 %8 2015-12 %G eng %N 17 %R 10.1002/cpe.3516 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs %A Hartwig Anzt %A Blake Haugen %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K Autotuning %K energy efficiency %K hardware accelerators %K matrix multiplication %K power %X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use. %B Concurrency and Computation: Practice and Experience %V 27 %P 5096 - 5113 %8 12-Oct %G eng %U http://doi.wiley.com/10.1002/cpe.3516 https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Fcpe.3516 %N 17 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.3516 %0 Generic %D 2015 %T Fault Tolerance Techniques for High-performance Computing %A Jack Dongarra %A Thomas Herault %A Yves Robert %X This report provides an introduction to resilience methods. 
The emphasis is on checkpointing, the de-facto standard technique for resilience in High Performance Computing. We present the main two protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption). %B University of Tennessee Computer Science Technical Report (also LAWN 289) %I University of Tennessee %8 2015-05 %G eng %U http://www.netlib.org/lapack/lawnspdf/lawn289.pdf %0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications %D 2015 %T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization %A Azzam Haidar %A Asim YarKhan %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with degree of parallelism between GPUs and multicore-CPUs. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. 
In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriate sized workload-grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization. %B 17th IEEE International Conference on High Performance Computing and Communications %C Newark, NJ %8 2015-08 %G eng %0 Conference Paper %B ISC High Performance %D 2015 %T Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations %A Azzam Haidar %A Tingxing Dong %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B ISC High Performance %I Springer %C Frankfurt, Germany %8 2015-07 %G eng %0 Conference Paper %B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing %D 2015 %T GPU-accelerated Co-design of Induced Dimension Reduction: Algorithmic Fusion and Kernel Overlap %A Hartwig Anzt %A Eduardo Ponce %A Gregory D. Peterson %A Jack Dongarra %X In this paper we present an optimized GPU co-design of the Induced Dimension Reduction (IDR) algorithm for solving linear systems. Starting from a baseline implementation based on the generic BLAS routines from the MAGMA software library, we apply optimizations that are based on kernel fusion and kernel overlap. Runtime experiments are used to investigate the benefit of the distinct optimization techniques for different variants of the IDR algorithm. 
A comparison to the reference implementation reveals that the interplay between them can succeed in cutting the overall runtime by up to about one third. %B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Hierarchical DAG scheduling for Hybrid Distributed Systems %A Wei Wu %A Aurelien Bouteiller %A George Bosilca %A Mathieu Faverge %A Jack Dongarra %K dense linear algebra %K gpu %K heterogeneous architecture %K PaRSEC runtime %X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle mapping the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e. 
Cholesky and QR, a carefully written algorithm, supported by an adaptive tasks-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments. %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Book Section %B The Princeton Companion to Applied Mathematics %D 2015 %T High-Performance Computing %A Jack Dongarra %A Nicholas J. Higham %A Mark R. Dennis %A Paul Glendinning %A Paul A. Martin %A Fadil Santosa %A Jared Tanner %B The Princeton Companion to Applied Mathematics %I Princeton University Press %C Princeton, New Jersey %P 839-842 %@ 9781400874477 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2015 %T High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %K Additive Schwarz %K HPC Benchmarking %K Multigrid smoothing %K Preconditioned Conjugate Gradient %K Validation and Verification %X We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement. 
%B The International Journal of High Performance Computing Applications %G eng %R 10.1177/1094342015593158 %0 Journal Article %J Scientific Programming %D 2015 %T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi %A Azzam Haidar %A Jack Dongarra %A Khairul Kabir %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %A Yulu Jia %K communication and computation overlap %K dynamic runtime scheduling using dataflow dependences %K hardware accelerators and coprocessors %K Intel Xeon Phi processor %K Many Integrated Cores %K numerical linear algebra %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA. 
%B Scientific Programming %V 23 %8 2015-01 %G eng %N 1 %R 10.3233/SPR-140404 %0 Generic %D 2015 %T HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %K Additive Schwarz %K HPC Benchmarking %K Multigrid smoothing %K Preconditioned Conjugate Gradient %K Validation and Verification %X We describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2015-01 %G eng %U http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2015 %T Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs %A Jakub Kurzak %A Hartwig Anzt %A Mark Gates %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %8 2015-11 %G eng %0 Conference Paper %B EuroPar 2015 %D 2015 %T Iterative Sparse Triangular Solves for Preconditioning %A Hartwig Anzt %A Edmond Chow %A Jack Dongarra %X Sparse triangular solvers are typically parallelized using level scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. 
We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations. %B EuroPar 2015 %I Springer Berlin %C Vienna, Austria %8 2015-08 %G eng %U http://dx.doi.org/10.1007/978-3-662-48096-0_50 %R 10.1007/978-3-662-48096-0_50 %0 Conference Paper %B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award) %D 2015 %T MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing %A Azzam Haidar %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %X Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constraints for low power consumption and small sizes, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded system aware algorithms, that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, e.g., featuring 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries. 
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award) %I IEEE %C Waltham, MA %8 2015-09 %G eng %0 Generic %D 2015 %T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %I ISC High Performance (ISC15), Intel Booth Presentation %C Frankfurt, Germany %8 2015-06 %G eng %0 Conference Paper %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2015 %T Mixed-precision Block Gram Schmidt Orthogonalization %A Ichitaro Yamazaki %A Stanimire Tomov %A Jakub Kurzak %A Jack Dongarra %A Jesse Barlow %X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which could significantly increase its computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as a hybrid CPU/GPU cluster, demonstrate that compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7:1 while maintaining about the same order of the numerical errors. 
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I ACM %C Austin, TX %8 2015-11 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2015 %T Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many of the current computer architectures, where the communication is becoming increasingly expensive compared to the arithmetic operations. This is especially true when the input matrix is tall-skinny. Unfortunately, the orthogonality error of CholQR depends quadratically on the condition number of the input matrix, and it is numerically unstable when the matrix is ill-conditioned. To enhance the stability of CholQR, we recently used mixed-precision arithmetic; the input and output matrices are in the working precision, but some of its intermediate results are accumulated in the doubled precision. In this paper, we analyze the numerical properties of this mixed-precision CholQR. Our analysis shows that by selectively using the doubled precision, the orthogonality error of the mixed-precision CholQR only depends linearly on the condition number of the input matrix. We provide numerical results to demonstrate the improved numerical stability of the mixed-precision CholQR in practice. We then study its performance. When the target hardware does not support the desired higher precision, software emulation is needed. 
For example, using software-emulated double-double precision for the working 64-bit double precision, the mixed-precision CholQR requires about 8.5x more floating-point instructions than that required by the standard CholQR. On the other hand, the increase in the communication cost using the double-double precision is less significant, and our performance results on multicore CPU with a different graphics processing unit (GPU) demonstrate that the overhead of using the double-double arithmetic is decreasing on a newer architecture, where the computation is becoming less expensive compared to the communication. As a result, with a latest NVIDIA GPU, the mixed-precision CholQR was only 1.4x slower than the standard CholQR. Finally, we present case studies of using the mixed-precision CholQR within communication-avoiding variants of Krylov subspace projection methods for solving a nonsymmetric linear system of equations and for solving a symmetric eigenvalue problem, on a multicore CPU with multiple GPUs. These case studies demonstrate that by using the higher precision for this small but critical segment of the Krylov methods, we can improve not only the overall numerical stability of the solvers but also, in some cases, their performance. %B SIAM Journal on Scientific Computing %V 37 %P C203-C330 %8 2015-05 %G eng %R DOI:10.1137/14M0973773 %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra %D 2015 %T Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs %A Ichitaro Yamazaki %A Jesse Barlow %A Stanimire Tomov %A Jakub Kurzak %A Jack Dongarra %X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches. 
%B 2015 SIAM Conference on Applied Linear Algebra %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2015 %T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %K lu factorization %K Numerical algorithms %K QR factorization %K Stability; Performance %X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap in terms of floating-point operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performances. 
%B Journal of Parallel and Distributed Computing %V 85 %P 32-46 %8 2015-11 %G eng %R doi:10.1016/j.jpdc.2015.06.007 %0 Conference Paper %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) %D 2015 %T Optimization for Performance and Energy for Batched Matrix Computations on GPUs %A Azzam Haidar %A Tingxing Dong %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). 
Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5× speedup on the K40 GPU. %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) %I ACM %C San Francisco, CA %8 2015-02 %G eng %R 10.1145/2716282.2716288 %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. 
%B Supercomputing Frontiers and Innovations %V 2 %8 2015-10 %G eng %R 10.14529/jsfi1504 %0 Conference Paper %B 2015 IEEE International Conference on Cluster Computing %D 2015 %T PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution %A Anthony Danalis %A Heike Jagode %A George Bosilca %A Jack Dongarra %K dag %K parsec %K ptg %K tasks %X Task-based execution has been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. The Parallel Runtime Scheduling and Execution Control (PARSEC) framework is a task-based runtime system that we designed to achieve high performance computing at scale. PARSEC offers a programming paradigm that is different than what has been traditionally used to develop large scale parallel scientific applications. In this paper, we discuss the use of PARSEC to convert a part of the Coupled Cluster (CC) component of the Quantum Chemistry package NWCHEM into a task-based form. We explain how we organized the computation of the CC methods in individual tasks with explicitly defined data dependencies between them and re-integrated the modified code into NWCHEM. We present a thorough performance evaluation and demonstrate that the modified code outperforms the original by more than a factor of two. We also compare the performance of different variants of the modified code and explain the different behaviors that lead to the differences in performance. 
%B 2015 IEEE International Conference on Cluster Computing %I IEEE %C Chicago, IL %8 2015-09 %G eng %0 Conference Paper %B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award %D 2015 %T Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Eigenvalues problem %K Hessenberg reduction %K Multi/Many-core %K Stabilized Elementary Transformations %X The solution of nonsymmetric eigenvalue problems, Ax = λx, can be accelerated substantially by first reducing A to an upper Hessenberg matrix H that has the same eigenvalues as A. This can be done using Householder orthogonal transformations, which is a well-established standard, or stabilized elementary transformations. The latter approach, although having half the flops of the former, has been used less in practice, e.g., on computer architectures with well-developed hierarchical memories, because of its memory-bound operations and the complexity in stabilizing it. In this paper we revisit the stabilized elementary transformations approach in the context of new architectures – both multicore CPUs and Xeon Phi coprocessors. We derive, for the first time, a blocked version of the algorithm. The blocked version reduces the memory-bound operations, and we analyze its performance. A performance model is developed that shows the limitations of both approaches. The competitiveness of using stabilized elementary transformations has been quantified, highlighting that it can be 20 to 30% faster on current high-end multicore CPUs and Xeon Phi coprocessors. 
%B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award %C Alexandria, VA %8 2015-04 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS 2015) %D 2015 %T Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %B International Conference on Computational Science (ICCS 2015) %C Reykjavík, Iceland %8 2015-06 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs %A Theo Mary %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B 22nd European MPI Users' Group Meeting %D 2015 %T Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %X Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation of the Revoke MPI operation. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. 
Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods. %B 22nd European MPI Users' Group Meeting %I ACM %C Bordeaux, France %8 2015-09 %G eng %R 10.1145/2802658.2802668 %0 Generic %D 2015 %T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Marc Gamell %A Keita Teranishi %A Manish Parashar %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %8 2015-04 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Marc Gamell %A Keita Teranishi %A Manish Parashar %A Jack Dongarra %X The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault tolerant scientific applications. 
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Random-Order Alternating Schwarz for Sparse Triangular Solves %A Hartwig Anzt %A Edmond Chow %A Daniel Szyld %A Jack Dongarra %X Block-asynchronous Jacobi is an iteration method where a locally synchronous iteration is embedded in an asynchronous global iteration. The unknowns are partitioned into small subsets, and while the components within the same subset are iterated in Jacobi fashion, no update order between the subsets is enforced. The values of the non-local entries remain constant during the local iterations, which can result in slow inter-subset information propagation and slow convergence. Interpreting the subsets as subdomains allows the concept of domain overlap, which typically enhances information propagation, to be transferred to block-asynchronous solvers. In this talk we explore the impact of overlapping domains on the convergence and performance of block-asynchronous Jacobi iterations, and present results obtained by running this solver class on state-of-the-art HPC systems. 
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems %A Fengguang Song %A Jack Dongarra %K dense linear algebra %K distributed dataflow scheduling %K heterogeneous HPC systems %K runtime systems %X Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs. 
%B Concurrency and Computation: Practice and Experience %V 27 %P 3702-3723 %8 2015-09 %G eng %N 14 %R 10.1002/cpe.3403 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination %A Simplice Donfack %A Jack Dongarra %A Mathieu Faverge %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %K Gaussian elimination %K lu factorization %K Multicore %K parallel %K shared memory %X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy are analyzed. %B Concurrency and Computation: Practice and Experience %V 27 %P 1292-1309 %8 2015-04 %G eng %N 5 %R 10.1002/cpe.3306 %0 Journal Article %J IEEE Computer %D 2015 %T The TOP500 List and Progress in High-Performance Computing %A Erich Strohmaier %A Hans Meuer %A Jack Dongarra %A Horst D. 
Simon %K application performance %K Benchmark testing %K benchmarks %K Computer architecture %K High Performance Computing %K High-performance computing %K Linpack %K Market research %K Parallel computing %K Program processors %K Scientific computing %K Supercomputers %K top500 %X For more than two decades, the TOP500 list has enjoyed incredible success as a metric for supercomputing performance and as a source of data for identifying technological trends. The project's editors reflect on its usefulness and limitations for guiding large-scale scientific computing into the exascale era. %B IEEE Computer %V 48 %P 42-49 %8 2015-11 %G eng %N 11 %R doi:10.1109/MC.2015.338 %0 Generic %D 2015 %T Towards a High-Performance Tensor Algebra Package for Accelerators %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %I Smoky Mountains Computational Sciences and Engineering Conference (SMC15) %C Gatlinburg, TN %8 2015-09 %G eng %0 Conference Paper %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015 %D 2015 %T Towards Batched Linear Solvers on Accelerated Hardware Platforms %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. 
In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU’s symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to a batched LU factorization featured in NVIDIA’s CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU. %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015 %I ACM %C San Francisco, CA %8 2015-02 %G eng %0 Conference Paper %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15) %D 2015 %T Tuning Stationary Iterative Solvers for Fault Resilience %A Hartwig Anzt %A Jack Dongarra %A Enrique S. Quintana-Orti %X As the transistor’s feature size decreases following Moore’s Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications, and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error resilient iterative solvers for sparse linear systems. 
Contrary to most previous approaches, based on Krylov subspace methods, for this purpose we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation. %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B 2nd Workshop on Visual Performance Analysis (VPA '15) %D 2015 %T Visualizing Execution Traces with Task Dependencies %A Blake Haugen %A Stephen Richmond %A Jakub Kurzak %A Chad A. Steed %A Jack Dongarra %X Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule the task in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. 
The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads. This paper presents a brief description of task-based schedulers and the performance data they produce. It will then describe an interactive extension to the current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler. %B 2nd Workshop on Visual Performance Analysis (VPA '15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Proceedings %B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15) %D 2015 %T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators %A Azzam Haidar %A Yulu Jia %A Piotr Luszczek %A Stanimire Tomov %A Asim YarKhan %A Jack Dongarra %K dataflow scheduling %K hardware accelerators %K multi-grain parallelism %X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. 
The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications. %B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15) %I ACM %C Austin, TX %V No. 5 %8 2015-11 %G eng %0 Conference Paper %B VECPAR 2014 %D 2014 %T Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem %A Mark Gates %A Azzam Haidar %A Jack Dongarra %X In the nonsymmetric eigenvalue problem, work has focused on the Hessenberg reduction and QR iteration, using efficient algorithms and fast, Level 3 BLAS routines. Comparatively, computation of eigenvectors performs poorly, limited to slow, Level 2 BLAS performance with little speedup on multi-core systems. It has thus become a dominant cost in the eigenvalue problem. To address this, we present improvements for the eigenvector computation to use Level 3 BLAS where applicable and parallelize the remaining triangular solves, achieving good parallel scaling and accelerating the overall eigenvalue problem more than three-fold. 
%B VECPAR 2014 %C Eugene, OR %8 2014-06 %G eng %0 Book Section %B Numerical Computations with GPUs %D 2014 %T Accelerating Numerical Dense Linear Algebra Calculations with GPUs %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %B Numerical Computations with GPUs %I Springer International Publishing %P 3-28 %@ 978-3-319-06547-2 %G eng %& 1 %R 10.1007/978-3-319-06548-9_1 %0 Generic %D 2014 %T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone not only for block-Krylov methods, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM vs. SpMV-based algorithms is up to six on GPUs like NVIDIA’s K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly-optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5x faster on a K40 GPU. 
For comparison to other computational loads, the same GPU to CPU performance acceleration is observed for the SpMV product, as well as dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5x. In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2014-10 %G eng %0 Conference Paper %B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining %D 2014 %T Access-averse Framework for Computing Low-rank Matrix Approximations %A Ichitaro Yamazaki %A Theo Mary %A Jakub Kurzak %A Stanimire Tomov %A Jack Dongarra %B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining %C Washington, DC %8 2014-10 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2014 %T Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaief %A Piotr Luszczek %K factorization %K parallel linear algebra %K recursion %K shared memory synchronization %K threaded parallelism %X The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. 
In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. %B Concurrency and Computation: Practice and Experience %V 26 %P 1408-1431 %8 2014-05 %G eng %U http://doi.wiley.com/10.1002/cpe.3110 %N 7 %! Concurrency Computat.: Pract. Exper. 
%& 1408 %R 10.1002/cpe.3110 %0 Conference Paper %B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014 %D 2014 %T Assessing the Impact of ABFT and Checkpoint Composite Strategies %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K resilience %X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data needed to be checkpointed. 
%B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B International Workshop on OpenCL %D 2014 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B International Workshop on OpenCL %C Bristol University, England %8 2014-05 %G eng %0 Journal Article %J SIAM Journal on Matrix Analysis and Applications %D 2014 %T Communication-Avoiding Symmetric-Indefinite Factorization %A Grey Ballard %A Dulceneia Becker %A James Demmel %A Jack Dongarra %A Alex Druinsky %A I Peled %A Oded Schwartz %A Sivan Toledo %A Ichitaro Yamazaki %X We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen’s triangular tridiagonalization. 
It factors a dense symmetric matrix A as the product $A = P L T L^T P^T$, where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal. %B SIAM Journal on Matrix Analysis and Applications %V 35 %P 1364-1406 %8 2014-07 %G eng %N 4 %0 Conference Paper %B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS) %D 2014 %T Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems %A Marc Baboulin %A Jack Dongarra %A Remi Lacroix %X This paper presents an efficient computation for least squares conditioning or estimates of it. We present performance results using new routines on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix for which, to our knowledge, there is no implementation in the current public domain libraries LAPACK and ScaLAPACK. 
%B International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS) %C Waterloo, Ontario, CA %8 2014-08 %G eng %0 Conference Paper %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2014 %T Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %C New Orleans, LA %8 2014-11 %G eng %0 Conference Paper %B Workshop on Large-Scale Parallel Processing, IPDPS 2014 %D 2014 %T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K dataflow %K message-passing %K multithreading %K QR decomposition %K runtime %K systolic array %X A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems. 
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Generic %D 2014 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B ICL Technical Report %I University of Tennessee %8 2014-11 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Designing LU-QR Hybrid Solvers for Performance and Stability %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap in terms of operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. 
The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %@ 978-1-4799-3800-1 %G eng %R 10.1109/IPDPS.2014.108 %0 Conference Paper %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %D 2014 %T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs %A Simplice Donfack %A Stanimire Tomov %A Jack Dongarra %X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer’s characteristics of the underlying machine. We illustrate our method for the LU factorization. 
For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU up to 2.4x compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements. %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %8 2014-05 %G eng %0 Journal Article %J Parallel Computing %D 2014 %T An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems %A Marc Baboulin %A Du Becker %A George Bosilca %A Anthony Danalis %A Jack Dongarra %K Distributed linear algebra solvers %K LDLT factorization %K PaRSEC runtime %K Randomized algorithms %K Symmetric indefinite systems %X Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. 
The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, but requires only half the execution time and half the amount of data storage of a general dense solver. %B Parallel Computing %V 40 %P 213-223 %8 2014-07 %G eng %N 7 %R 10.1016/j.parco.2013.12.003 %0 Conference Paper %B International Conference on Parallel Processing (ICPP-2014) %D 2014 %T A Fast Batched Cholesky Factorization on a GPU %A Tingxing Dong %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, which is usually referred to as batched problems, is not given adequate attention. In this paper, we propose a batched Cholesky factorization on a GPU. Three algorithms – nonblocked, blocked, and recursive blocked – were examined. The left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking Cholesky version is used to update the trailing matrix in the recursive blocked algorithm. Our batched Cholesky achieves up to 1.8x speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver which targets large matrix sizes. Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. Such an implementation does not depend on the speed of the CPU. Compared to the MAGMA library, our full GPU solution achieves 85% of the hybrid MAGMA performance which uses 16 Sandy Bridge cores, in addition to a K40 Nvidia GPU. 
Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%, and finally, in terms of energy consumption, we outperform MAGMA by 1.5x in performance-per-watt for large matrices. %B International Conference on Parallel Processing (ICPP-2014) %C Minneapolis, MN %8 2014-09 %G eng %0 Conference Paper %B VECPAR 2014 %D 2014 %T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K Computer science %K factorization %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given as the basis for solving linear systems’ algorithms – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components. 
%B VECPAR 2014 %C Eugene, OR %8 2014-06 %G eng %0 Conference Paper %B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014 %D 2014 %T Hybrid Multi-Elimination ILU Preconditioners on GPUs %A Dimitar Lukarski %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems. %B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Generic %D 2014 %T Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X Numerical methods in sparse linear algebra typically rely on a fast and efficient matrix vector product, as this usually is the backbone of iterative algorithms for solving eigenvalue problems or linear systems. 
Against the background of a large diversity in the characteristics of high performance computer architectures, it is a challenge to derive a cross-platform efficient storage format along with fast matrix vector kernels. Recently, attention focused on the SELL-C-σ format, a sliced ELLPACK format enhanced by row-sorting to reduce the fill-in when padding rows with zeros. In this paper we propose an additional modification resulting in the padded sliced ELLPACK (SELLP) format, for which we develop a sparse matrix vector CUDA kernel that is able to efficiently exploit the computing power of NVIDIA GPUs. We show that the kernel we developed outperforms straight-forward implementations for the widespread CSR and ELLPACK formats, and is highly competitive with the implementations in the highly optimized CUSPARSE library. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2014-04 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Improving the performance of CA-GMRES on multicores with multiple GPUs %A Ichitaro Yamazaki %A Hartwig Anzt %A Stanimire Tomov %A Mark Hoemmen %A Jack Dongarra %X The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming a crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. 
We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NVIDIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2014 %T Looking Back at Dense Linear Algebra Software %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %K decompositional approach %K dense linear algebra %K parallel algorithms %X Over the years, computational physics and chemistry served as an ongoing source of problems that demanded the ever increasing performance from hardware as well as the software that ran on top of it. Most of these problems could be translated into solutions for systems of linear equations: the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We argue that dramatic changes in hardware designs precipitated by the shifting nature of the marketplace of computer hardware had a continuous effect on the software for numerical linear algebra. The extraction of high percentages of peak performance continues to require adaptation of software. If the past history of this adaptive nature of linear algebra software is any guide, then the future theme will feature changes as well: changes aimed at harnessing the incredible advances of the evolving hardware infrastructure. 
%B Journal of Parallel and Distributed Computing %V 74 %P 2548–2560 %8 2014-07 %G eng %N 7 %& 2548 %R 10.1016/j.jpdc.2013.10.005 %0 Conference Paper %B 16th IEEE International Conference on High Performance Computing and Communications (HPCC) %D 2014 %T LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A James Harris %A Stanimire Tomov %A Jack Dongarra %X Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of a few large linear systems. The size of each of these small linear systems depends, for example, on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve the numerical stability of the Gaussian Elimination, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it brings in thread divergence and non-coalesced memory accesses. The state-of-the-art libraries for linear algebra that target GPUs, such as MAGMA, focus on large matrix sizes. They change the data layout by transposing the matrix to avoid these divergence and non-coalescing penalties. However, the data movement associated with transposition is very expensive for small matrices. In this paper, we propose a batched LU factorization for GPUs by using a multi-level blocked right looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction network simulation. 
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC) %I IEEE %C Paris, France %8 2014-08 %G eng %0 Conference Paper %B IPASS-2014 %D 2014 %T MIAMI: A Framework for Application Performance Diagnosis %A Gabriel Marin %A Jack Dongarra %A Dan Terpstra %X A typical application tuning cycle repeats the following three steps in a loop: performance measurement, analysis of results, and code refactoring. While performance measurement is well covered by existing tools, analysis of results to understand the main sources of inefficiency and to identify opportunities for optimization is generally left to the user. Today's state of the art performance analysis tools use instrumentation or hardware counter sampling to measure the performance of interactions between code and the target architecture during execution. Such measurements are useful to identify hotspots in applications, places where execution time is spent or where cache misses are incurred. However, explanatory understanding of tuning opportunities requires a more detailed, mechanistic modeling approach. This paper presents MIAMI (Machine Independent Application Models for performance Insight), a set of tools for automatic performance diagnosis. MIAMI uses application characterization and models of target architectures to reason about an application's performance. MIAMI uses a modeling approach based on first-order principles to identify performance bottlenecks, pinpoint optimization opportunities, and compute bounds on the potential for improvement. 
%B IPASS-2014 %I IEEE %C Monterey, CA %8 2014-03 %@ 978-1-4799-3604-5 %G eng %R 10.1109/ISPASS.2014.6844480 %0 Conference Paper %B VECPAR 2014 (Best Paper) %D 2014 %T Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs %A Ichitaro Yamazaki %A Stanimire Tomov %A Tingxing Dong %A Jack Dongarra %X We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32 or 64-bit floating-point precision, but uses higher-precision arithmetics to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetics, and requires about 20x more computation but about the same amount of communication as the standard orthogonalization scheme. Since the computation is becoming less expensive compared to the communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed-precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance. %B VECPAR 2014 (Best Paper) %C Eugene, OR %8 2014-06 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2014 %T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems %A Jack Dongarra %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Asim YarKhan %K dense linear algebra %K hardware accelerators %K task superscalar scheduling %X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead has become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. 
Commonly, in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware. 
%B Supercomputing Frontiers and Innovations %V 1 %G eng %N 1 %R http://dx.doi.org/10.14529/jsfi1401 %0 Conference Paper %B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper) %D 2014 %T New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem %A Azzam Haidar %A Piotr Luszczek %A Jack Dongarra %X We describe a design and implementation of a multi-stage algorithm for computing eigenvectors of a dense symmetric matrix. We show that reformulating the existing algorithms is beneficial in terms of performance even if that doubles the computational complexity. Through detailed analysis, we show that the effect of the increase in the asymptotic operation count may be compensated by a much improved performance rate. Our performance results indicate that using our approach achieves very good speedup and scalability even when directly compared with the existing state-of-the-art software. %B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper) %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %R 10.1109/IPDPSW.2014.130 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2014 %T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks %A Azzam Haidar %A Raffaele Solcà %A Mark Gates %A Stanimire Tomov %A Thomas C. Schulthess %A Jack Dongarra %K Eigensolver %K electronic structure calculations %K generalized eigensolver %K gpu %K high performance %K hybrid %K Multicore %K two-stage %X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. 
These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code. %B International Journal of High Performance Computing Applications %V 28 %P 196-209 %8 2014-05 %G eng %N 2 %& 196 %R 10.1177/1094342013502097 %0 Conference Paper %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %D 2014 %T Optimizing Krylov Subspace Solvers on Graphics Processing Units %A Stanimire Tomov %A Piotr Luszczek %A Ichitaro Yamazaki %A Jack Dongarra %A Hartwig Anzt %A William Sawyer %X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. 
In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPU-host communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of high-level sparse linear algebra libraries. %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %D 2014 %T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors %A Azzam Haidar %A Chongxiao Cao %A Ichitaro Yamazaki %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. 
Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly. %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %I IEEE %C New Orleans, LA %8 2014-11 %G eng %R 10.1109/ScalA.2014.8 %0 Journal Article %J International Journal of Networking and Computing %D 2014 %T Performance and Reliability Trade-offs for the Double Checkpointing Algorithm %A Jack Dongarra %A Thomas Herault %A Yves Robert %K communication contention %K in-memory checkpoint %K performance %K resilience %K risk %X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [23], with the non-blocking algorithm of Ni, Meneses and Kalé [15] in terms of both performance and risk. We also extend the model proposed in [23, 15] to assess the impact of the overhead associated with non-blocking communications. In addition, we deal with arbitrary failure distributions (as opposed to uniform distributions in [23]). We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work without additional memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations. 
%B International Journal of Networking and Computing %V 4 %P 32-41 %8 2014 %G eng %& 32 %0 Generic %D 2014 %T Performance of Various Computers Using Standard Linear Equations Software, (Linpack Benchmark Report) %A Jack Dongarra %X This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2014-06 %G eng %0 Conference Paper %B 2014 IEEE International Conference on Cluster Computing %D 2014 %T Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models %A Heike McCraw %A James Ralph %A Anthony Danalis %A Jack Dongarra %X For more than a decade, the PAPI performance-monitoring library has provided a clear, portable interface to the hardware performance counters available on all modern CPUs and other components of interest (e.g., GPUs, network, and I/O systems). Most major end-user tools that application developers use to analyze the performance of their applications rely on PAPI to gain access to these performance counters. One of the critical road-blockers on the way to larger, more complex high performance systems, has been widely identified as being the energy efficiency constraints. With modern extreme scale machines having hundreds of thousands of cores, the ability to reduce power consumption for each CPU at the software level becomes critically important, both for economic and environmental reasons. In order for PAPI to continue playing its well established role in HPC, it is pressing to provide valuable performance data that not only originates from within the processing cores but also delivers insight into the power consumption of the system as a whole. 
An extensive effort has been made to extend the Performance API to support power monitoring capabilities for various platforms. This paper provides detailed information about three components that allow power monitoring on the Intel Xeon Phi and Blue Gene/Q. Furthermore, we discuss the integration of PAPI in PaRSEC – a task-based dataflow-driven execution engine – enabling hardware performance counter and power monitoring at true task granularity. %B 2014 IEEE International Conference on Cluster Computing %I IEEE %C Madrid, Spain %8 2014-09 %G eng %R 10.1109/CLUSTER.2014.6968672 %0 Conference Paper %B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %D 2014 %T PTG: An Abstraction for Unhindered Parallelism %A Anthony Danalis %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %X Increased parallelism and use of heterogeneous computing resources is now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice was invariably derived from Coarse Grain Parallelism with explicit data movements. We argue that message passing has remained the de facto standard in HPC because, until now, the ever increasing challenges that application developers had to address to create efficient portable applications remained manageable for expert programmers.

Data-flow based programming is an alternative approach with significant potential. In this paper, we discuss the Parameterized Task Graph (PTG) abstraction and present the specialized input language that we use to specify PTGs in our data-flow task-based runtime system, PaRSEC. This language and the corresponding execution model are in contrast with the execution model of explicit message passing as well as the model of alternative task based runtime systems. The Parameterized Task Graph language decouples the expression of the parallelism in the algorithm from the control-flow ordering, load balance, and data distribution. Thus, programs are more adaptable and map more efficiently on challenging hardware, as well as maintain portability across diverse architectures. To support these claims, we discuss the different challenges of HPC programming and how PaRSEC can address them, and we demonstrate that in today’s large scale supercomputers, PaRSEC can significantly outperform state-of-the-art MPI applications and libraries, a trend that will increase with future architectural evolution.

%B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %I IEEE Press %C New Orleans, LA %8 2014-11 %G eng %0 Generic %D 2014 %T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson. %B University of Tennessee EECS Technical Report %I University of Tennessee %8 2014-11 %G eng %0 Conference Proceedings %B International conference on Supercomputing %D 2014 %T Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores %A Fengguang Song %A Jack Dongarra %X While the growing number of cores per chip allows researchers to solve larger scientific and engineering problems, the parallel efficiency of the deployed parallel software starts to decrease. This unscalability problem happens to both vendor-provided and open-source software and wastes CPU cycles and energy. By expecting CPUs with hundreds of cores to be imminent, we have designed a new framework to perform matrix computations for massively many cores. Our performance analysis on manycore systems shows that the unscalability bottleneck is related to Non-Uniform Memory Access (NUMA): memory bus contention and remote memory access latency. To overcome the bottleneck, we have designed NUMA-aware tile algorithms with the help of a dynamic scheduling runtime system to minimize NUMA memory accesses. 
The main idea is to identify the data that is either read a number of times or written once by a thread resident on a remote NUMA node, then utilize the runtime system to conduct data caching and movement between different NUMA nodes. Based on the experiments with QR factorizations, we demonstrate that our framework is able to achieve great scalability on a 48-core AMD Opteron system (e.g., parallel efficiency drops only 3% from one core to 48 cores). We also deploy our framework to an extreme-scale shared-memory SGI machine which has 1024 CPU cores and runs a single Linux operating system image. Our framework continues to scale well, and can outperform the vendor-optimized Intel MKL library by up to 750%. %B International conference on Supercomputing %I ACM %C Munich, Germany %P 333-342 %8 2014-06 %@ 978-1-4503-2642-1 %G eng %R 10.1145/2597652.2597670 %0 Conference Paper %B VECPAR 2014 %D 2014 %T Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures %A Hartwig Anzt %A Dimitar Lukarski %A Stanimire Tomov %A Jack Dongarra %X Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this two-fold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-colored Gauss-Seidel, ILU(0) and multi-colored ILU(0), shows that the two-fold goal of a numerically stable cross-platform performant algorithm is achieved. 
%B VECPAR 2014 %C Eugene, OR %8 2014-06 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU %A Tingxing Dong %A Veselin Dobrev %A Tzanio Kolev %A Robert Rieben %A Stanimire Tomov %A Jack Dongarra %K Computer science %K CUDA %K FEM %K Finite element method %K linear algebra %K nVidia %K Tesla K20 %X Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs [10, 1]. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor’s library routines substantially in performance. We proposed an autotuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5x overall speedup and 1.42x energy efficiency (greenup) using 4th order (Q4) finite elements, and a 1.9x speedup and 1.27x greenup using 2nd order (Q2) finite elements. 
%B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment %A Azzam Haidar %A Chongxiao Cao %A Jack Dongarra %A Piotr Luszczek %A Stanimire Tomov %K algorithms %K Computer science %K CUDA %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %K nVidia %K Tesla K20 %K Tesla M2090 %X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware. 
%B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 2014 IEEE International Conference on Cluster Computing %D 2014 %T Utilizing Dataflow-based Execution for Coupled Cluster Methods %A Heike McCraw %A Anthony Danalis %A George Bosilca %A Jack Dongarra %A Karol Kowalski %A Theresa Windus %X Computational chemistry comprises one of the driving forces of High Performance Computing. In particular, many-body methods, such as Coupled Cluster (CC) methods of the quantum chemistry package NWCHEM, are of particular interest for the applied chemistry community. Harnessing large fractions of the processing power of modern large scale computing platforms has become increasingly difficult. With the increase in scale, complexity, and heterogeneity of modern platforms, traditional programming models fail to deliver the expected performance scalability. On our way to Exascale and with these extremely hybrid platforms, dataflow-based programming models may be the only viable way for achieving and maintaining computation at scale. In this paper, we discuss a dataflow-based programming model and its applicability to NWCHEM’s CC methods. Our dataflow version of the CC kernels breaks down the algorithm into fine-grained tasks with explicitly defined data dependencies. As a result, many of the traditional synchronization points can be eliminated, allowing for a dynamic reshaping of the execution based on the ongoing availability of computational resources. We build this experiment using PARSEC – a task-based dataflow-driven execution engine – that enables efficient task scheduling on distributed systems, providing a desirable portability layer for application developers. 
%B 2014 IEEE International Conference on Cluster Computing %I IEEE %C Madrid, Spain %8 2014-09 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (also LAWN 246) %D 2013 %T Accelerating Linear System Solutions Using Randomization Techniques %A Marc Baboulin %A Jack Dongarra %A Julien Herrmann %A Stanimire Tomov %K algorithms %K dense linear algebra %K experimentation %K graphics processing units %K linear systems %K lu factorization %K multiplicative preconditioning %K numerical linear algebra %K performance %K randomization %X We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at a very affordable computational price while providing us with a satisfying accuracy when compared to partial pivoting. This random transformation called Partial Random Butterfly Transformation (PRBT) is optimized in terms of data storage and flops count. We propose a solver where PRBT and the LU factorization with no pivoting take advantage of the current hybrid multicore/GPU machines and we compare its Gflop/s performance with a solver implemented in a current parallel library. %B ACM Transactions on Mathematical Software (also LAWN 246) %V 39 %8 2013-02 %G eng %U http://dl.acm.org/citation.cfm?id=2427025 %N 2 %R 10.1145/2427023.2427025 %0 Generic %D 2013 %T Assessing the impact of ABFT and Checkpoint composite strategies %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K resilience %X Algorithm-specific fault tolerant approaches promise unparalleled scalability and performance in failure-prone environments. 
With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol, that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight this approach drastically increases the performance delivered by the system, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data needed to be checkpointed. 
%B University of Tennessee Computer Science Technical Report %G eng %0 Conference Paper %B International Supercomputing Conference 2013 (ISC'13) %D 2013 %T Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q %A Heike McCraw %A Dan Terpstra %A Jack Dongarra %A Kris Davis %A Roy Musselman %B International Supercomputing Conference 2013 (ISC'13) %I Springer %C Leipzig, Germany %8 2013-06 %G eng %0 Journal Article %J The Computer Journal %D 2013 %T BlackjackBench: Portable Hardware Characterization with Automated Results Analysis %A Anthony Danalis %A Piotr Luszczek %A Gabriel Marin %A Jeffrey Vetter %A Jack Dongarra %K hardware characterization %K micro-benchmarks %K statistical analysis %X DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes the application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, plus statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We aim at hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and non-uniform memory access characteristics of the system, properties of the processing cores affecting the instruction execution speed and the length of the operating system scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features could potentially interfere with each other resulting in incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation of results. 
We show how effective hardware metrics from our probes allow guided tuning of computational kernels that outperform an autotuning library further tuned by the hardware vendor. %B The Computer Journal %8 2013-03 %G eng %R 10.1093/comjnl/bxt057 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T A Block-Asynchronous Relaxation Method for Graphics Processing Units %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %X In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPU's computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs- or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes–using multiple iterations within the “subdomain” handled by a GPU thread block–we manage to not only recover the loss of global convergence but often accelerate convergence of up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing. 
%B Journal of Parallel and Distributed Computing %V 73 %P 1613–1626 %8 2013-12 %G eng %N 12 %R http://dx.doi.org/10.1016/j.jpdc.2013.05.008 %0 Generic %D 2013 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B University of Tennessee Technical Report (Lawn 275) %I University of Tennessee %8 2013-03 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X With our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. 
One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes. %B Concurrency and Computation: Practice and Experience %V 25 %P 572-585 %8 2013-03 %G eng %N 4 %R 10.1002/cpe.2859 %0 Conference Proceedings %B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2013 %T CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience %A Yulu Jia %A Piotr Luszczek %A George Bosilca %A Jack Dongarra %X Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. 
At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur. %B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %C Montpellier, France %8 2013-11 %G eng %0 Journal Article %J Scalable Computing and Communications: Theory and Practice %D 2013 %T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %E Samee Khan %E Lin-Wang Wang %E Albert Zomaya %B Scalable Computing and Communications: Theory and Practice %I John Wiley & Sons %P 699-735 %8 2013-03 %G eng %0 Generic %D 2013 %T Designing LU-QR hybrid solvers for performance and stability %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 282) %I University of Tennessee %8 2013-10 %G eng %0 Generic %D 2013 %T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs %A Simplice Donfack %A Stanimire Tomov %A Jack Dongarra %X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU computing approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. 
We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on high-end hybrid CPU/GPU systems show that our dynamically balanced synchronization-avoiding LU is both multicore and GPU scalable. Comparisons with state-of-the-art libraries like MKL (for multicore) and MAGMA (for hybrid systems) are provided, demonstrating significant performance improvements. The approach is applicable to other linear algebra algorithms. The scheduling mechanisms and tuning models can be incorporated into respectively dynamic runtime systems/schedulers and autotuning frameworks for hybrid CPU/MIC/GPU architectures. %B University of Tennessee Computer Science Technical Report %8 2013-07 %G eng %0 Conference Paper %B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems %D 2013 %T Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures %A Volodymyr Turchenko %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K many-core system %K parallel batch pattern training %K parallelization efficiency %K recirculation neural network %X The experimental research of the parallel batch pattern back propagation training algorithm on the example of recirculation neural network on many-core high performance computing systems is presented in this paper. The choice of recirculation neural network among the multilayer perceptron, recurrent and radial basis neural networks is proved. The model of a recirculation neural network and usual sequential batch pattern algorithm of its training are theoretically described. An algorithmic description of the parallel version of the batch pattern training method is presented. The experimental research is fulfilled using the Open MPI, Mvapich and Intel MPI message passing libraries. 
The results obtained on many-core AMD system and Intel MIC are compared with the results obtained on a cluster system. Our results show that the parallelization efficiency is about 95% on 12 cores located inside one physical AMD processor for the considered minimum and maximum scenarios. The parallelization efficiency is about 70-75% on 48 AMD cores for the minimum and maximum scenarios. These results are higher by 15-36% (depending on the version of MPI library) in comparison with the results obtained on 48 cores of a cluster system. The parallelization efficiency obtained on Intel MIC architecture is surprisingly low, asking for deeper analysis. %B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems %C Berlin, Germany %8 2013-09 %G eng %0 Journal Article %J Journal of Supercomputing %D 2013 %T Enabling Workflows in GridSolve: Request Sequencing and Service Trading %A Yinan Li %A Asim YarKhan %A Jack Dongarra %A Keith Seymour %A Aurélie Hurault %K grid computing %K gridpac %K netsolve %K service trading %K workflow applications %X GridSolve employs an RPC-based client-agent-server model for solving computational problems. There are two deficiencies associated with GridSolve when a computational problem essentially forms a workflow consisting of a sequence of tasks with data dependencies between them. First, intermediate results are always passed through the client, resulting in unnecessary data transport. Second, since the execution of each individual task is a separate RPC session, it is difficult to enable any potential parallelism among tasks. This paper presents a request sequencing technique that addresses these deficiencies and enables workflow executions. Building on the request sequencing work, one way to generate workflows is by taking higher level service requests and decomposing them into a sequence of simpler service requests using a technique called service trading. 
A service trading component is added to GridSolve to take advantage of the new dynamic request sequencing. The features described here include automatic DAG construction and data dependency analysis, direct interserver data transfer, parallel task execution capabilities, and a service trading component. %B Journal of Supercomputing %V 64 %P 1133-1152 %8 2013-06 %G eng %N 3 %& 1133 %R 10.1007/s11227-010-0549-1 %0 Journal Article %J Computing %D 2013 %T An evaluation of User-Level Failure Mitigation support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %K Fault tolerance %K MPI %K User-level fault mitigation %X As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures. %B Computing %V 95 %P 1171-1184 %8 2013-12 %G eng %N 12 %R 10.1007/s00607-013-0331-3 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X Most predictions of exascale machines picture billion ways parallelism, encompassing not only millions of cores but also tens of thousands of nodes. 
Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: (i) traditional checkpoint-based approaches incur a steep overhead on failure free operations and (ii) the dominant programming paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support of software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI Standard, to enable advanced forward recovery techniques, without incurring the overhead of customary periodic checkpointing. With our approach, when failure strikes, applications regain control to make a checkpoint before quitting execution. This checkpoint is in reaction to the failure occurrence rather than periodic. This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example. Published 2013. This article is a US Government work and is in the public domain in the USA. %B Concurrency and Computation: Practice and Experience %8 2013-07 %G eng %U http://doi.wiley.com/10.1002/cpe.3100 %! Concurrency Computat.: Pract. Exper. 
%R 10.1002/cpe.3100 %0 Journal Article %J Parallel Computing %D 2013 %T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems %A Jack Dongarra %A Mathieu Faverge %A Thomas Herault %A Mathias Jacquelin %A Julien Langou %A Yves Robert %K Cluster %K Distributed memory %K Hierarchical architecture %K multi-core %K numerical linear algebra %K QR factorization %X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, “communication-avoiding”), it is natural to consider hierarchical trees composed of an “inter-node” tree which acts on top of “intra-node” trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) “TS level” for cache-friendliness, (1) “low-level” for decoupled highly parallel inter-node reductions, (2) “domino level” to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. 
Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to building up performance and (ii) provide insight into how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms. %B Parallel Computing %V 39 %P 212-232 %8 2013-05 %G eng %N 4-5 %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2013 %T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures %A Hatem Ltaief %A Piotr Luszczek %A Jack Dongarra %K algorithms %K bidiagonal reduction %K bulge chasing %K data translation layer %K dynamic scheduling %K high performance kernels %K performance %K tile algorithms %K two-stage approach %X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. 
The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows most of the computation during the first stage (reduction to band form) to be cast into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2. 
%B ACM Transactions on Mathematical Software (TOMS) %V 39 %G eng %N 3 %R 10.1145/2450153.2450154 %0 Book Section %B Contemporary High Performance Computing: From Petascale Toward Exascale %D 2013 %T HPC Challenge: Design, History, and Implementation Highlights %A Jack Dongarra %A Piotr Luszczek %K exascale %K hpc challenge %K hpcc %B Contemporary High Performance Computing: From Petascale Toward Exascale %I Taylor and Francis %C Boca Raton, FL %@ 978-1-4665-6834-1 %G eng %& 2 %0 Generic %D 2013 %T Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters %A Tingxing Dong %A Veselin Dobrev %A Tzanio Kolev %A Robert Rieben %A Stanimire Tomov %A Jack Dongarra %X The explosion of parallelism and heterogeneity in today's computer architectures has created opportunities as well as challenges for redesigning legacy numerical software to harness the power of new hardware. In this paper we address the main challenges in redesigning BLAST, a numerical library that solves the equations of compressible hydrodynamics using high order finite element methods (FEM) in a moving Lagrangian frame, to support CPU-GPU clusters. We use a hybrid MPI + OpenMP + CUDA programming model that includes two layers: domain decomposed MPI parallelization and OpenMP + CUDA acceleration in a given domain. To optimize the code, we implemented custom linear algebra kernels and introduced an auto-tuning technique to deal with heterogeneity and load balancing at runtime. Our tests show that 12 Intel Xeon cores and two M2050 GPUs deliver a 24x speedup compared to a single core, and a 2.5x speedup compared to 12 MPI tasks in one node. Further, we achieve perfect weak scaling, demonstrated on a cluster with up to 64 GPUs in 32 nodes. Our choice of programming model and proposed solutions, as related to parallelism and load balancing, specifically target high order FEM discretizations, and can be used equally successfully for applications beyond hydrodynamics. 
A major accomplishment is that we further establish the appeal of high order FEMs, which despite their better approximation properties, are often avoided due to their high computational cost. GPUs, as we show, have the potential to make them the method of choice, as the increased computational cost is also localized, e.g., cast as Level 3 BLAS, and thus can be done very efficiently (close to “free” relative to the usual overheads inherent in sparse computations). %B University of Tennessee Computer Science Technical Report %8 2013-07 %G eng %0 Journal Article %J IPDPS 2013 (submitted) %D 2013 %T Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures %A Ichitaro Yamazaki %A Dulceneia Becker %A Jack Dongarra %A Alex Druinsky %A I. Peled %A Sivan Toledo %A Grey Ballard %A James Demmel %A Oded Schwartz %X Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen’s algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms. 
%B IPDPS 2013 (submitted) %C Boston, MA %8 2013-00 %G eng %0 Generic %D 2013 %T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC %A Guillaume Aupy %A Mathieu Faverge %A Yves Robert %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high-level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures %B Lawn 277 %8 2013-05 %G eng %0 Conference Paper %B Supercomputing 2013 %D 2013 %T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware %A Azzam Haidar %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %B Supercomputing 2013 %C Denver, CO %8 2013-11 %G eng %0 Generic %D 2013 %T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware %A Azzam Haidar %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 283) %I University of Tennessee %8 2013-10 %G eng %0 Book Section %B Contemporary High Performance Computing: From Petascale Toward Exascale %D 2013 %T Keeneland: Computational Science Using Heterogeneous GPU Computing %A Jeffrey Vetter %A Richard Glassbrook %A 
Karsten Schwan %A Sudha Yalamanchili %A Mitch Horton %A Ada Gavrilovska %A Magda Slawinska %A Jack Dongarra %A Jeremy Meredith %A Philip Roth %A Kyle Spafford %A Stanimire Tomov %A John Wynkoop %X The Keeneland Project is a five year Track 2D grant awarded by the National Science Foundation (NSF) under solicitation NSF 08-573 in August 2009 for the development and deployment of an innovative high performance computing system. The Keeneland project is led by the Georgia Institute of Technology (Georgia Tech) in collaboration with the University of Tennessee at Knoxville, National Institute of Computational Sciences, and Oak Ridge National Laboratory. %B Contemporary High Performance Computing: From Petascale Toward Exascale %S CRC Computational Science Series %I Taylor and Francis %C Boca Raton, FL %G eng %& 7 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K Cluster %K Collective communication %K Hierarchical %K HPC %K MPI %K Multicore %X Multicore Clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel assisted mechanisms. However, on distributed environments, a single level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. 
This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted topology-aware collective framework, and the mechanisms deployed by this framework to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. We demonstrate experimentally, by considering three of the most used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to modifications of the underlying process-core binding; (2) it outperforms state-of-the-art MPI libraries (Open MPI, MPICH2 and MVAPICH2) demonstrating up to a 30x speedup for synthetic benchmarks, and up to a 3x acceleration for a parallel graph application (ASP); (3) it furthermore demonstrates a linear speedup with the increase of the number of cores per compute node, a paramount requirement for scalability on future many-core hardware. %B Journal of Parallel and Distributed Computing %V 73 %P 1000-1010 %8 2013-07 %G eng %U http://www.sciencedirect.com/science/article/pii/S0743731513000166 %N 7 %R 10.1016/j.jpdc.2013.01.015 %0 Book Section %B Handbook of Linear Algebra %D 2013 %T LAPACK %A Zhaojun Bai %A James Demmel %A Jack Dongarra %A Julien Langou %A Jenny Wang %X With a substantial amount of new material, the Handbook of Linear Algebra, Second Edition provides comprehensive coverage of linear algebra concepts, applications, and computational software packages in an easy-to-use format. It guides you from the very elementary aspects of the subject to the frontiers of current research. Along with revisions and updates throughout, the second edition of this bestseller includes 20 new chapters. 
%B Handbook of Linear Algebra %7 Second %I CRC Press %C Boca Raton, FL %@ 9781466507289 %G eng %0 Conference Proceedings %B International Supercomputing Conference (ISC) %D 2013 %T Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %A Raffaele Solcà %A Thomas C. Schulthess %X Today’s high computational demands from engineering fields and complex hardware development make it necessary to develop and optimize new algorithms toward achieving high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe and analyze a successful methodology to address the challenges—starting from our algorithm design, kernel optimization and tuning, to our programming model—in the development of a scalable high-performance generalized eigenvalue solver in the context of electronic structure calculations in materials science applications. We developed a set of leading edge dense linear algebra algorithms, as part of a generalized eigensolver, featuring fine grained memory aware kernels, a task based approach and hybrid execution/scheduling. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. We report the performance impact on the generalized eigensolver when different fractions of eigenvectors are needed. The algorithm described provides an enormous performance boost compared to current GPU-based solutions, and performance comparable to state-of-the-art distributed solutions, using a single node with multiple GPUs. 
%B International Supercomputing Conference (ISC) %7 Lecture Notes in Computer Science %I Springer Berlin Heidelberg %C Leipzig, Germany %V 7905 %P 67-80 %8 2013-06 %@ 978-3-642-38750-0 %G eng %R 10.1007/978-3-642-38750-0_6 %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2013 %T Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A José Herrero %A Julien Langou %X Four routines called DPOTF3i, i = a,b,c,d, are presented. DPOTF3i are a novel type of level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and LAPACK routine DPOTRF. The performance of the DPOTF3i routines is still increasing when the performance of the Level-2 routine DPOTF2 of LAPACK starts decreasing. This is our main result and it implies, due to the use of larger block size nb, that DGEMM, DSYRK, and DTRSM performance also increases! The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers. Thus, our four routines have different register blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of LAPACK POTRF source codes. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is “identical” to Square Block Packed Format (SBPF). “LAPACK” implementations on multicore processors use SBPF. Lower BPF is less efficient than upper BPF. Vector in-place transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb as well as results for large n comparing DBPTRF versus DPOTRF. 
%B ACM Transactions on Mathematical Software (TOMS) %V 39 %8 2013-02 %G eng %N 2 %R 10.1145/2427023.2427026 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Computing %D 2013 %T LU Factorization with Partial Pivoting for a Multicore System with Accelerators %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K accelerator %K Gaussian elimination %K gpu %K lu factorization %K manycore %K Multicore %K partial pivoting %X LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs. 
%B IEEE Transactions on Parallel and Distributed Computing %V 24 %P 1613-1621 %8 2013-08 %G eng %N 8 %& 1613 %R http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242 %0 Generic %D 2013 %T Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization %A Aurelien Bouteiller %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing per application completion time is unchanged, while it delivers near-perfect platform efficiency. %B University of Tennessee Computer Science Technical Report %8 2013-02 %G eng %0 Conference Paper %B Euro-Par 2013 %D 2013 %T Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization %A Aurelien Bouteiller %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. 
One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing per application completion time is unchanged, while it delivers near-perfect platform efficiency. %B Euro-Par 2013 %I Springer %C Aachen, Germany %8 2013-08 %G eng %0 Journal Article %J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %D 2013 %T Multithreading in the PLASMA Library %A Jakub Kurzak %A Piotr Luszczek %A Asim YarKhan %A Mathieu Faverge %A Julien Langou %A Henricus Bouwmeester %A Jack Dongarra %E Mohamed Ahmed %E Reda Ammar %E Sanguthevar Rajasekaran %B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %I Taylor & Francis %8 2013-00 %G eng %0 Generic %D 2013 %T Optimal Checkpointing Period: Time vs. Energy %A Guillaume Aupy %A Anne Benoit %A Thomas Herault %A Yves Robert %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 281) %I University of Tennessee %8 2013-10 %G eng %0 Conference Paper %B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013 %D 2013 %T Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance %A Yulu Jia %A George Bosilca %A Piotr Luszczek %A Jack Dongarra %X This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. 
We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases. %B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013 %C Denver, CO %8 2013-11 %G eng %0 Journal Article %J IEEE Computing in Science and Engineering %D 2013 %T PaRSEC: Exploiting Heterogeneity to Enhance Scalability %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Thomas Herault %A Jack Dongarra %X New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution environment with sufficient intelligence and flexibility to rearrange the execution to improve resource utilization. 
%B IEEE Computing in Science and Engineering %V 15 %P 36-45 %8 2013-11 %G eng %N 6 %R 10.1109/MCSE.2013.98 %0 Generic %D 2013 %T Performance of Various Computers Using Standard Linear Equations Software %A Jack Dongarra %X This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers. %B University of Tennessee Computer Science Technical Report %8 2013-02 %G eng %0 Conference Paper %B PPAM 2013 %D 2013 %T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %K magma %K mic %K xeon phi %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. 
Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA. %B PPAM 2013 %C Warsaw, Poland %8 2013-09 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2013 %T Post-failure recovery of MPI communication capability: Design and rationale %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery. %B International Journal of High Performance Computing Applications %V 27 %P 244 - 254 %8 2013-01 %G eng %U http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238 %N 3 %! 
International Journal of High Performance Computing Applications %R 10.1177/1094342013488238 %0 Conference Paper %B 15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium %D 2013 %T Revisiting the Double Checkpointing Algorithm %A Jack Dongarra %A Thomas Herault %A Yves Robert %X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [1], with the non-blocking algorithm of Ni, Meneses and Kalé [2] in terms of both performance and risk. We also extend the model proposed in [1], [2] to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations. %B 15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium %C Boston, MA %8 2013-05 %G eng %0 Generic %D 2013 %T Revisiting the Double Checkpointing Algorithm %A Jack Dongarra %A Thomas Herault %A Yves Robert %K checkpoint algorithm %K communication overlap %K fault-tolerance %K performance model %K resilience %X Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé, with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. 
We also extend the model that they have proposed to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations. %B University of Tennessee Computer Science Technical Report (LAWN 274) %8 2013-01 %G eng %0 Book Section %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %D 2013 %T Scalable Dense Linear Algebra on Heterogeneous Hardware %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. It is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology of dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. 
Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores. %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %G eng %0 Journal Article %J Journal of Computational Science %D 2013 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K gpgpu %K gpu %K magma %X General purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern than in the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs. 
%B Journal of Computational Science %V 4 %P 457–464 %8 2013-11 %G eng %N 6 %R http://dx.doi.org/10.1016/j.jocs.2013.01.004 %0 Conference Paper %B 17th IEEE High Performance Extreme Computing Conference (HPEC '13) %D 2013 %T Standards for Graph Algorithm Primitives %A Tim Mattson %A David Bader %A Jon Berry %A Aydin Buluc %A Jack Dongarra %A Christos Faloutsos %A John Feo %A John Gilbert %A Joseph Gonzalez %A Bruce Hendrickson %A Jeremy Kepner %A Charles Lieserson %A Andrew Lumsdaine %A David Padua %A Steve W. Poole %A Steve Reinhardt %A Mike Stonebraker %A Steve Wallach %A Andrew Yoo %K algorithms %K graphs %K linear algebra %K software standards %X It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard. %B 17th IEEE High Performance Extreme Computing Conference (HPEC '13) %I IEEE %C Waltham, MA %8 2013-09 %G eng %R 10.1109/HPEC.2013.6670338 %0 Generic %D 2013 %T Toward a New Metric for Ranking High Performance Computing Systems %A Michael A. Heroux %A Jack Dongarra %X The High Performance Linpack (HPL), or Top 500, benchmark is the most widely recognized and discussed metric for ranking high performance computing systems. However, HPL is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. In this paper we describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns more commonly found in applications. Using HPCG we strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement. 
%B SAND2013 - 4744 %8 2013-06 %G eng %U http://www.netlib.org/utk/people/JackDongarra/PAPERS/HPCG-Benchmark-utk.pdf %0 Conference Paper %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %D 2013 %T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication %A Azzam Haidar %A Mark Gates %A Stanimire Tomov %A Jack Dongarra %E Allen D. Malony %E Nemirovsky, Mario %E Midkiff, Sam %K eigenvalue %K gpu communication %K gpu computation %K heterogeneous programming model %K performance %K reduction to tridiagonal %K singular value decomposition %K task parallelism %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology for addressing the challenges (starting from our algorithm design, kernel optimization and tuning, to our programming model) in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores. 
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %I ACM Press %C Eugene, Oregon, USA %8 2013-06 %@ 9781450321303 %G eng %U http://dl.acm.org/citation.cfm?doid=2464996.2465438 %R 10.1145/2464996.2465438 %0 Generic %D 2013 %T Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures %A Yulu Jia %A Piotr Luszczek %A Jack Dongarra %X Graphics Processing Units (GPUs) are gaining widespread usage in the field of scientific computing owing to the performance boost GPUs bring to computation-intensive applications. The typical configuration is to integrate GPUs and CPUs in the same system, where the CPUs handle the control flow and part of the computation workload, and the GPUs serve as accelerators that carry out the bulk of the data-parallel compute workload. In this paper we design and implement a soft error resilient Hessenberg reduction algorithm on GPU-based hybrid platforms. Our design employs algorithm-based fault tolerance techniques, diskless checkpointing, and reverse computation. We detect and correct soft errors on-line without delaying the detection and correction to the end of the factorization. By utilizing idle time of the CPUs and overlapping both host-side and GPU-side workloads we minimize the observed overhead. Experimental results validate our design philosophy. Our algorithm introduces less than 2% performance overhead compared to the non-fault tolerant hybrid Hessenberg reduction algorithm. %B UT-CS-13-712 %I University of Tennessee Computer Science Technical Report %8 2013-06 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems %A Ichitaro Yamazaki %A Tingxing Dong %A Raffaele Solcà %A Stanimire Tomov %A Jack Dongarra %A Thomas C. 
Schulthess %X For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher levels of the software stack, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of emerging computers and the effects of various optimization techniques. 
%B Concurrency and Computation: Practice and Experience %8 2013-10 %G eng %0 Conference Paper %B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %D 2013 %T Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster %A Ichitaro Yamazaki %A Tingxing Dong %A Stanimire Tomov %A Jack Dongarra %B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %8 2013-05 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale %A George Bosilca %A Aurelien Bouteiller %A Elisabeth Brunet %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. 
%B Concurrency and Computation: Practice and Experience %8 2013-11 %G eng %R 10.1002/cpe.3173 %0 Conference Paper %B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013) %D 2013 %T Virtual Systolic Array for QR Decomposition %A Jakub Kurzak %A Piotr Luszczek %A Mark Gates %A Ichitaro Yamazaki %A Jack Dongarra %K dataflow programming %K message passing %K multi-core %K QR decomposition %K roofline model %K systolic array %X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. A systolic array for the QR decomposition is developed, and a virtualization layer is used to map the algorithm to a large distributed-memory system. Strong scaling properties superior to those of existing solutions are observed. %B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013) %I IEEE %C Boston, MA %8 2013-05 %G eng %R 10.1109/IPDPS.2013.119 %0 Generic %D 2012 %T Acceleration of the BLAST Hydro Code on GPU %A Tingxing Dong %A Tzanio Kolev %A Robert Rieben %A Veselin Dobrev %A Stanimire Tomov %A Jack Dongarra %B Supercomputing '12 (poster) %I SC12 %C Salt Lake City, Utah %8 2012-11 %G eng %0 Conference Proceedings %B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012 %D 2012 %T Algorithm-Based Fault Tolerance for Dense Matrix Factorization %A Peng Du %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %E J. Ramanujam %E P. 
Sadayappan %K ft-la %K ftmpi %X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases with the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. 
%B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012 %I ACM %C New Orleans, LA, USA %P 225-234 %8 2012-02 %G eng %R 10.1145/2145816.2145845 %0 Generic %D 2012 %T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties %A Simplice Donfack %A Jack Dongarra %A Mathieu Faverge %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed. %B University of Tennessee Computer Science Technical Report %8 2013-07 %G eng %0 Conference Proceedings %B 2012 IEEE High Performance Extreme Computing Conference %D 2012 %T Anatomy of a Globally Recursive Embedded LINPACK Benchmark %A Piotr Luszczek %A Jack Dongarra %X We present a complete bottom-up implementation of an embedded LINPACK benchmark on iPad 2. We use a novel formulation of a recursive LU factorization that is recursive and parallel at the global scope. 
We believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show an assembly API that allows a much higher level of abstraction and provides rapid code development within the confines of the mobile device SDK. We use performance modeling to help with the limitations of the device and the limited access to the device from a development environment not geared for HPC application tuning. %B 2012 IEEE High Performance Extreme Computing Conference %C Waltham, MA %P 1-6 %8 2012-09 %@ 978-1-4673-1577-7 %G eng %R 10.1109/HPEC.2012.6408679 %0 Journal Article %J ICCS 2012 %D 2012 %T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Mark Gates %A Jack Dongarra %A Vincent Heuveline %B ICCS 2012 %C Omaha, NE %8 2012-06 %G eng %0 Conference Proceedings %B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award) %D 2012 %T A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Christos Kaklamanis %E Theodore Papatheodorou %E Paul Spirakis %B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award) %I Springer-Verlag %C Rhodes, Greece %8 2012-08 %G eng %0 Conference Proceedings %B Proc. of the International Conference on Computational Science (ICCS) %D 2012 %T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines %A Marc Baboulin %A Simplice Donfack %A Jack Dongarra %A Laura Grigori %A Adrien Remi %A Stanimire Tomov %K magma %B Proc. 
of the International Conference on Computational Science (ICCS) %V 9 %P 17-26 %8 2012-06 %G eng %0 Journal Article %J IPDPS 2012 %D 2012 %T A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction %A Azzam Haidar %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %B IPDPS 2012 %C Shanghai, China %8 2012-05 %G eng %0 Journal Article %J Parallel Computing %D 2012 %T DAGuE: A generic distributed DAG Engine for High Performance Computing. %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %E Torsten Hoefler %K dague %K parsec %B Parallel Computing %I Elsevier %V 38 %P 27-51 %8 2012-00 %G eng %0 Journal Article %J High Performance Scientific Computing: Algorithms and Applications %D 2012 %T Dense Linear Algebra on Accelerated Multicore Hardware %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %E Michael Berry %E et al., %B High Performance Scientific Computing: Algorithms and Applications %I Springer-Verlag %C London, UK %8 2012-00 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2012 %T Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems %A Christof Voemel %A Stanimire Tomov %A Jack Dongarra %K magma %B SIAM Journal on Scientific Computing %V 34(2) %P C70-C82 %8 2012-04 %G eng %0 Generic %D 2012 %T An efficient distributed randomized solver with application to large dense linear systems %A Marc Baboulin %A Dulceneia Becker %A George Bosilca %A Anthony Danalis %A Jack Dongarra %K dague %K dplasma %K parsec %B ICL Technical Report %8 2012-07 %G eng %0 Conference Proceedings %B 26th ACM International Conference on Supercomputing (ICS 2012) %D 2012 %T Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems %A Fengguang Song %A Stanimire Tomov %A Jack Dongarra %K magma %B 26th ACM International Conference on Supercomputing (ICS 2012) %I ACM %C San Servolo Island, 
Venice, Italy %8 2012-06 %G eng %0 Conference Proceedings %B The 2nd International Conference on Cloud and Green Computing (submitted) %D 2012 %T Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture %A Jack Dongarra %A Hatem Ltaeif %A Piotr Luszczek %A Vincent M Weaver %B The 2nd International Conference on Cloud and Green Computing (submitted) %C Xiangtan, Hunan, China %8 2012-11 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2012 %T Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %B Lecture Notes in Computer Science %V 7203 %P 661-670 %8 2012-09 %G eng %0 Conference Proceedings %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %D 2012 %T An Evaluation of User-Level Failure Mitigation Support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %I Springer %C Vienna, Austria %8 2012-09 %G eng %0 Generic %D 2012 %T Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %K ftmpi %B University of Tennessee Computer Science Technical Report %8 2012-00 %G eng %0 Journal Article %J Parallel Computing %D 2012 %T From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming %A Peng Du %A Rick Weber %A Piotr Luszczek %A Stanimire Tomov %A Gregory D. 
Peterson %A Jack Dongarra %B Parallel Computing %V 38 %P 391-407 %8 2012-08 %G eng %0 Conference Paper %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %D 2012 %T From Serial Loops to Parallel Execution on Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jack Dongarra %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %C Rhodes, Greece %8 2012-08 %G eng %0 Generic %D 2012 %T The Future of Computing: Software Libraries %A Stanimire Tomov %A Jack Dongarra %I DOD CREATE Developers' Review, Keynote Presentation %C Savannah, GA %8 2012-02 %G eng %0 Journal Article %J EuroPar 2012 (also LAWN 260) %D 2012 %T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %A Vincent Heuveline %B EuroPar 2012 (also LAWN 260) %C Rhodes Island, Greece %8 2012-08 %G eng %0 Conference Proceedings %B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium %D 2012 %T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems %A Jack Dongarra %A Mathieu Faverge %A Thomas Herault %A Julien Langou %A Yves Robert %B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium %I IEEE Computer Society Press %C Shanghai, China %8 2012-05 %G eng %0 Journal Article %J IPDPS 2012 (Best Paper) %D 2012 %T HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %B IPDPS 2012 (Best Paper) %C Shanghai, China %8 2012-05 %G eng %0 Journal Article %J Acta Numerica %D 2012 %T High Performance Computing Systems: Status and Outlook %A Jack Dongarra %A Aad J. 
van der Steen %B Acta Numerica %I Cambridge University Press %C Cambridge, UK %V 21 %P 379-474 %8 2012-05 %G eng %0 Journal Article %J ICCS 2012 %D 2012 %T High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors %A Peng Du %A Piotr Luszczek %A Jack Dongarra %B ICCS 2012 %C Omaha, NE %8 2012-06 %G eng %0 Journal Article %J On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear) %D 2012 %T HPC Challenge: Design, History, and Implementation Highlights %A Jack Dongarra %A Piotr Luszczek %E Jeffrey Vetter %B On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear) %I Chapman & Hall/CRC Press %8 2012-00 %G eng %0 Journal Article %J Applied Parallel and Scientific Computing %D 2012 %T An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs %A Jakub Kurzak %A Rajib Nath %A Peng Du %A Jack Dongarra %E Kristján Jónasson %B Applied Parallel and Scientific Computing %V 7133 %P 248-257 %8 2012-00 %G eng %0 Journal Article %J Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear) %D 2012 %T Looking Back at Dense Linear Algebra Software %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %E Viktor K. 
Prasanna %E Yves Robert %E Per Stenström %B Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear) %8 2012-00 %G eng %0 Generic %D 2012 %T MAGMA: A Breakthrough in Solvers for Eigenvalue Problems %A Stanimire Tomov %A Jack Dongarra %A Azzam Haidar %A Ichitaro Yamazaki %A Tingxing Dong %A Thomas Schulthess %A Raffaele Solcà %I GPU Technology Conference (GTC12), Presentation %C San Jose, CA %8 2012-05 %G eng %0 Generic %D 2012 %T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures %A Jack Dongarra %A Tingxing Dong %A Mark Gates %A Azzam Haidar %A Stanimire Tomov %A Ichitaro Yamazaki %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation %C Salt Lake City, UT %8 2012-11 %G eng %0 Generic %D 2012 %T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors %A Jack Dongarra %A Mark Gates %A Yulu Jia %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12) %C Salt Lake City, UT %8 2012-11 %G eng %0 Journal Article %J Supercomputing '12 (poster) %D 2012 %T Matrices Over Runtime Systems at Exascale %A Emmanuel Agullo %A George Bosilca %A Cedric Castagnède %A Jack Dongarra %A Hatem Ltaeif %A Stanimire Tomov %B Supercomputing '12 (poster) %C Salt Lake City, Utah %8 2012-11 %G eng %0 Journal Article %J Supercomputing '12 (poster) %D 2012 %T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks %A Raffaele Solcà %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %A Thomas C. 
Schulthess %B Supercomputing '12 (poster) %C Salt Lake City, Utah %8 2012-11 %G eng %0 Conference Proceedings %B The International Conference on Computational Science (ICCS) %D 2012 %T One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %K magma %B The International Conference on Computational Science (ICCS) %8 2012-06 %G eng %0 Journal Article %J VECPAR 2012 %D 2012 %T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators %A Ahmad Abdelfattah %A Jack Dongarra %A David Keyes %A Hatem Ltaeif %B VECPAR 2012 %C Kobe, Japan %8 2012-07 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2012 %T Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011 %E Roman Wyrzykowski %E Jack Dongarra %E Konrad Karczewski %E Jerzy Wasniewski %B Lecture Notes in Computer Science %C Torun, Poland %V 7203 %8 2012-00 %G eng %0 Journal Article %J IPDPS 2012 %D 2012 %T A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures %A Marc Baboulin %A Dulceneia Becker %A Jack Dongarra %B IPDPS 2012 %C Shanghai, China %8 2012-05 %G eng %0 Generic %D 2012 %T Performance evaluation of LU factorization through hardware counter measurements %A Simplice Donfack %A Stanimire Tomov %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2012-10 %G eng %0 Conference Proceedings %B Third International Conference on Energy-Aware High Performance Computing %D 2012 %T Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems %A George Bosilca %A Jack Dongarra %A Hatem Ltaeif %B Third International Conference on Energy-Aware High Performance Computing %C Hamburg, Germany %8 2012-09 %G eng %0 Journal Article %J LAWN 267 %D 2012 %T Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B LAWN 267 
%8 2012-00 %G eng %0 Conference Proceedings %B Proceedings of VECPAR’12 %D 2012 %T Programming the LU Factorization for a Multicore System with Accelerators %A Jakub Kurzak %A Piotr Luszczek %A Mathieu Faverge %A Jack Dongarra %K plasma %K quark %B Proceedings of VECPAR’12 %C Kobe, Japan %8 2012-04 %G eng %0 Generic %D 2012 %T A Proposal for User-Level Failure Mitigation in the MPI-3 Standard %A Wesley Bland %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %K ftmpi %B University of Tennessee Electrical Engineering and Computer Science Technical Report %I University of Tennessee %8 2012-02 %G eng %0 Generic %D 2012 %T Providing GPU Capability to LU and QR within the ScaLAPACK Framework %A Peng Du %A Stanimire Tomov %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 272) %8 2012-09 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2012 %T Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012 %E Jesper Larsson Träff %E Siegfried Benkner %E Jack Dongarra %B Lecture Notes in Computer Science %C Vienna, Austria %V 7490 %8 2012-00 %G eng %0 Journal Article %J Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011) %D 2012 %T Reducing the Amount of Pivoting in Symmetric Indefinite Systems %A Dulceneia Becker %A Marc Baboulin %A Jack Dongarra %E Roman Wyrzykowski %E Jack Dongarra %E Konrad Karczewski %E Jerzy Wasniewski %B Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011) %I Springer-Verlag Berlin Heidelberg %V 7203 %P 133-142 %8 2012-00 %G eng %0 Conference Proceedings %B The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012) %D 2012 %T A Scalable Framework for Heterogeneous GPU-Based Clusters %A Fengguang Song %A Jack Dongarra %K magma %B The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012) %I ACM %C Pittsburgh, 
PA, USA %8 2012-06 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing (Accepted) %D 2012 %T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices %A Azzam Haidar %A Hatem Ltaeif %A Jack Dongarra %B SIAM Journal on Scientific Computing (Accepted) %8 2012-07 %G eng %0 Generic %D 2012 %T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale %A George Bosilca %A Aurelien Bouteiller %A Elisabeth Brunet %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %B University of Tennessee Computer Science Technical Report (also LAWN 269) %8 2012-06 %G eng %0 Conference Proceedings %B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper) %D 2012 %T Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper) %C Rhodes Island, Greece %8 2012-08 %G eng %0 Journal Article %J SIAM Journal on Computing (submitted) %D 2012 %T Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems %A Hartwig Anzt %A Jack Dongarra %A Vincent Heuveline %B SIAM Journal on Computing (submitted) %8 2012-03 %G eng %0 Journal Article %J INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11) %D 2011 %T Accelerating Linear System Solutions Using Randomization Techniques %A Marc Baboulin %A Jack Dongarra %A Julien Herrmann %A Stanimire Tomov %K magma %B INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11) %C Waterloo, Ontario, Canada %8 2011-07 %G eng %0 Generic %D 2011 %T Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %K plasma %K quark %B University of Tennessee 
Computer Science Technical Report (also as a LAWN) %8 2011-09 %G eng %0 Generic %D 2011 %T Algorithm-based Fault Tolerance for Dense Matrix Factorizations %A Peng Du %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-08 %G eng %0 Generic %D 2011 %T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures %A Azzam Haidar %A Hatem Ltaeif %A Asim YarKhan %A Jack Dongarra %K plasma %K quark %B University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243) %8 2011-00 %G eng %0 Generic %D 2011 %T Autotuning GEMMs for Fermi %A Jakub Kurzak %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245) %8 2011-04 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results %A Anthony Danalis %A Piotr Luszczek %A Gabriel Marin %A Jeffrey Vetter %A Jack Dongarra %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Journal Article %D 2011 %T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Mark Gates %A Jack Dongarra %A Vincent Heuveline %K magma %8 2011-12 %G eng %0 Generic %D 2011 %T A Block-Asynchronous Relaxation Method for Graphics Processing Units %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %K magma %B University of Tennessee Computer Science Technical Report %8 2011-11 %G eng %0 Journal Article %J in Solving the Schrodinger Equation: Has everything been tried? 
(to appear) %D 2011 %T Changes in Dense Linear Algebra Kernels - Decades Long Perspective %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %E P. Popular %B in Solving the Schrodinger Equation: Has everything been tried? (to appear) %I Imperial College Press %8 2011-00 %G eng %0 Conference Proceedings %B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11) %D 2011 %T A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures %A Mitch Horton %A Stanimire Tomov %A Jack Dongarra %K magma %K quark %B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11) %C Knoxville, TN %8 2011-07 %G eng %0 Conference Proceedings %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %D 2011 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Emmanuel Jeannot %E Raymond Namyst %E Jean Roman %K ftmpi %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %I Springer %C Bordeaux, France %V 6853 %P 51-64 %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T DAGuE: A Generic Distributed DAG Engine for High Performance Computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1151-1158 %8 2011-00 %G eng %0 Generic %D 2011 %T Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures %A Fengguang Song %A Stanimire Tomov %A Jack Dongarra %K magma %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250) %8 2011-06 %G eng %0 
Conference Proceedings %B 6th Workshop on Virtualization in High-Performance Cloud Computing %D 2011 %T Evaluation of the HPC Challenge Benchmarks in Virtualized Environments %A Piotr Luszczek %A Eric Meek %A Shirley Moore %A Dan Terpstra %A Vincent M Weaver %A Jack Dongarra %K hpcc %B 6th Workshop on Virtualization in High-Performance Cloud Computing %C Bordeaux, France %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of PARCO'11 %D 2011 %T Exploiting Fine-Grain Parallelism in Recursive LU Factorization %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %K plasma %B Proceedings of PARCO'11 %C Gent, Belgium %8 2011-04 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1432-1441 %8 2011-05 %G eng %0 Generic %D 2011 %T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %A Vincent Heuveline %K magma %B University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260) %8 2011-12 %G eng %0 Generic %D 2011 %T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems %A Jack Dongarra %A Mathieu Faverge %A Thomas Herault %A Julien Langou %A Yves Robert %K magma %K plasma %B University of Tennessee Computer Science Technical Report (also Lawn 257) %8 2011-10 %G eng %0 
Generic %D 2011 %T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247) %8 2011-05 %G eng %0 Journal Article %J IEEE Cluster 2011 %D 2011 %T High Performance Dense Linear System Solver with Soft Error Resilience %A Peng Du %A Piotr Luszczek %A Jack Dongarra %K ft-la %B IEEE Cluster 2011 %C Austin, TX %8 2011-09 %G eng %0 Conference Proceedings %B Proceedings of MTAGS11 %D 2011 %T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %B Proceedings of MTAGS11 %C Seattle, WA %8 2011-11 %G eng %0 Journal Article %J in GPU Computing Gems, Jade Edition %D 2011 %T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %E Wen-mei W. Hwu %K magma %K morse %B in GPU Computing Gems, Jade Edition %I Elsevier %V 2 %P 473-484 %8 2011-00 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. 
Nikolopoulos %E Jack Dongarra %K dague %B 18th EuroMPI %I Springer %C Santorini, Greece %P 247-254 %8 2011-09 %G eng %0 Journal Article %J International Journal of High Performance Computing %D 2011 %T The International Exascale Software Project Roadmap %A Jack Dongarra %A Pete Beckman %A Terry Moore %A Patrick Aerts %A Giovanni Aloisio %A Jean-Claude Andre %A David Barkai %A Jean-Yves Berthou %A Taisuke Boku %A Bertrand Braunschweig %A Franck Cappello %A Barbara Chapman %A Xuebin Chi %A Alok Choudhary %A Sudip Dosanjh %A Thom Dunning %A Sandro Fiore %A Al Geist %A Bill Gropp %A Robert Harrison %A Mark Hereld %A Michael Heroux %A Adolfy Hoisie %A Koh Hotta %A Zhong Jin %A Yutaka Ishikawa %A Fred Johnson %A Sanjay Kale %A Richard Kenway %A David Keyes %A Bill Kramer %A Jesus Labarta %A Alain Lichnewsky %A Thomas Lippert %A Bob Lucas %A Barney MacCabe %A Satoshi Matsuoka %A Paul Messina %A Peter Michielse %A Bernd Mohr %A Matthias S. Mueller %A Wolfgang E. Nagel %A Hiroshi Nakashima %A Michael E. Papka %A Dan Reed %A Mitsuhisa Sato %A Ed Seidel %A John Shalf %A David Skinner %A Marc Snir %A Thomas Sterling %A Rick Stevens %A Fred Streitz %A Bob Sugar %A Shinji Sumimoto %A William Tang %A John Taylor %A Rajeev Thakur %A Anne Trefethen %A Mateo Valero %A Aad van der Steen %A Jeffrey Vetter %A Peg Williams %A Robert Wisniewski %A Kathy Yelick %X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. 
However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project. %B International Journal of High Performance Computing %V 25 %P 3-60 %8 2011-01 %G eng %R https://doi.org/10.1177/1094342010391989 %0 Conference Proceedings %B Int'l Conference on Parallel Processing (ICPP '11) %D 2011 %T Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. 
Squyres %A Jack Dongarra %B Int'l Conference on Parallel Processing (ICPP '11) %C Taipei, Taiwan %8 2011-09 %G eng %0 Journal Article %J IEEE/ACS AICCSA 2011 %D 2011 %T LU Factorization for Accelerator-Based Systems %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Julien Langou %A Hatem Ltaeif %A Stanimire Tomov %K magma %K morse %B IEEE/ACS AICCSA 2011 %C Sharm-El-Sheikh, Egypt %8 2011-12 %G eng %0 Generic %D 2011 %T MAGMA - LAPACK for HPC on Heterogeneous Architectures %A Stanimire Tomov %A Jack Dongarra %I Titan Summit at Oak Ridge National Laboratory, Presentation %C Oak Ridge, TN %8 2011-08 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T OMPIO: A Modular Software Architecture for MPI I/O %A Mohamad Chaarawi %A Edgar Gabriel %A Rainer Keller %A Richard L. Graham %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. Nikolopoulos %E Jack Dongarra %B 18th EuroMPI %I Springer %C Santorini, Greece %P 81-89 %8 2011-09 %G eng %0 Conference Proceedings %B ACM/IEEE Conference on Supercomputing (SC’11) %D 2011 %T Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs %A Rajib Nath %A Stanimire Tomov %A Tingxing Dong %A Jack Dongarra %K magma %B ACM/IEEE Conference on Supercomputing (SC’11) %C Seattle, WA %8 2011-11 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T Overlapping Computation and Communication for Advection on a Hybrid Parallel Computer %A James B White %A Jack Dongarra %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Generic %D 2011 %T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels %A Azzam Haidar %A Hatem Ltaeif %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-11-677, (also Lawn254) %8 2011-08 %G eng %0 Conference Proceedings 
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11) %D 2011 %T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels %A Azzam Haidar %A Hatem Ltaeif %A Jack Dongarra %K plasma %K quark %B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11) %C Seattle, WA %8 2011-11 %G eng %0 Generic %D 2011 %T A parallel tiled solver for dense symmetric indefinite systems on multicore architectures %A Marc Baboulin %A Dulceneia Becker %A Jack Dongarra %K plasma %K quark %B University of Tennessee Computer Science Technical Report %8 2011-10 %G eng %0 Generic %D 2011 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2011-00 %G eng %0 Journal Article %J IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %D 2011 %T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %K magma %K parsec %B IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %8 2011-06 %G eng %0 Conference Proceedings %B IEEE Int'l Conference on Cluster Computing (Cluster 2011) %D 2011 %T Process Distance-aware Adaptive MPI Collective Communications %A Teng Ma %A Thomas Herault %A George Bosilca %A Jack Dongarra %B IEEE Int'l Conference on Cluster Computing (Cluster 2011) %C Austin, Texas %8 2011-00 %G eng %0 Conference Proceedings %B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011) %D 2011 %T Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency %A Hatem Ltaeif 
%A Piotr Luszczek %A Jack Dongarra %K mumi %B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011) %C Hamburg, Germany %8 2011-09 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2011 %T QCG-OMPI: MPI Applications on Grids. %A Emmanuel Agullo %A Camille Coti %A Thomas Herault %A Julien Langou %A Sylvain Peyronnet %A A. Rezmerita %A Franck Cappello %A Jack Dongarra %B Future Generation Computer Systems %V 27 %P 357-369 %8 2011-01 %G eng %0 Generic %D 2011 %T QUARK Users' Guide: QUeueing And Runtime for Kernels %A Asim YarKhan %A Jakub Kurzak %A Jack Dongarra %K magma %K plasma %K quark %B University of Tennessee Innovative Computing Laboratory Technical Report %8 2011-00 %G eng %0 Generic %D 2011 %T Reducing the Amount of Pivoting in Symmetric Indefinite Systems %A Dulceneia Becker %A Marc Baboulin %A Jack Dongarra %B University of Tennessee Innovative Computing Laboratory Technical Report %I Submitted to PPAM 2011 %C Knoxville, TN %8 2011-05 %G eng %0 Conference Proceedings %B International Conference on Cluster Computing (CLUSTER) %D 2011 %T On Scalability for MPI Runtime Systems %A George Bosilca %A Thomas Herault %A A. Rezmerita %A Jack Dongarra %K harness %B International Conference on Cluster Computing (CLUSTER) %I IEEE %C Austin, TX, USA %P 187-195 %8 2011-09 %G eng %0 Generic %D 2011 %T On Scalability for MPI Runtime Systems %A George Bosilca %A Thomas Herault %A A. Rezmerita %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-05 %G eng %0 Conference Proceedings %B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011 %D 2011 %T Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure %A George Bosilca %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %A A. Rezmerita %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. 
Nikolopoulos %E Jack Dongarra %K ftmpi %B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011 %I Springer %C Santorini, Greece %V 6960 %P 342-344 %8 2011-09 %G eng %0 Generic %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-07 %G eng %0 Journal Article %J UT-CS-11-675 (also LAPACK Working Note #252) %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K magma %B UT-CS-11-675 (also LAPACK Working Note #252) %8 2011-07 %G eng %0 Journal Article %J Journal of Computational Science %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K ft-la %B Journal of Computational Science %I Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11 %C Seattle, WA %8 2011-11 %G eng %0 Journal Article %J Submitted to SIAM Journal on Scientific Computing (SISC) %D 2011 %T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices. 
%A Azzam Haidar %A Hatem Ltaeif %A Jack Dongarra %B Submitted to SIAM Journal on Scientific Computing (SISC) %8 2011-00 %G eng %0 Generic %D 2011 %T Towards a Parallel Tile LDL Factorization for Multicore Architectures %A Dulceneia Becker %A Mathieu Faverge %A Jack Dongarra %K plasma %K quark %B ICL Technical Report %C Seattle, WA %8 2011-04 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures %A Piotr Luszczek %A Hatem Ltaeif %A Jack Dongarra %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Conference Proceedings %B PPAM 2009 Proceedings %D 2010 %T 8th International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (LNCS) %E Roman Wyrzykowski %E Jack Dongarra %E Konrad Karczewski %E Jerzy Wasniewski %B PPAM 2009 Proceedings %I Springer %C Wroclaw, Poland %V 6067 %8 2010-09 %G eng %0 Journal Article %J Proc. of VECPAR'10 %D 2010 %T Accelerating GPU Kernels for Dense Linear Algebra %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B Proc. 
of VECPAR'10 %C Berkeley, CA %8 2010-06 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing %A Stanimire Tomov %A Rajib Nath %A Jack Dongarra %K magma %B Parallel Computing %V 36 %P 645-654 %8 2010-00 %G eng %0 Journal Article %J Submitted to Concurrency and Computations: Practice and Experience %D 2010 %T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures %A Azzam Haidar %A Hatem Ltaeif %A Asim YarKhan %A Jack Dongarra %K plasma %K quark %B Submitted to Concurrency and Computations: Practice and Experience %8 2010-11 %G eng %0 Generic %D 2010 %T Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess %A Piotr Luszczek %A Jack Dongarra %K hpcc %B Innovative Computing Laboratory (ICL) Technical Report %8 2010-06 %G eng %0 Generic %D 2010 %T Autotuning Dense Linear Algebra Libraries on GPUs %A Rajib Nath %A Stanimire Tomov %A Emmanuel Agullo %A Jack Dongarra %I Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010) %C Basel, Switzerland %8 2010-06 %G eng %0 Book Section %B Scientific Computing with Multicore and Accelerators %D 2010 %T Blas for GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %B Scientific Computing with Multicore and Accelerators %S Chapman & Hall/CRC Computational Science %I CRC Press %C Boca Raton, Florida %@ 9781439825365 %G eng %& 4 %0 Conference Proceedings %B 3rd Workshop on Functionality of Hardware Performance Monitoring %D 2010 %T Can Hardware Performance Counters Produce Expected, Deterministic Results? 
%A Vincent M Weaver %A Jack Dongarra %K papi %B 3rd Workshop on Functionality of Hardware Performance Monitoring %C Atlanta, GA %8 2010-12 %G eng %0 Journal Article %J Parallel Computing (to appear) %D 2010 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %B Parallel Computing (to appear) %8 2010-00 %G eng %0 Journal Article %J Tools for High Performance Computing 2009 %D 2010 %T Collecting Performance Data with PAPI-C %A Dan Terpstra %A Heike Jagode %A Haihang You %A Jack Dongarra %K mumi %K papi %X Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface. 
%B Tools for High Performance Computing 2009 %I Springer Berlin / Heidelberg %C 3rd Parallel Tools Workshop, Dresden, Germany %P 157-173 %8 2010-05 %G eng %R https://doi.org/10.1007/978-3-642-11261-4_11 %0 Journal Article %J Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale %D 2010 %T Constructing Resiliant Communication Infrastructure for Runtime Environments in Advances in Parallel Computing %A George Bosilca %A Camille Coti %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %E Barbara Chapman %E Frederic Desprez %E Gerhard R. Joubert %E Alain Lichnewsky %E Frans Peters %E T. Priol %B Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale %V 19 %P 441-451 %G eng %R 10.3233/978-1-60750-530-3-441 %0 Generic %D 2010 %T DAGuE: A generic distributed DAG engine for high performance computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %B Innovative Computing Laboratory Technical Report %8 2010-04 %G eng %0 Book Section %B Scientific Computing with Multicore and Accelerators %D 2010 %T Dense Linear Algebra for Hybrid GPU-based Systems %A Stanimire Tomov %A Jack Dongarra %B Scientific Computing with Multicore and Accelerators %S Chapman & Hall/CRC Computational Science %I CRC Press %C Boca Raton, Florida %@ 9781439825365 %G eng %& 3 %0 Conference Proceedings %B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on %D 2010 %T Dense Linear Algebra Solvers for Multicore with GPU Accelerators %A Stanimire Tomov %A Rajib Nath %A Hatem Ltaeif %A Jack Dongarra %X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. 
We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library. 
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on %C Atlanta, GA %P 1-8 %G eng %R 10.1109/IPDPSW.2010.5470941 %0 Generic %D 2010 %T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-10-660 %8 2010-09 %G eng %0 Generic %D 2010 %T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K plasma %B Innovative Computing Laboratory Technical Report %8 2010-00 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing (submitted) %D 2010 %T Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems %A Christof Voemel %A Stanimire Tomov %A Jack Dongarra %K magma %B SIAM Journal on Scientific Computing (submitted) %8 2010-08 %G eng %0 Conference Proceedings %B Proceedings of EuroMPI 2010 %D 2010 %T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %E Jack Dongarra %E Michael Resch %E Rainer Keller %E Edgar Gabriel %K ftmpi %B Proceedings of EuroMPI 2010 %I Springer %C Stuttgart, Germany %8 2010-09 %G eng %0 Journal Article %J in Performance Tuning of Scientific Applications (to appear) %D 2010 %T Empirical Performance Tuning of Dense Linear Algebra Software %A Jack Dongarra %A Shirley Moore %E David Bailey %E Robert Lucas %E Sam 
Williams %B in Performance Tuning of Scientific Applications (to appear) %8 2010-00 %G eng %0 Generic %D 2010 %T EZTrace: a generic framework for performance analysis %A Jack Dongarra %A Mathieu Faverge %A Yutaka Ishikawa %A Raymond Namyst %A François Rue %A Francois Trahay %B ICL Technical Report %8 2010-12 %G eng %0 Generic %D 2010 %T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %B LAPACK Working Note %8 2010-00 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems (submitted) %D 2010 %T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators %A Hatem Ltaeif %A Stanimire Tomov %A Rajib Nath %A Jack Dongarra %K magma %K plasma %B IEEE Transactions on Parallel and Distributed Systems (submitted) %8 2010-03 %G eng %0 Generic %D 2010 %T An Improved MAGMA GEMM for Fermi GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report %8 2010-07 %G eng %0 Journal Article %J International Journal of High Performance Computing %D 2010 %T An Improved MAGMA GEMM for Fermi GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B International Journal of High Performance Computing %V 24 %P 511-515 %8 2010-00 %G eng %0 Conference Proceedings %B Proceedings of International Conference on Computational Science, ICCS 2010 (to appear) %D 2010 %T Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI %A Volodymyr Turchenko %A Lucio Grandinetti %A George Bosilca %A Jack Dongarra %K hpcchallenge %B Proceedings of International Conference on Computational Science, ICCS 2010 (to appear) %I Elsevier %C Amsterdam, The Netherlands %8 2010-06 %G eng %0 Generic %D 2010 %T International Exascale Software Project Roadmap v1.0 %A Jack Dongarra %A 
Pete Beckman %B University of Tennessee Computer Science Technical Report, UT-CS-10-654 %8 2010-05 %G eng %0 Generic %D 2010 %T An Introduction to the MAGMA project - Acceleration of Dense Linear Algebra %A Jack Dongarra %A Stanimire Tomov %I NVIDIA Webinar %8 2010-06 %G eng %U http://developer.download.nvidia.com/CUDA/training/introtomagma.mp4 %0 Generic %D 2010 %T Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. Squyres %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-10-663 %8 2010-11 %G eng %0 Journal Article %J ACM TOMS (submitted), also LAPACK Working Note (LAWN) 211 %D 2010 %T Level-3 Cholesky Kernel Subroutine of a Fully Portable High Performance Minimal Storage Hybrid Format Cholesky Algorithm %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %B ACM TOMS (submitted), also LAPACK Working Note (LAWN) 211 %8 2010-00 %G eng %0 Journal Article %J PARA 2010 %D 2010 %T LINPACK on Future Manycore and GPU Based Systems %A Jack Dongarra %B PARA 2010 %C Reykjavik, Iceland %8 2010-06 %G eng %0 Conference Proceedings %B Proceedings of the 17th EuroMPI conference %D 2010 %T Locality and Topology aware Intra-node Communication Among Multicore CPUs %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Proceedings of the 17th EuroMPI conference %I LNCS %C Stuttgart, Germany %8 2010-09 %G eng %0 Conference Proceedings %B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010) %D 2010 %T Mixed-Tool Performance Analysis on Hybrid Multicore Architectures %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K magma %B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010) %C San Diego, CA %8 2010-09 %G eng %0 Conference Proceedings %B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10) %D 2010 
%T OpenCL Evaluation for Numerical Linear Algebra Library Development %A Peng Du %A Piotr Luszczek %A Jack Dongarra %K magma %B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10) %C Knoxville, TN %8 2010-07 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2010 %T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures %A Hatem Ltaeif %A Jakub Kurzak %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %P 417-423 %8 2010-04 %G eng %0 Conference Proceedings %B Proceedings of the Cray Users' Group Meeting %D 2010 %T Performance Evaluation for Petascale Quantum Simulation Tools %A Stanimire Tomov %A Wenchang Lu %A Jerzy Bernholc %A Shirley Moore %A Jack Dongarra %B Proceedings of the Cray Users' Group Meeting %C Atlanta, GA %8 2010-05 %G eng %0 Generic %D 2010 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-89-85 %8 2010-00 %G eng %0 Journal Article %J ICCS 2010 %D 2010 %T Proceedings of the International Conference on Computational Science %E Peter M. Sloot %E Geert Dick van Albada %E Jack Dongarra %B ICCS 2010 %I Elsevier %C Amsterdam %8 2010-05 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2010 %T QCG-OMPI: MPI Applications on Grids %A Emmanuel Agullo %A Camille Coti %A Thomas Herault %A Julien Langou %A Sylvain Peyronnet %A A. 
Rezmerita %A Franck Cappello %A Jack Dongarra %B Future Generation Computer Systems %V 27 %P 357-369 %8 2010-03 %G eng %0 Journal Article %J Scientific Programming %D 2010 %T QR Factorization for the CELL Processor %A Jakub Kurzak %A Jack Dongarra %B Scientific Programming %V 17 %P 31-42 %8 2010-00 %G eng %0 Conference Proceedings %B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224) %D 2010 %T QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment %A Emmanuel Agullo %A Camille Coti %A Jack Dongarra %A Thomas Herault %A Julien Langou %B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224) %C Atlanta, GA %8 2010-04 %G eng %0 Conference Proceedings %B Proceedings of IPDPS 2011 %D 2010 %T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %K plasma %B Proceedings of IPDPS 2011 %C Anchorage, AK %8 2010-10 %G eng %0 Conference Proceedings %B EuroMPI 2010 Proceedings %D 2010 %T Recent Advances in the Message Passing Interface, Lecture Notes in Computer Science (LNCS) %E Rainer Keller %E Edgar Gabriel %E Michael Resch %E Jack Dongarra %B EuroMPI 2010 Proceedings %I Springer %C Stuttgart, Germany %V 6305 %8 2010-09 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2010 %T Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution, and Inversion %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A Julien Langou %B ACM Transactions on Mathematical Software (TOMS) %C Atlanta, GA %V 37 %8 2010-04 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2010 %T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion %A Fred G. 
Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A Julien Langou %B ACM Transactions on Mathematical Software (TOMS) %V 37 %8 2010-04 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience (online version) %D 2010 %T Redesigning the Message Logging Model for High Performance %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Concurrency and Computation: Practice and Experience (online version) %8 2010-06 %G eng %0 Generic %D 2010 %T Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling %A Jack Dongarra %A Piotr Luszczek %K hpcc %B University of Tennessee Computer Science Technical Report %8 2010-10 %G eng %0 Journal Article %J Proc. of VECPAR'10 (to appear) %D 2010 %T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators %A Hatem Ltaeif %A Stanimire Tomov %A Rajib Nath %A Peng Du %A Jack Dongarra %K magma %K plasma %B Proc. of VECPAR'10 (to appear) %C Berkeley, CA %8 2010-06 %G eng %0 Journal Article %J SC'10 %D 2010 %T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems %A Fengguang Song %A Hatem Ltaeif %A Bilel Hadri %A Jack Dongarra %K plasma %B SC'10 %I ACM SIGARCH/ IEEE Computer Society %C New Orleans, LA %8 2010-11 %G eng %0 Generic %D 2010 %T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems %A Fengguang Song %A Hatem Ltaeif %A Bilel Hadri %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report %V –10-653 %8 2010-04 %G eng %0 Generic %D 2010 %T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Rajib Nath %A Jean Roman %A Samuel Thibault %A Stanimire Tomov %I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster %C Knoxville, TN %8 2010-07 %G eng %0 Journal Article %J 
Concurrency and Computation: Practice and Experience %D 2010 %T Scheduling Dense Linear Algebra Operations on Multicore Processors %A Jakub Kurzak %A Hatem Ltaeif %A Jack Dongarra %A Rosa M. Badia %K gridpac %K plasma %B Concurrency and Computation: Practice and Experience %V 22 %P 15-44 %8 2010-01 %G eng %0 Journal Article %J Journal of Scientific Computing %D 2010 %T Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures %A Hatem Ltaeif %A Jakub Kurzak %A Jack Dongarra %A Rosa M. Badia %K plasma %B Journal of Scientific Computing %V 18 %P 33-50 %8 2010-00 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2010 %T Self-Healing Network for Scalable Fault-Tolerant Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %B Future Generation Computer Systems %V 26 %P 479-485 %8 2010-03 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience (to appear) %D 2010 %T SmartGridRPC: The new RPC model for high performance Grid Computing and Its Implementation in SmartGridSolve %A Thomas Brady %A Alexey Lastovetsky %A Keith Seymour %A Michele Guidolin %A Jack Dongarra %K netsolve %B Concurrency and Computation: Practice and Experience (to appear) %8 2010-01 %G eng %0 Conference Proceedings %B 24th IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2010 %T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures %A Bilel Hadri %A Emmanuel Agullo %A Jack Dongarra %B 24th IEEE International Parallel and Distributed Processing Symposium (submitted) %8 2010-00 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems %A Stanimire Tomov %A Jack Dongarra %A Marc Baboulin %K magma %B Parallel Computing %V 36 %P 232-240 %8 2010-00 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to 
appear) %D 2010 %T Trace-based Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Andreas Knuepfer %A Jack Dongarra %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %B International Journal of High Performance Computing Applications (to appear) %8 2010-00 %G eng %0 Journal Article %J FOSS4G 2010 %D 2010 %T Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures %A Peng Du %A Matthew Parsons %A Erika Fuentes %A Shih-Lung Shaw %A Jack Dongarra %K magma %B FOSS4G 2010 %C Barcelona, Spain %8 2010-09 %G eng %0 Journal Article %J PGI Insider %D 2010 %T Using MAGMA with PGI Fortran %A Stanimire Tomov %A Mathieu Faverge %A Piotr Luszczek %A Jack Dongarra %K magma %B PGI Insider %8 2010-11 %G eng %0 Generic %D 2009 %T Accelerating the Reduction to Upper Hessenberg Form through Hybrid GPU-Based Computing %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-09-642 (also LAPACK Working Note 219) %8 2009-05 %G eng %0 Journal Article %J SciDAC Review %D 2009 %T Accelerating Time-To-Solution for Computational Science and Engineering %A James Demmel %A Jack Dongarra %A Armando Fox %A Sam Williams %A Vasily Volkov %A Katherine Yelick %B SciDAC Review %8 2009-00 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2009 %T Algorithmic Based Fault Tolerance Applied to High Performance Computing %A Jack Dongarra %A George Bosilca %A Remi Delmas %A Julien Langou %B Journal of Parallel and Distributed Computing %V 69 %P 410-416 %8 2009-00 %G eng %0 Journal Article %J IEEE Cluster 2009 %D 2009 %T Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems %A Fengguang Song %A Shirley Moore %A Jack Dongarra %K gridpac %K mumi %B IEEE Cluster 2009 %C New Orleans %8 2009-08 %G eng %0 Journal Article %J Parallel Computing %D 2009 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore 
Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B Parallel Computing %V 35 %P 38-53 %8 2009-00 %G eng %0 Conference Proceedings %B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear) %D 2009 %T Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware %A Emmanuel Agullo %A Bilel Hadri %A Hatem Ltaeif %A Jack Dongarra %B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear) %8 2009-00 %G eng %0 Journal Article %J Lecture Notes in Computer Science: Theoretical Computer Science and General Issues %D 2009 %T Computational Science – ICCS 2009, Proceedings of the 9th International Conference %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M. Sloot %B Lecture Notes in Computer Science: Theoretical Computer Science and General Issues %C Baton Rouge, LA %V - %8 2009-05 %G eng %0 Journal Article %J Numerical Linear Algebra with Applications %D 2009 %T Computing the Conditioning of the Components of a Linear Least-squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B Numerical Linear Algebra with Applications %V 16 %P 517-533 %8 2009-00 %G eng %0 Generic %D 2009 %T Constructing resilient communication infrastructure for runtime environments %A George Bosilca %A Camille Coti %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %8 2009-07 %G eng %0 Journal Article %J ParCo 2009 %D 2009 %T Constructing Resilient Communication Infrastructure for Runtime Environments %A Pierre Lemariner %A George Bosilca %A Camille Coti %A Thomas Herault %A Jack Dongarra %B ParCo 2009 %C Lyon, France %8 2009-09 %G eng %0 Journal Article %J PPAM 2009 %D 2009 %T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems 
%A Jakub Kurzak %A Hatem Ltaeif %A Jack Dongarra %A Rosa M. Badia %B PPAM 2009 %C Poland %8 2009-09 %G eng %0 Conference Proceedings %B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) %D 2009 %T Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems %A Fengguang Song %A Asim YarKhan %A Jack Dongarra %K mumi %K plasma %B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) %C Portland, OR %8 2009-11 %G eng %0 Journal Article %J Submitted to Transaction on Parallel and Distributed Systems %D 2009 %T Enhancing Parallelism of Tile QR Factorization for Multicore Architectures %A Bilel Hadri %A Hatem Ltaeif %A Emmanuel Agullo %A Jack Dongarra %K plasma %B Submitted to Transaction on Parallel and Distributed Systems %8 2009-12 %G eng %0 Generic %D 2009 %T Fully Dynamic Scheduler for Numerical Computing on Multicore Processors %A Jakub Kurzak %A Jack Dongarra %B University of Tennessee Computer Science Department Technical Report, UT-CS-09-643 (Also LAPACK Working Note 220) %8 2009-00 %G eng %0 Conference Proceedings %B Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering %D 2009 %T Grid Computing applied to the Boundary Element Method %A Manoel Cunha %A Jose Telles %A Asim YarKhan %A Jack Dongarra %E B. H. V. Topping %E Peter Iványi %K netsolve %B Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering %I Civil-Comp Press %C Stirlingshire, UK %V 27 %8 2009-00 %G eng %0 Conference Proceedings %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %D 2009 %T A Holistic Approach for Performance Measurement and Analysis for Petascale Applications %A Heike Jagode %A Jack Dongarra %A Sadaf Alam %A Jeffrey Vetter %A W. Spear %A Allen D. 
Malony %E Gabrielle Allen %K point %K test %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %I Springer-Verlag Berlin Heidelberg 2009 %C Baton Rouge, Louisiana %V 2009 %P 686-695 %8 2009-05 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2009 %T The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community %A Jack Dongarra %A Pete Beckman %A Patrick Aerts %A Franck Cappello %A Thomas Lippert %A Satoshi Matsuoka %A Paul Messina %A Terry Moore %A Rick Stevens %A Anne Trefethen %A Mateo Valero %B International Journal of High Performance Computing Applications (to appear) %8 2009-07 %G eng %0 Journal Article %J ISC'09 %D 2009 %T I/O Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Shirley Moore %A Dan Terpstra %A Jack Dongarra %A Andreas Knuepfer %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %K test %B ISC'09 %C Hamburg, Germany %8 2009-06 %G eng %0 Conference Proceedings %B 9th International Conference on Computational Science (ICCS 2009) %D 2009 %T A Note on Auto-tuning GEMM for GPUs %A Yinan Li %A Jack Dongarra %A Stanimire Tomov %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M. 
Sloot %B 9th International Conference on Computational Science (ICCS 2009) %C Baton Rouge, LA %P 884-892 %8 2009-05 %G eng %R 10.1007/978-3-642-01970-8_89 %0 Generic %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Rajib Nath %A Stanimire Tomov %A Asim YarKhan %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, OR %8 2009-11 %G eng %0 Conference Proceedings %B Journal of Physics: Conference Series %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Stanimire Tomov %K magma %K plasma %B Journal of Physics: Conference Series %V 180 %8 2009-00 %G eng %0 Generic %D 2009 %T Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project %A Rajib Nath %A Jack Dongarra %A Stanimire Tomov %A Hatem Ltaeif %A Peng Du %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, Oregon %8 2009-11 %G eng %0 Journal Article %J Parallel Computing %D 2009 %T Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor %A Wesley Alvaro %A Jakub Kurzak %A Jack Dongarra %B Parallel Computing %V 35 %P 138-150 %8 2009-00 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems (to appear) %D 2009 %T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures %A Hatem Ltaeif %A Jakub Kurzak %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems (to appear) %8 2009-05 %G eng %0 Journal Article %J in Cyberinfrastructure Technologies and Applications %D 2009 %T Parallel Dense Linear 
Algebra Software in the Multicore Era %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %E Junwei Cao %K plasma %B in Cyberinfrastructure Technologies and Applications %I Nova Science Publishers, Inc. %P 9-24 %8 2009-00 %G eng %0 Journal Article %J Cluster Computing Journal: Special Issue on High Performance Distributed Computing %D 2009 %T Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software %A Lamia Youseff %A Keith Seymour %A Haihang You %A Dmitrii Zagorodnov %A Jack Dongarra %A Rich Wolski %B Cluster Computing Journal: Special Issue on High Performance Distributed Computing %I Springer Netherlands %V 12 %P 101-122 %8 2009-00 %G eng %0 Conference Proceedings %B Proceedings of CUG09 %D 2009 %T Performance evaluation for petascale quantum simulation tools %A Stanimire Tomov %A Wenchang Lu %A Jerzy Bernholc %A Shirley Moore %A Jack Dongarra %K doe-nano %B Proceedings of CUG09 %C Atlanta, GA %8 2009-05 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2009 %T The Problem with the Linpack Benchmark Matrix Generator %A Julien Langou %A Jack Dongarra %K hpl %B International Journal of High Performance Computing Applications %V 23 %P 5-14 %8 2009-00 %G eng %0 Journal Article %J Scientific Programming (to appear) %D 2009 %T QR Factorization for the CELL Processor %A Jakub Kurzak %A Jack Dongarra %K plasma %B Scientific Programming (to appear) %8 2009-00 %G eng %0 Conference Paper %B CLUSTER '09 %D 2009 %T Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery %A Aurelien Bouteiller %A Thomas Ropars %A George Bosilca %A Christine Morin %A Jack Dongarra %K fault tolerant computing %K libraries message passing %K parallel machines %K protocols %X With the growing scale of high performance computing platforms, fault tolerance has become a major issue. 
Among the various approaches for providing fault tolerance to MPI applications, message logging has been shown to tolerate higher failure rates. However, this advantage comes at the expense of a higher overhead on communications, due to latency-intrusive logging of events to stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable. %B CLUSTER '09 %I IEEE %C New Orleans %8 2009-08 %G eng %R 10.1109/CLUSTR.2009.5289157 %0 Journal Article %J in Birth of Numerical Analysis (to appear) %D 2009 %T Recent Trends in High Performance Computing %A Jack Dongarra %A Hans Meuer %A Horst D. Simon %A Erich Strohmaier %B in Birth of Numerical Analysis (to appear) %8 2009-00 %G eng %0 Journal Article %J ACM TOMS (to appear) %D 2009 %T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion %A Fred G. 
Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A Julien Langou %B ACM TOMS (to appear) %8 2009-00 %G eng %0 Journal Article %J in Handbook of Research on Scalable Computing Technologies (to appear) %D 2009 %T Reliability and Performance Modeling and Analysis for Grid Computing %A Yuan-Shun Dai %A Jack Dongarra %E Kuan-Ching Li %E Ching-Hsien Hsu %E Laurence Yang %E Jack Dongarra %E Hans Zima %B in Handbook of Research on Scalable Computing Technologies (to appear) %I IGI Global %P 219-245 %8 2009-00 %G eng %0 Conference Proceedings %B The International Conference on Computational Science 2009 (ICCS 2009) %D 2009 %T A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling %A Fengguang Song %A Shirley Moore %A Jack Dongarra %K plasma %B The International Conference on Computational Science 2009 (ICCS 2009) %C Baton Rouge, LA %V 5544 %P 195-204 %8 2009-05 %G eng %0 Journal Article %J Concurrency Practice and Experience (to appear) %D 2009 %T Scheduling Linear Algebra Operations on Multicore Processors %A Jakub Kurzak %A Hatem Ltaeif %A Jack Dongarra %A Rosa M. Badia %K plasma %B Concurrency Practice and Experience (to appear) %8 2009-00 %G eng %0 Generic %D 2009 %T Scheduling Linear Algebra Operations on Multicore Processors %A Jakub Kurzak %A Hatem Ltaeif %A Jack Dongarra %A Rosa M. 
Badia %B University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213) %8 2009-00 %G eng %0 Generic %D 2009 %T Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures %A Bilel Hadri %A Hatem Ltaeif %A Emmanuel Agullo %A Jack Dongarra %K plasma %B Innovative Computing Laboratory Technical Report (also LAPACK Working Note 222 and CS Tech Report UT-CS-09-645) %8 2009-09 %G eng %0 Conference Proceedings %B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010) %D 2009 %T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures %A Bilel Hadri %A Hatem Ltaeif %A Emmanuel Agullo %A Jack Dongarra %K plasma %B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010) %C Atlanta, GA %8 2009-12 %G eng %0 Journal Article %J Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting %D 2009 %T Towards Efficient MapReduce Using MPI %A Torsten Hoefler %A Yuan-Shun Dai %A Jack Dongarra %E M. Ropo %E J Westerholm %E Jack Dongarra %B Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting %I Springer Berlin / Heidelberg %C Espoo, Finland %V 5759 %P 240-249 %8 2009-00 %G eng %0 Generic %D 2009 %T Trace-based Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Andreas Knuepfer %A Jack Dongarra %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. 
Nagel %K test %B Innovative Computing Laboratory Technical Report %8 2009-04 %G eng %0 Journal Article %J in Cloud Computing and Software Services: Theory and Techniques (to appear) %D 2009 %T Transparent Cross-Platform Access to Software Services using GridSolve and GridRPC %A Keith Seymour %A Asim YarKhan %A Jack Dongarra %E Syed Ahson %E Mohammad Ilyas %K netsolve %B in Cloud Computing and Software Services: Theory and Techniques (to appear) %I CRC Press %8 2009-00 %G eng %0 Conference Proceedings %B 7th International Parallel Processing and Applied Mathematics Conference, Lecture Notes in Computer Science %D 2008 %E Roman Wyrzykowski %E Jack Dongarra %E Konrad Karczewski %E Jerzy Wasniewski %B 7th International Parallel Processing and Applied Mathematics Conference, Lecture Notes in Computer Science %I Springer Berlin %C Gdansk, Poland %V 4967 %8 2008-01 %G eng %0 Conference Proceedings %B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science %D 2008 %E Marian Bubak %E Geert Dick van Albada %E Jack Dongarra %E Peter M. 
Sloot %B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science %I Springer Berlin %C Krakow, Poland %V 5101 %8 2008-01 %G eng %0 Journal Article %J 15th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science %D 2008 %E Alexey Lastovetsky %E Tahar Kechadi %E Jack Dongarra %B 15th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science %I Springer Berlin %C Dublin Ireland %V 5205 %8 2008-01 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2008 %T Algorithm-Based Fault Tolerance for Fail-Stop Failures %A Zizhong Chen %A Jack Dongarra %K FT-MPI %K lapack %K scalapack %B IEEE Transactions on Parallel and Distributed Systems %V 19 %8 2008-01 %G eng %0 Generic %D 2008 %T Algorithmic Based Fault Tolerance Applied to High Performance Computing %A George Bosilca %A Remi Delmas %A Jack Dongarra %A Julien Langou %B University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205) %8 2008-01 %G eng %0 Generic %D 2008 %T Analytical Modeling for Affinity-Based Thread Scheduling on Multicore Platforms %A Fengguang Song %A Shirley Moore %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-08-626 %8 2008-01 %G eng %0 Conference Proceedings %B The 3rd international Workshop on Automatic Performance Tuning %D 2008 %T A Comparison of Search Heuristics for Empirical Code Optimization %A Keith Seymour %A Haihang You %A Jack Dongarra %K gco %B The 3rd international Workshop on Automatic Performance Tuning %C Tsukuba, Japan %8 2008-10 %G eng %0 Journal Article %J VECPAR '08, High Performance Computing for Computational Science %D 2008 %T Computing the Conditioning of the Components of a Linear Least Squares Solution %A Marc 
Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B VECPAR '08, High Performance Computing for Computational Science %C Toulouse, France %8 2008-01 %G eng %0 Journal Article %J in Advances in Computers %D 2008 %T DARPA's HPCS Program: History, Models, Tools, Languages %A Jack Dongarra %A Robert Graybill %A William Harrod %A Robert Lucas %A Ewing Lusk %A Piotr Luszczek %A Janice McMahon %A Allan Snavely %A Jeffrey Vetter %A Katherine Yelick %A Sadaf Alam %A Roy Campbell %A Laura Carrington %A Tzu-Yi Chen %A Omid Khalili %A Jeremy Meredith %A Mustafa Tikir %E M. Zelkowitz %B in Advances in Computers %I Elsevier %V 72 %8 2008-01 %G eng %0 Generic %D 2008 %T Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project) %A Marc Baboulin %A James Demmel %A Jack Dongarra %A Stanimire Tomov %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08) %C Austin, TX %8 2008-11 %G eng %0 Journal Article %J in High Performance Computing and Grids in Action %D 2008 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B in High Performance Computing and Grids in Action %I IOS Press %C Amsterdam %8 2008-01 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPCMP User Group Conference %D 2008 %T Exploring New Architectures in Accelerating CFD for Air Force Applications %A Jack Dongarra %A Shirley Moore %A Gregory D. 
Peterson %A Stanimire Tomov %A Jeff Allred %A Vincent Natoli %A David Richie %K magma %B Proceedings of the DoD HPCMP User Group Conference %C Seattle, Washington %8 2008-01 %G eng %0 Generic %D 2008 %T Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the CELL Processor %A Wesley Alvaro %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report %8 2008-01 %G eng %0 Journal Article %J in Beautiful Code Leading Programmers Explain How They Think (Chapter 14) %D 2008 %T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination %A Jack Dongarra %A Piotr Luszczek %E Andy Oram %E G. Wilson %B in Beautiful Code Leading Programmers Explain How They Think (Chapter 14) %P 243-282 %8 2008-01 %G eng %0 Generic %D 2008 %T HPCS Library Study Effort %A Jack Dongarra %A James Demmel %A Parry Husbands %A Piotr Luszczek %B University of Tennessee Computer Science Technical Report, UT-CS-08-617 %8 2008-01 %G eng %0 Conference Proceedings %B ACM/IEEE International Symposium on High Performance Distributed Computing %D 2008 %T The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software %A Lamia Youseff %A Keith Seymour %A Haihang You %A Jack Dongarra %A Rich Wolski %K gco %K netsolve %B ACM/IEEE International Symposium on High Performance Distributed Computing %C Boston, MA. 
%8 2008-06 %G eng %0 Journal Article %J Computing and Informatics %D 2008 %T Interactive Grid-Access Using Gridsolve and Giggle %A Marcus Hardt %A Keith Seymour %A Jack Dongarra %A Michael Zapf %A Nicole Ruiter %K netsolve %B Computing and Informatics %V 27 %P 233-248,ISSN1335-9150 %8 2008-00 %G eng %0 Conference Proceedings %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %D 2008 %T Interior State Computation of Nano Structures %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %C Trondheim, Norway %8 2008-05 %G eng %0 Journal Article %J Concurrency: Practice and Experience %D 2008 %T The LINPACK Benchmark: Past, Present, and Future %A Jack Dongarra %A Piotr Luszczek %A Antoine Petitet %K hpl %B Concurrency: Practice and Experience %V 15 %P 803-820 %8 2008-00 %G eng %0 Conference Proceedings %B 2008 PPoPP Conference %D 2008 %T Matrix Product on Heterogeneous Master Worker Platforms %A Jack Dongarra %A Jean-Francois Pineau %A Yves Robert %A Frederic Vivien %B 2008 PPoPP Conference %C Salt Lake City, Utah %8 2008-01 %G eng %0 Journal Article %J IEEE Annals of the History of Computing %D 2008 %T Netlib and NA-Net: Building a Scientific Computing Community %A Jack Dongarra %A Gene H. 
Golub %A Eric Grosse %A Cleve Moler %A Keith Moore %B IEEE Annals of the History of Computing %V 30 %P 30-41 %8 2008-01 %G eng %0 Generic %D 2008 %T Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited %A Hatem Ltaeif %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208) %8 2008-08 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2008 %T Parallel Tiled QR Factorization for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %B Concurrency and Computation: Practice and Experience %V 20 %P 1573-1590 %8 2008-01 %G eng %0 Journal Article %J Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming %D 2008 %T Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications %A Oscar Hernandez %A Fengguang Song %A Barbara Chapman %A Jack Dongarra %A Bernd Mohr %A Shirley Moore %A Felix Wolf %B Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming %I Springer Berlin / Heidelberg %V 4315 %8 2008-00 %G eng %0 Generic %D 2008 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, CS-89-85 %8 2008-01 %G eng %0 Journal Article %J Proc. SciDAC 2008 %D 2008 %T PERI Auto-tuning %A David Bailey %A Jacqueline Chame %A Chun Chen %A Jack Dongarra %A Mary Hall %A Jeffrey K. Hollingsworth %A Paul D. Hovland %A Shirley Moore %A Keith Seymour %A Jaewook Shin %A Ananta Tiwari %A Sam Williams %A Haihang You %K gco %B Proc. 
SciDAC 2008 %I Journal of Physics %C Seattle, Washington %V 125 %8 2008-01 %G eng %0 Journal Article %J Computing in Science and Engineering %D 2008 %T The PlayStation 3 for High Performance Scientific Computing %A Jakub Kurzak %A Alfredo Buttari %A Piotr Luszczek %A Jack Dongarra %B Computing in Science and Engineering %P 80-83 %8 2008-01 %G eng %0 Generic %D 2008 %T The PlayStation 3 for High Performance Scientific Computing %A Jakub Kurzak %A Alfredo Buttari %A Piotr Luszczek %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2008-01 %G eng %0 Generic %D 2008 %T The Problem with the Linpack Benchmark Matrix Generator %A Jack Dongarra %A Julien Langou %B University of Tennessee Computer Science Technical Report, UT-CS-08-621 (also LAPACK Working Note 206) %8 2008-06 %G eng %0 Generic %D 2008 %T QR Factorization for the CELL Processor %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-08-616 (also LAPACK Working Note 201) %8 2008-05 %G eng %0 Generic %D 2008 %T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion %A Fred G. 
Gustavson %A Jerzy Wasniewski %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-08-614 (also LAPACK Working Note 199) %8 2008-04 %G eng %0 Conference Proceedings %B International Supercomputer Conference (ISC 2008) %D 2008 %T Redesigning the Message Logging Model for High Performance %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B International Supercomputer Conference (ISC 2008) %C Dresden, Germany %8 2008-01 %G eng %0 Generic %D 2008 %T Request Sequencing: Enabling Workflow for Efficient Parallel Problem Solving in GridSolve %A Yinan Li %A Jack Dongarra %K netsolve %B ICL Technical Report %8 2008-04 %G eng %0 Conference Proceedings %B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted) %D 2008 %T Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve %A Yinan Li %A Jack Dongarra %A Keith Seymour %A Asim YarKhan %B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted) %C Shenzhen, China %8 2008-10 %G eng %0 Journal Article %J International Journal of Foundations of Computer Science (IJFCS) %D 2008 %T Revisiting Matrix Product on Master-Worker Platforms %A Jack Dongarra %A Jean-Francois Pineau %A Yves Robert %A Zhiao Shi %A Frederic Vivien %B International Journal of Foundations of Computer Science (IJFCS) %V 19 %P 1317-1336 %8 2008-12 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2008 %T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %V 19 %P 1-11 %8 2008-01 %G eng %0 Generic %D 2008 %T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures %A Marc Baboulin %A Jack Dongarra %A Stanimire Tomov %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200) %8 2008-01 %G 
eng %0 Conference Proceedings %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %D 2008 %T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures %A Marc Baboulin %A Stanimire Tomov %A Jack Dongarra %K magma %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %C Trondheim Norway %8 2008-05 %G eng %0 Journal Article %J Journal of Computational Physics %D 2008 %T State-of-the-Art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-Systems %A Christof Voemel %A Stanimire Tomov %A Osni Marques %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %B Journal of Computational Physics %V 227 %P 7113-7124 %8 2008-01 %G eng %0 Generic %D 2008 %T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems %A Stanimire Tomov %A Jack Dongarra %A Marc Baboulin %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210) %8 2008-01 %G eng %0 Journal Article %J Computing in Science and Engineering %D 2008 %T A Tribute to Gene Golub %A Jack Dongarra %B Computing in Science and Engineering %I IEEE %P 5 %8 2008-01 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software %D 2008 %T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %K plasma %B ACM Transactions on Mathematical Software %V 34 %P 17-22 %8 2008-00 %G eng %0 Generic %D 2007 %T Automated Empirical Tuning of a Multiresolution Analysis Kernel %A Haihang You %A Keith Seymour %A Jack Dongarra %A Shirley Moore %K gco %B ICL Technical Report %P 10 %8 2007-01 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2007 %T Automatic Analysis of Inefficiency Patterns in Parallel Applications %A Felix Wolf %A Bernd Mohr %A Jack Dongarra %A 
Shirley Moore %B Concurrency and Computation: Practice and Experience %V 19 %P 1481-1496 %8 2007-08 %G eng %0 Conference Proceedings %B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07) %D 2007 %T Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07) %I Springer %C Niagara Falls, Canada %8 2007-08 %G eng %0 Conference Proceedings %B 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted) %D 2007 %T Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems %A Jack Dongarra %A Emmanuel Jeannot %A Erik Saule %A Zhiao Shi %B 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted) %C San Diego, CA %8 2007-06 %G eng %0 Generic %D 2007 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report %8 2007-01 %G eng %0 Generic %D 2007 %T Computing the Conditioning of the Components of a Linear Least Squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B University of Tennessee Computer Science Technical Report %8 2007-01 %G eng %0 Journal Article %J DOE SciDAC Review (to appear) %D 2007 %T Creating Software Technology to Harness the Power of Leadership-class Computing Systems %A John Mellor-Crummey %A Pete Beckman %A Jack Dongarra %A Barton Miller %A Katherine Yelick %B DOE SciDAC Review (to appear) %8 2007-06 %G eng %0 Journal Article %J Euro-Par 2007 %D 2007 %T Decision Trees and MPI Collective Algorithm Selection Problem %A Jelena Pjesivac–Grbovic %A George Bosilca %A Graham Fagg %A Thara Angskun %A Jack Dongarra %K ftmpi %B 
Euro-Par 2007 %I Springer %C Rennes, France %P 105–115 %8 2007-08 %G eng %0 Journal Article %J in Petascale Computing: Algorithms and Applications (to appear) %D 2007 %T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach %A Jack Dongarra %A Zizhong Chen %A George Bosilca %A Julien Langou %B in Petascale Computing: Algorithms and Applications (to appear) %I Chapman & Hall - CRC Press %8 2007-00 %G eng %0 Generic %D 2007 %T Empirical Tuning of a Multiresolution Analysis Kernel using a Specialized Code Generator %A Haihang You %A Keith Seymour %A Jack Dongarra %A Shirley Moore %K gco %B ICL Technical Report %8 2007-01 %G eng %0 Journal Article %J In High Performance Computing and Grids in Action (to appear) %D 2007 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B In High Performance Computing and Grids in Action (to appear) %I IOS Press %C Amsterdam %8 2007-00 %G eng %0 Conference Proceedings %B IEEE International Symposium on High Performance Distributed Computing %D 2007 %T Feedback-Directed Thread Scheduling with Memory Considerations %A Fengguang Song %A Shirley Moore %A Jack Dongarra %B IEEE International Symposium on High Performance Distributed Computing %C Monterey Bay, CA %8 2007-06 %G eng %0 Conference Proceedings %B Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006) %D 2007 %T GridSolve: The Evolution of Network Enabled Solver %A Asim YarKhan %A Jack Dongarra %A Keith Seymour %E Patrick Gaffney %K netsolve %B Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006) %I Springer %P 215-226 %8 2007-00 %G eng %0 Journal Article %J International Journal for High Performance Computer 
Applications %D 2007 %T High Performance Development for High End Computing with Python Language Wrapper (PLW) %A Jack Dongarra %A Piotr Luszczek %B International Journal for High Performance Computer Applications %V 21 %P 360-369 %8 2007-00 %G eng %0 Journal Article %J in Beautiful Code Leading Programmers Explain How They Think %D 2007 %T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination %A Jack Dongarra %A Piotr Luszczek %E Andy Oram %E Greg Wilson %B in Beautiful Code Leading Programmers Explain How They Think %I O'Reilly Media, Inc. %8 2007-06 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2007 %T Implementation of Mixed Precision in Solving Systems of Linear Equations on the Cell Processor %A Jakub Kurzak %A Jack Dongarra %B Concurrency and Computation: Practice and Experience %V 19 %P 1371-1385 %8 2007-07 %G eng %0 Journal Article %J Parallel Processing Letters %D 2007 %T Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware %A Emmanuel Jeannot %A Keith Seymour %A Asim YarKhan %A Jack Dongarra %B Parallel Processing Letters %V 17 %P 47-59 %8 2007-03 %G eng %0 Conference Proceedings %B Proceedings of the 2007 International Conference on Parallel Processing %D 2007 %T L2 Cache Modeling for Scientific Applications on Chip Multi-Processors %A Fengguang Song %A Shirley Moore %A Jack Dongarra %B Proceedings of the 2007 International Conference on Parallel Processing %I IEEE Computer Society %C Xi'an, China %8 2007-01 %G eng %0 Generic %D 2007 %T Limitations of the Playstation 3 for High Performance Cluster Computing %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %B University of Tennessee Computer Science Technical Report, UT-CS-07-597 (Also LAPACK Working Note 185) %8 2007-00 %G eng %0 Journal Article %J International Journal of High Performance Computer Applications (to appear) %D 2007 %T Mixed Precision Iterative Refinement Techniques for the Solution of 
Dense Linear Systems %A Alfredo Buttari %A Jack Dongarra %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Jakub Kurzak %B International Journal of High Performance Computer Applications (to appear) %8 2007-08 %G eng %0 Journal Article %J Parallel Computing (Special Edition: EuroPVM/MPI 2006) %D 2007 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A George Bosilca %A Graham Fagg %A Thara Angskun %A Jack Dongarra %K ftmpi %B Parallel Computing (Special Edition: EuroPVM/MPI 2006) %I Elsevier %8 2007-00 %G eng %0 Conference Proceedings %B Journal of Physics: Conference Series, SciDAC 2007 %D 2007 %T Multithreading for synchronization tolerance in matrix factorization %A Alfredo Buttari %A Jack Dongarra %A Parry Husbands %A Jakub Kurzak %A Katherine Yelick %B Journal of Physics: Conference Series, SciDAC 2007 %V 78 %8 2007-01 %G eng %0 Journal Article %J In IEEE Annals of the History of Computing (to appear) %D 2007 %T Netlib and NA-Net: building a scientific computing community %A Jack Dongarra %A Gene H. Golub %A Cleve Moler %A Keith Moore %B In IEEE Annals of the History of Computing (to appear) %8 2007-08 %G eng %0 Conference Proceedings %B The International Conference on Parallel and Distributed Computing, applications and Technologies (PDCAT) %D 2007 %T Optimal Routing in Binomial Graph Networks %A Thara Angskun %A George Bosilca %A Brad Vander Zanden %A Jack Dongarra %K ftmpi %B The International Conference on Parallel and Distributed Computing, applications and Technologies (PDCAT) %I IEEE Computer Society %C Adelaide, Australia %8 2007-12 %G eng %0 Generic %D 2007 %T Parallel Tiled QR Factorization for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Dept. 
Technical Report, UT-CS-07-598 (also LAPACK Working Note 190) %8 2007-00 %G eng %0 Journal Article %J Cluster computing %D 2007 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B Cluster computing %I Springer Netherlands %V 10 %P 127-143 %8 2007-06 %G eng %0 Generic %D 2007 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Dept. Technical Report CS-89-85 %8 2007-00 %G eng %0 Journal Article %J SIAM SISC (to appear) %D 2007 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A Julien Langou %A Zizhong Chen %A George Bosilca %A Jack Dongarra %B SIAM SISC (to appear) %8 2007-05 %G eng %0 Conference Proceedings %B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) %D 2007 %T Reliability Analysis of Self-Healing Network using Discrete-Event Simulation %A Thara Angskun %A George Bosilca %A Graham Fagg %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) %I IEEE Computer Society %P 437-444 %8 2007-05 %G eng %0 Journal Article %J SciDAC Review %D 2007 %T Remembering Ken Kennedy %A Jack Dongarra %A et al., %B SciDAC Review %V 5 %8 2007-00 %G eng %0 Journal Article %J Accepted for Euro PVM/MPI 2007 %D 2007 %T Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %K ftmpi %B Accepted for Euro PVM/MPI 2007 %I Springer %8 2007-09 %G eng %0 Journal Article %J International Journal of Foundations of Computer Science (IJFCS) (accepted) %D 2007 %T Revisiting Matrix Product on Master-Worker Platforms %A Jack Dongarra %A Jean-Francois Pineau %A Yves Robert %A Zhiao Shi %A Frederic 
Vivien %B International Journal of Foundations of Computer Science (IJFCS) (accepted) %8 2007-00 %G eng %0 Conference Proceedings %B Proceedings of the 2007 International Conference on Computational Science (ICCS 2007) %D 2007 %T Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors %A Karl Fürlinger %A Michael Gerndt %A Jack Dongarra %E Yong Shi %E Jack Dongarra %E Geert Dick van Albada %E Peter M. Sloot %K kojak %B Proceedings of the 2007 International Conference on Computational Science (ICCS 2007) %I Springer LNCS %C Beijing, China %V 4487-4490 %P 815-822 %G eng %R 10.1007/978-3-540-72586-2_115 %0 Generic %D 2007 %T SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3 %A Alfredo Buttari %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %A George Bosilca %K multi-core %B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595 %8 2007-00 %G eng %0 Conference Proceedings %B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS %D 2007 %T Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing %A Zizhong Chen %A Ming Yang %A Guillermo Francia III %A Jack Dongarra %B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS %P 1-8 %8 2007-03 %G eng %0 Conference Proceedings %B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007) %D 2007 %T Self-Healing in Binomial Graph Networks %A Thara Angskun %A George Bosilca %A Jack Dongarra %B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007) %C Vilamoura, Algarve, Portugal %8 2007-11 %G eng %0 Generic %D 2007 %T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %K lapack %B UT Computer Science Technical Report (Also LAPACK 
Working Note 184) %8 2007-01 %G eng %0 Journal Article %J Journal of Computational Physics %D 2007 %T The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot %A Christof Voemel %A Stanimire Tomov %A Lin-Wang Wang %A Osni Marques %A Jack Dongarra %B Journal of Computational Physics %V 223 %P 774-782 %8 2007-00 %G eng %0 Conference Proceedings %B Proceedings of the 13th International Euro-Par Conference on Parallel Processing (Euro-Par '07) %D 2007 %T On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications %A Karl Fürlinger %A Jack Dongarra %A Michael Gerndt %K kojak %B Proceedings of the 13th International Euro-Par Conference on Parallel Processing (Euro-Par '07) %I Springer LNCS %C Rennes, France %8 2007-01 %G eng %0 Conference Proceedings %B IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium %D 2006 %T Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources %A Zizhong Chen %A Jack Dongarra %B IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium %C Rhodes Island, Greece %8 2006-01 %G eng %0 Generic %D 2006 %T ATLAS on the BlueGene/L – Preliminary Results %A Keith Seymour %A Haihang You %A Jack Dongarra %K gco %B ICL Technical Report %8 2006-01 %G eng %0 Journal Article %J International Journal of Computational Science and Engineering %D 2006 %T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Jack Dongarra %A Andrew Canning %A Lin-Wang Wang %B International Journal of Computational Science and Engineering %V 2 %P 205-212 %8 2006-00 %G eng %0 Conference Proceedings %B 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted) %D 2006 %T Experiments with Strassen's Algorithm: From Sequential to Parallel %A Fengguang Song %A Jack 
Dongarra %A Shirley Moore %B 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted) %C Dallas, Texas %8 2006-01 %G eng %0 Journal Article %J University of Tennessee Computer Science Tech Report %D 2006 %T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %K iter-ref %B University of Tennessee Computer Science Tech Report %8 2006-04 %G eng %0 Journal Article %J 2006 Euro PVM/MPI (submitted) %D 2006 %T Flexible collective communication tuning architecture applied to Open MPI %A Graham Fagg %A Jelena Pjesivac–Grbovic %A George Bosilca %A Thara Angskun %A Jack Dongarra %K ftmpi %B 2006 Euro PVM/MPI (submitted) %C Bonn, Germany %8 2006-01 %G eng %0 Conference Proceedings %B SC06 Conference Tutorial %D 2006 %T The HPC Challenge (HPCC) Benchmark Suite %A Piotr Luszczek %A David Bailey %A Jack Dongarra %A Jeremy Kepner %A Robert Lucas %A Rolf Rabenseifner %A Daisuke Takahashi %K hpcc %K hpcchallenge %B SC06 Conference Tutorial %I IEEE %C Tampa, Florida %8 2006-11 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T The Impact of Multicore on Math Software %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %K plasma %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J Euro PVM/MPI 2006 %D 2006 %T Implementation and Usage of the PERUSE-Interface in Open MPI %A Rainer Keller %A George Bosilca %A Graham Fagg %A Michael Resch %A Jack Dongarra %B Euro PVM/MPI 2006 %C Bonn, Germany %8 2006-09 %G eng %0 Journal Article %J University of Tennessee Computer Science Tech Report %D 2006 %T Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the CELL Processor %A Jakub Kurzak %A Jack Dongarra %K iter-ref %B University of Tennessee Computer Science Tech Report %8 2006-09 %G eng %0 Journal Article %J 
University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178 %D 2006 %T Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead %A Jakub Kurzak %A Jack Dongarra %B University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178 %8 2006-01 %G eng %0 Journal Article %J Parallel Processing Letters %D 2006 %T Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Server %A Emmanuel Jeannot %A Keith Seymour %A Asim YarKhan %A Jack Dongarra %K netsolve %B Parallel Processing Letters %V 17 %P 47-59 %8 2006-03 %G eng %0 Generic %D 2006 %T Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors %A Fengguang Song %A Shirley Moore %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2006-01 %G eng %0 Generic %D 2006 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B ICL Technical Report %8 2006-00 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2006 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %V 4192 %P 40-48 %8 2006-09 %G eng %0 Conference Proceedings %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %D 2006 %T Performance evaluation of eigensolvers in nano-structure computations %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %K doe-nano %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %8 2006-01 %G eng %0 Conference Proceedings %B Second International Workshop on OpenMP %D 2006 %T Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications %A Oscar Hernandez %A Fengguang 
Song %A Barbara Chapman %A Jack Dongarra %A Bernd Mohr %A Shirley Moore %A Felix Wolf %K kojak %B Second International Workshop on OpenMP %C Reims, France %8 2006-01 %G eng %0 Generic %D 2006 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Department Technical Report, UT-CS-04-526 %V –89-95 %8 2006-01 %G eng %0 Journal Article %J J. Phys.: Conf. Ser. 46 %D 2006 %T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures %A Alex Zunger %A Alberto Franceschetti %A Gabriel Bester %A Wesley B. Jones %A Kwiseon Kim %A Peter A. Graf %A Lin-Wang Wang %A Andrew Canning %A Osni Marques %A Christof Voemel %A Jack Dongarra %A Julien Langou %A Stanimire Tomov %K DOE_NANO %B J. Phys.: Conf. Ser. 46 %V :101088/1742-6596/46/1/040 %P 292-298 %8 2006-01 %G eng %0 Conference Proceedings %B Proceedings of IEEE CCGrid 2006 %D 2006 %T Proposal of MPI operation level Checkpoint/Rollback and one implementation %A Yuan Tang %A Graham Fagg %A Jack Dongarra %K HARNESS/FT-PI %B Proceedings of IEEE CCGrid 2006 %I IEEE Computer Society %8 2006-01 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries %A James Demmel %A Jack Dongarra %A B. Parlett %A William Kahan %A Ming Gu %A David Bindel %A Yozo Hida %A Xiaoye Li %A Osni Marques %A Jason E. 
Riedy %A Christof Voemel %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Julien Langou %A Stanimire Tomov %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms) %D 2006 %T Recent Developments in GridSolve %A Asim YarKhan %A Keith Seymour %A Kiran Sagi %A Zhiao Shi %A Jack Dongarra %E Yves Robert %K netsolve %B International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms) %I Sage Science Press %V 20 %8 2006-00 %G eng %0 Journal Article %J 2006 Euro PVM/MPI %D 2006 %T Scalable Fault Tolerant Protocol for Parallel Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %B 2006 Euro PVM/MPI %C Bonn, Germany %8 2006-00 %G eng %0 Journal Article %J IBM Journal of Research and Development %D 2006 %T Self Adapting Numerical Software SANS Effort %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Victor Eijkhout %A Graham Fagg %A Erika Fuentes %A Julien Langou %A Piotr Luszczek %A Jelena Pjesivac–Grbovic %A Keith Seymour %A Haihang You %A Sathish Vadhiyar %K gco %B IBM Journal of Research and Development %V 50 %P 223-238 %8 2006-01 %G eng %0 Conference Proceedings %B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems %D 2006 %T Self-Healing Network for Scalable Fault Tolerant Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems %C Innsbruck, Austria %8 2006-01 %G eng %0 Conference Proceedings %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %D 2006 %T Towards bulk based preconditioning for quantum dot computations %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A 
Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %K doe-nano %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %8 2006-01 %G eng %0 Generic %D 2006 %T Twenty-Plus Years of Netlib and NA-Net %A Jack Dongarra %A Gene H. Golub %A Eric Grosse %A Cleve Moler %A Keith Moore %B University of Tennessee Computer Science Department Technical Report, UT-CS-04-526 %8 2006-00 %G eng %0 Journal Article %J Journal of Computational Physics (submitted) %D 2006 %T The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot %A Christof Voemel %A Stanimire Tomov %A Lin-Wang Wang %A Osni Marques %A Jack Dongarra %K doe-nano %B Journal of Computational Physics (submitted) %8 2006-01 %G eng %0 Generic %D 2005 %T Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources %A Zizhong Chen %A Jack Dongarra %B University of Tennessee Computer Science Department Technical Report %V –05-561 %8 2005-11 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience, Special issue "Automatic Performance Analysis" (submitted) %D 2005 %T Automatic analysis of inefficiency patterns in parallel applications %A Felix Wolf %A Bernd Mohr %A Jack Dongarra %A Shirley Moore %K kojak %B Concurrency and Computation: Practice and Experience, Special issue "Automatic Performance Analysis" (submitted) %8 2005-00 %G eng %0 Conference Proceedings %B In Proceedings of the International Conference on Parallel Processing %D 2005 %T Automatic Experimental Analysis of Communication Patterns in Virtual Topologies %A Nikhil Bhatia %A Fengguang Song %A Felix Wolf %A Jack Dongarra %A Bernd Mohr %A Shirley Moore %K kojak %B In Proceedings of the International Conference on Parallel Processing %I IEEE Computer Society %C Oslo, Norway %8 2005-06 %G eng %0 Journal Article %J Future Generation Computing Systems %D 2005 %T Biological Sequence Alignment on the Computational Grid Using the GrADS Framework %A Asim YarKhan %A 
Jack Dongarra %K grads %B Future Generation Computing Systems %I Elsevier %V 21 %P 980-986 %8 2005-06 %G eng %0 Conference Proceedings %B Proceedings of 5th International Conference on Computational Science (ICCS) %D 2005 %T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %E V. S. Sunderam %E Geert Dick van Albada %E Peter M. Sloot %E Jack Dongarra %K doe-nano %B Proceedings of 5th International Conference on Computational Science (ICCS) %I Springer's Lecture Notes in Computer Science %C Atlanta, GA, USA %P 317-325 %8 2005-01 %G eng %0 Journal Article %J International Journal of Parallel Programming %D 2005 %T The Component Structure of a Self-Adapting Numerical Software System %A Victor Eijkhout %A Erika Fuentes %A Thomas Eidson %A Jack Dongarra %K salsa %K sans %B International Journal of Parallel Programming %V 33 %8 2005-06 %G eng %0 Generic %D 2005 %T Condition Numbers of Gaussian Random Matrices %A Zizhong Chen %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Department Technical Report %V –04-539 %8 2005-00 %G eng %0 Journal Article %J SIAM Journal on Matrix Analysis and Applications (to appear) %D 2005 %T Condition Numbers of Gaussian Random Matrices %A Zizhong Chen %A Jack Dongarra %K ftmpi %K grads %K lacsi %K sans %B SIAM Journal on Matrix Analysis and Applications (to appear) %8 2005-01 %G eng %0 Journal Article %J International Journal of Computational Science and Engineering (to appear) %D 2005 %T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %B International Journal of Computational Science and Engineering (to appear) %8 2005-01 %G eng %0 Generic %D 2005 %T An Effective Empirical Search Method for Automatic Software Tuning %A 
Haihang You %A Keith Seymour %A Jack Dongarra %K gco %B ICL Technical Report %8 2005-01 %G eng %0 Conference Proceedings %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %D 2005 %T Fault Tolerant High Performance Computing by a Coding Approach %A Zizhong Chen %A Graham Fagg %A Edgar Gabriel %A Julien Langou %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %K grads %K lacsi %K sans %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %C Chicago, Illinois %8 2005-01 %G eng %0 Conference Proceedings %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %D 2005 %T Hash Functions for Datatype Signatures in MPI %A George Bosilca %A Jack Dongarra %A Graham Fagg %A Julien Langou %E Beniamino Di Martino %K ftmpi %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %I Springer-Verlag Berlin %C Sorrento (Naples), Italy %V 3666 %P 76-83 %8 2005-09 %G eng %0 Conference Proceedings %B Second Workshop on Productivity and Performance in High-End Computing (P-PHEC) at 11th International Symposium on High Performance Computer Architecture (HPCA-2005) %D 2005 %T Improving Time to Solution with Automated Performance Analysis %A Shirley Moore %A Felix Wolf %A Jack Dongarra %A Bernd Mohr %K kojak %B Second Workshop on Productivity and Performance in High-End Computing (P-PHEC) at 11th International Symposium on High Performance Computer Architecture (HPCA-2005) %C San Francisco %8 2005-02 %G eng %0 Journal Article %D 2005 %T Introduction to the HPC Challenge Benchmark Suite %A Piotr Luszczek %A Jack Dongarra %A David Koester %A Rolf Rabenseifner %A Bob Lucas %A Jeremy Kepner %A John McCalpin %A David Bailey %A Daisuke Takahashi %K hpcc %K hpcchallenge %8 2005-03 %G eng %0 Generic %D 2005 %T Introduction to the HPCChallenge Benchmark Suite %A Jack Dongarra 
%A Piotr Luszczek %K hpcc %K hpcchallenge %B ICL Technical Report %8 2005-01 %G eng %0 Journal Article %D 2005 %T LAPACK 2005 Prospectus: Reliable and Scalable Software for Linear Algebra Computations on High End Computers %A James Demmel %A Jack Dongarra %I LAPACK Working Note 164 %8 2005-01 %G eng %0 Journal Article %J Grid Computing and New Frontiers of High Performance Processing %D 2005 %T NetSolve: Grid Enabling Scientific Computing Environments %A Keith Seymour %A Asim YarKhan %A Sudesh Agrawal %A Jack Dongarra %E Lucio Grandinetti %K netsolve %B Grid Computing and New Frontiers of High Performance Processing %I Elsevier %8 2005-00 %G eng %0 Journal Article %J International Journal of Parallel Programming %D 2005 %T New Grid Scheduling and Rescheduling Methods in the GrADS Project %A Francine Berman %A Henri Casanova %A Andrew Chien %A Keith Cooper %A Holly Dail %A Anshuman Dasgupta %A Wei Deng %A Jack Dongarra %A Lennart Johnsson %A Ken Kennedy %A Charles Koelbel %A Bo Liu %A Xu Liu %A Anirban Mandal %A Gabriel Marin %A Mark Mazina %A John Mellor-Crummey %A Celso Mendes %A A. Olugbile %A Jignesh M. Patel %A Dan Reed %A Zhiao Shi %A Otto Sievert %A H. 
Xia %A Asim YarKhan %K grads %B International Journal of Parallel Programming %I Springer %V 33 %P 209-229 %8 2005-06 %G eng %0 Journal Article %J NCSA Access Online %D 2005 %T A Not So Simple Matter of Software %A Jack Dongarra %B NCSA Access Online %I NCSA %8 2005-00 %G eng %0 Conference Proceedings %B The International Conference on Computational Science %D 2005 %T Numerically Stable Real Number Codes Based on Random Matrices %A Zizhong Chen %A Jack Dongarra %K ftmpi %K grads %K lacsi %K sans %B The International Conference on Computational Science %I LNCS 3514, Springer-Verlag %C Atlanta, GA %8 2005-01 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems (submitted) %D 2005 %T Optimization Problem Solving System Using GridRPC %A Hisashi Shimosaka %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems (submitted) %8 2005-01 %G eng %0 Conference Proceedings %B Workshop on Patterns in High Performance Computing %D 2005 %T A Pattern-Based Approach to Automated Application Performance Analysis %A Nikhil Bhatia %A Shirley Moore %A Felix Wolf %A Jack Dongarra %A Bernd Mohr %K kojak %B Workshop on Patterns in High Performance Computing %C University of Illinois at Urbana-Champaign %8 2005-05 %G eng %0 Journal Article %J Cluster Computing Journal (to appear) %D 2005 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B Cluster Computing Journal (to appear) %8 2005-01 %G eng %0 Conference Proceedings %B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05) %D 2005 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B 4th International Workshop on Performance Modeling, 
Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05) %C Denver, Colorado %8 2005-04 %G eng %0 Generic %D 2005 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Julien Langou %K ft-la %B University of Tennessee Computer Science Department Technical Report, UT-CS-04-538 %8 2005-00 %G eng %0 Generic %D 2005 %T Remote Software Toolkit Installer %A Eric Meek %A Jeff Larkin %A Jack Dongarra %K rest %B ICL Technical Report %8 2005-06 %G eng %0 Conference Proceedings %B In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference %D 2005 %T A Scalable Approach to MPI Application Performance Analysis %A Shirley Moore %A Felix Wolf %A Jack Dongarra %A Sameer Shende %A Allen D. Malony %A Bernd Mohr %K kojak %B In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference %I Springer LNCS %8 2005-09 %G eng %0 Conference Proceedings %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %D 2005 %T Scalable Fault Tolerant MPI: Extending the Recovery Algorithm %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %E Beniamino Di Martino %K ftmpi %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %I Springer-Verlag Berlin %C Sorrento (Naples), Italy %V 3666 %P 67 %8 2005-09 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance %D 2005 %T Self Adaptivity in Grid Computing %A Sathish Vadhiyar %A Jack Dongarra %E John Gurd %E Anthony Hey %E Juri Papay %E Graham Riley %K netsolve %K sans %B Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance %V 17 %P 235-257 %8 2005-00 %G eng %0 Generic %D 2005 %T Towards an Accurate Model for Collective Communications %A 
Sathish Vadhiyar %A Graham Fagg %A Jack Dongarra %B ICL Technical Report %8 2005-01 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS 2004) %D 2004 %T Accurate Cache and TLB Characterization Using Hardware Counters %A Jack Dongarra %A Shirley Moore %A Phil Mucci %A Keith Seymour %A Haihang You %K gco %K lacsi %K papi %X We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity. Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions. %B International Conference on Computational Science (ICCS 2004) %I Springer %C Krakow, Poland %8 2004-06 %G eng %R https://doi.org/10.1007/978-3-540-24688-6_57 %0 Conference Proceedings %B 4th International Symposium on Cluster Computing and the Grid (CCGrid 2004)(submitted) %D 2004 %T Active Logistical State Management in the GridSolve/L %A Micah Beck %A Jack Dongarra %A Jian Huang %A Terry Moore %A James Plank %K netsolve %B 4th International Symposium on Cluster Computing and the Grid (CCGrid 2004)(submitted) %C Chicago, Illinois %8 2004-01 %G eng %0 Conference Proceedings %B 2004 International Conference on Parallel Processing (ICCP-04) %D 2004 %T An Algebra for Cross-Experiment Performance Analysis %A Fengguang Song %A Felix Wolf %A Nikhil Bhatia %A Jack Dongarra %A Shirley Moore %K kojak %B 2004 International Conference on Parallel Processing (ICCP-04) %C Montreal, Quebec, Canada %8 2004-08 %G eng %0 Generic %D 2004 %T An Asynchronous Algorithm on NetSolve Global Computing System %A Nahid Emad %A S. A. 
Shahzadeh Fazeli %A Jack Dongarra %K netsolve %B PRiSM - Laboratoire de recherche en informatique, Université de Versailles St-Quentin Technical Report %8 2004-03 %G eng %0 Conference Paper %B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004) %D 2004 %T Automatic Blocking of QR and LU Factorizations for Locality %A Qing Yi %A Ken Kennedy %A Haihang You %A Keith Seymour %A Jack Dongarra %K gco %K papi %K sans %X QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Although linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, automatically generating blocked versions of the computations offers further benefits, such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures. 
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004) %I ACM %C Washington, DC %8 2004-06 %G eng %R 10.1145/1065895.1065898 %0 Conference Paper %B 5th LCI International Conference on Linux Clusters: The HPC Revolution %D 2004 %T Automating the Large-Scale Collection and Analysis of Performance %A Phil Mucci %A Jack Dongarra %A Rick Kufrin %A Shirley Moore %A Fengguang Song %A Felix Wolf %K kojak %K papi %B 5th LCI International Conference on Linux Clusters: The HPC Revolution %C Austin, Texas %8 2004-05 %G eng %0 Journal Article %J International Journal of High Performance Applications and Supercomputing (to appear) %D 2004 %T Building and using a Fault Tolerant MPI implementation %A Graham Fagg %A Jack Dongarra %K ftmpi %K lacsi %K sans %B International Journal of High Performance Applications and Supercomputing (to appear) %8 2004-00 %G eng %0 Journal Article %J Oak Ridge National Laboratory Report %D 2004 %T Cray X1 Evaluation Status Report %A Pratul Agarwal %A R. A. Alexander %A E. Apra %A Satish Balay %A Arthur S. Bland %A James Colgan %A Eduardo D'Azevedo %A Jack Dongarra %A Tom Dunigan %A Mark Fahey %A Al Geist %A M. Gordon %A Robert Harrison %A Dinesh Kaushik %A M. Krishnakumar %A Piotr Luszczek %A Tony Mezzacapa %A Jeff Nichols %A Jarek Nieplocha %A Leonid Oliker %A T. Packwood %A M. Pindzola %A Thomas C. Schulthess %A Jeffrey Vetter %A James B White %A T. Windus %A Patrick H. Worley %A Thomas Zacharia %B Oak Ridge National Laboratory Report %V /-2004/13 %8 2004-01 %G eng %0 Conference Proceedings %B International Conference on Computational Science %D 2004 %T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations %A Piotr Luszczek %A Jack Dongarra %E Marian Bubak %E Geert Dick van Albada %E Peter M. 
Sloot %E Jack Dongarra %K lacsi %K lfc %B International Conference on Computational Science %I Springer Verlag %C Poland %8 2004-06 %G eng %R 10.1007/978-3-540-25944-2_35 %0 Conference Proceedings %B Proceedings of Euro-Par 2004 %D 2004 %T Efficient Pattern Search in Large Traces through Successive Refinement %A Felix Wolf %A Bernd Mohr %A Jack Dongarra %A Shirley Moore %K kojak %B Proceedings of Euro-Par 2004 %I Springer-Verlag %C Pisa, Italy %8 2004-08 %G eng %0 Conference Proceedings %B Proceedings of ISC2004 (to appear) %D 2004 %T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems %A Graham Fagg %A Edgar Gabriel %A George Bosilca %A Thara Angskun %A Zizhong Chen %A Jelena Pjesivac–Grbovic %A Kevin London %A Jack Dongarra %K ftmpi %K lacsi %B Proceedings of ISC2004 (to appear) %C Heidelberg, Germany %8 2004-06 %G eng %0 Conference Proceedings %B IPDPS 2004, NGS Workshop (to appear) %D 2004 %T Improvements in the Efficient Composition of Applications %A Thomas Eidson %A Victor Eijkhout %A Jack Dongarra %K salsa %K sans %B IPDPS 2004, NGS Workshop (to appear) %C Santa Fe %8 2004-00 %G eng %0 Conference Proceedings %B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04) %D 2004 %T LAPACK for Clusters Project: An Example of Self Adapting Numerical Software %A Zizhong Chen %A Jack Dongarra %A Piotr Luszczek %A Kenneth Roche %K lacsi %K lfc %B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04) %C Big Island, Hawaii %V 9 %P 90282 %8 2004-01 %G eng %0 Generic %D 2004 %T NetBuild: Automated Installation and Use of Network-Accessible Software Libraries %A Keith Moore %A Jack Dongarra %A Shirley Moore %A Eric Grosse %K netbuild %B ICL Technical Report %8 2004-01 %G eng %0 Generic %D 2004 %T Numerically Stable Real-Number Codes Based on Random Matrices %A Zizhong Chen %A Jack Dongarra %K ftmpi %B University of Tennessee Computer Science 
Department Technical Report %V –04-526 %8 2004-10 %G eng %0 Journal Article %J Engineering the Grid (to appear) %D 2004 %T An Overview of Heterogeneous High Performance and Grid Computing %A Jack Dongarra %A Alexey Lastovetsky %E Beniamino Di Martino %E Jack Dongarra %E Adolfy Hoisie %E Laurence Yang %E Hans Zima %B Engineering the Grid (to appear) %I Nova Science Publishers, Inc. %8 2004-00 %G eng %0 Generic %D 2004 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Department Technical Report, CS-89-85 %8 2004-01 %G eng %0 Journal Article %J International Journal for High Performance Applications and Supercomputing (to appear) %D 2004 %T Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing %A Graham Fagg %A Edgar Gabriel %A Zizhong Chen %A Thara Angskun %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %K lacsi %B International Journal for High Performance Applications and Supercomputing (to appear) %8 2004-04 %G eng %0 Generic %D 2004 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Julien Langou %B ICL Technical Report %8 2004-01 %G eng %0 Conference Proceedings %B IEEE Proceedings (to appear) %D 2004 %T Self Adapting Linear Algebra Algorithms and Software %A James Demmel %A Jack Dongarra %A Victor Eijkhout %A Erika Fuentes %A Antoine Petitet %A Rich Vuduc %A Clint Whaley %A Katherine Yelick %K salsa %K sans %B IEEE Proceedings (to appear) %8 2004-00 %G eng %0 Journal Article %J International Journal of High Performance Applications, Special Issue: Automatic Performance Tuning %D 2004 %T Towards an Accurate Model for Collective Communications %A Sathish Vadhiyar %A Graham Fagg %A Jack Dongarra %K lacsi %B International Journal of High Performance Applications, Special Issue: Automatic Performance Tuning %V 18 
%P 159-167 %8 2004-01 %G eng %0 Journal Article %J The Computer Journal %D 2004 %T Trends in High Performance Computing %A Jack Dongarra %B The Computer Journal %I The British Computer Society %V 47 %P 399-403 %8 2004-00 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2004 %T The Virtual Instrument: Support for Grid-enabled Scientific Simulations %A Henri Casanova %A Thomas Bartol %A Francine Berman %A Adam Birnbaum %A Jack Dongarra %A Mark Ellisman %A Marcio Faerman %A Erhan Gockay %A Michelle Miller %A Graziano Obertelli %A Stuart Pomerantz %A Terry Sejnowski %A Joel Stiles %A Rich Wolski %B International Journal of High Performance Computing Applications %V 18 %P 3-17 %8 2004-01 %G eng %0 Conference Proceedings %B IPDPS 2003, Workshop on NSF-Next Generation Software %D 2003 %T Applying Aspect-Oriented Programming Concepts to a Component-based Programming Model %A Thomas Eidson %A Jack Dongarra %A Victor Eijkhout %K salsa %K sans %B IPDPS 2003, Workshop on NSF-Next Generation Software %C Nice, France %8 2003-03 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2003 %T Automatic Translation of Fortran to JVM Bytecode %A Keith Seymour %A Jack Dongarra %K f2j %B Concurrency and Computation: Practice and Experience %V 15 %P 202-207 %8 2003-00 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T Computational Science — ICCS 2003 %A Peter M. Sloot %A David Abramson %A Alexander V. Bogdanov %A Jack Dongarra %A Albert Zomaya %A Yuriy Gorbachev %B Lecture Notes in Computer Science %I Springer-Verlag, Berlin %C ICCS 2003, International Conference. 
Melbourne, Australia %V 2657-2660 %8 2003-06 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T Distributed Probabilistic Model-Building Genetic Algorithm %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Masaki Sano %A Hisashi Shimosaka %A Shigeyoshi Tsutsui %A Jack Dongarra %B Lecture Notes in Computer Science %I Springer-Verlag, Heidelberg %V 2723 %P 1015-1028 %8 2003-01 %G eng %0 Journal Article %J Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted) %D 2003 %T Energy Minimization of Protein Tertiary Structure by Parallel Simulated Annealing using Genetic Crossover %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Shinya Ogura %A Keiko Aoi %A Takeshi Yoshida %A Yuko Okamoto %A Jack Dongarra %B Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted) %8 2003-03 %G eng %0 Journal Article %J Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting %D 2003 %T Evaluating The Performance Of MPI-2 Dynamic Communicators And One-Sided Communication %A Edgar Gabriel %A Graham Fagg %A Jack Dongarra %K ftmpi %B Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting %I Springer-Verlag, Berlin %C Venice, Italy %V 2840 %P 88-97 %8 2003-09 %G eng %0 Conference Paper %B PADTAD Workshop, IPDPS 2003 %D 2003 %T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters %A Jack Dongarra %A Kevin London %A Shirley Moore %A Phil Mucci %A Dan Terpstra %A Haihang You %A Min Zhou %K lacsi %K papi %X The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. 
This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI. %B PADTAD Workshop, IPDPS 2003 %I IEEE %C Nice, France %8 2003-04 %@ 0-7695-1926-1 %G eng %0 Conference Proceedings %B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented) %D 2003 %T Fault Tolerant Communication Library and Applications for High Performance Computing %A Graham Fagg %A Edgar Gabriel %A Zizhong Chen %A Thara Angskun %A George Bosilca %A Antonin Bukovsky %A Jack Dongarra %K ftmpi %K lacsi %B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented) %C Santa Fe, NM %8 2003-10 %G eng %0 Conference Proceedings %B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science %D 2003 %T A Fault-Tolerant Communication Library for Grid Environments %A Edgar Gabriel %A Graham Fagg %A Antonin Bukovsky %A Thara Angskun %A Jack Dongarra %K ftmpi %K lacsi %B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science %C San Francisco %8 2003-06 %G eng %0 Generic %D 2003 %T Finite-choice Algorithm Optimization in Conjugate Gradients (LAPACK Working Note 159) %A Jack Dongarra %A Victor Eijkhout %B University of Tennessee Computer Science Technical Report, UT-CS-03-502 %8 2003-01 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing (submitted) %D 2003 %T GrADSolve - A Grid-based RPC System for Remote Invocation of Parallel Software %A Sathish Vadhiyar %A Jack Dongarra %K grads %B Journal of Parallel and Distributed Computing (submitted) %8 
2003-03 %G eng %0 Conference Proceedings %B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference %D 2003 %T GrADSolve - RPC for High Performance Computing on the Grid %A Sathish Vadhiyar %A Jack Dongarra %A Asim YarKhan %E Harald Kosch %E Laszlo Boszormenyi %E Hermann Hellwagner %K netsolve %B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference %I Springer-Verlag, Berlin %C Klagenfurt, Austria %V 2790 %P 394-403 %8 2003-01 %G eng %R 10.1007/978-3-540-45209-6_58 %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T High Performance Computing for Computational Science %A Jose Palma %A Jack Dongarra %A Vicente Hernández %E Antonio Augusto Sousa %B Lecture Notes in Computer Science %I Springer-Verlag, Berlin %C VECPAR 2002, 5th International Conference June 26-28, 2002 %V 2565 %8 2003-01 %G eng %0 Conference Proceedings %B Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC %D 2003 %T High Performance Computing Trends and Self Adapting Numerical Software %A Jack Dongarra %B Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC %I Springer-Verlag, Heidelberg %C Tokyo-Odaiba, Japan %V 2858 %P 1-9 %8 2003-01 %G eng %0 Conference Proceedings %B Information Processing Society of Japan Symposium Series %D 2003 %T High Performance Computing Trends, Supercomputers, Clusters, and Grids %A Jack Dongarra %B Information Processing Society of Japan Symposium Series %V 2003 %P 55-58 %8 2003-01 %G eng %0 Journal Article %J Making the Global Infrastructure a Reality %D 2003 %T NetSolve: Past, Present, and Future - A Look at a Grid Enabled Server %A Sudesh Agrawal %A Jack Dongarra %A Keith Seymour %A Sathish Vadhiyar %E Francine Berman %E Geoffrey Fox %E Anthony Hey %K netsolve %B Making the Global Infrastructure a Reality %I Wiley Publishing %8 2003-00 %G eng %0 Conference Proceedings %B Information 
Processing Society of Japan Symposium Series %D 2003 %T Optimization of Injection Schedule of Diesel Engine Using GridRPC %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Junji Sawada %A Jack Dongarra %B Information Processing Society of Japan Symposium Series %V 2003 %P 189-197 %8 2003-01 %G eng %0 Conference Proceedings %B 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid %D 2003 %T Optimization Problem Solving System using Grid RPC %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Hisashi Shimosaka %A Jack Dongarra %B 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid %C Tokyo, Japan %8 2003-03 %G eng %0 Conference Proceedings %B Proceedings of the IPDPS 2003, NGS Workshop %D 2003 %T Optimizing Performance and Reliability in Distributed Computing Systems Through Wide Spectrum Storage %A James Plank %A Micah Beck %A Jack Dongarra %A Rich Wolski %A Henri Casanova %B Proceedings of the IPDPS 2003, NGS Workshop %C Nice, France %P 209 %8 2003-01 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2003 %T A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures %A Greg Henry %A David Watkins %A Jack Dongarra %B SIAM Journal on Scientific Computing %V 24 %P 284-311 %8 2003-01 %G eng %0 Conference Paper %B ICCS 2003 Terascale Workshop %D 2003 %T Performance Instrumentation and Measurement for Terascale Systems %A Jack Dongarra %A Allen D. Malony %A Shirley Moore %A Phil Mucci %A Sameer Shende %K papi %X As computer systems grow in size and complexity, tool support is needed to facilitate the efficient mapping of large-scale applications onto these systems. To help achieve this mapping, performance analysis tools must provide robust performance observation capabilities at all levels of the system, as well as map low-level behavior to high-level program constructs. 
Instrumentation and measurement strategies, developed over the last several years, must evolve together with performance analysis infrastructure to address the challenges of new scalable parallel systems. %B ICCS 2003 Terascale Workshop %I Springer, Berlin, Heidelberg %C Melbourne, Australia %8 2003-06 %G eng %R https://doi.org/10.1007/3-540-44864-0_6 %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T Recent Advances in Parallel Virtual Machine and Message Passing Interface %A Jack Dongarra %A Domenico Laforenza %A S. Orlando %B Lecture Notes in Computer Science %I Springer-Verlag, Berlin %V 2840 %8 2003-01 %G eng %0 Conference Proceedings %B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles %D 2003 %T Scalable, Trustworthy Network Computing Using Untrusted Intermediaries: A Position Paper %A Micah Beck %A Jack Dongarra %A Victor Eijkhout %A Mike Langston %A Terry Moore %A James Plank %K netsolve %B DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles %C National Conference Center - Landsdowne, Virginia %8 2003-03 %G eng %0 Journal Article %J Resource Management in the Grid %D 2003 %T Scheduling in the Grid Application Development Software Project %A Holly Dail %A Otto Sievert %A Francine Berman %A Henri Casanova %A Asim YarKhan %A Sathish Vadhiyar %A Jack Dongarra %A Chuang Liu %A Lingyun Yang %A Dave Angulo %A Ian Foster %K grads %B Resource Management in the Grid %I Kluwer Publishers %8 2003-03 %G eng %0 Journal Article %J Concurrency: Practice and Experience (submitted) %D 2003 %T Self Adaptability in Grid Computing %A Sathish Vadhiyar %A Jack Dongarra %K sans %B Concurrency: Practice and Experience (submitted) %8 2003-03 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2003 %T Self Adapting Numerical Algorithm for Next Generation Applications %A Jack Dongarra %A Victor Eijkhout %K lacsi %K sans %B 
International Journal of High Performance Computing Applications %V 17 %P 125-132 %8 2003-01 %G eng %0 Journal Article %J Parallel Computing %D 2003 %T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters %A Zizhong Chen %A Jack Dongarra %A Piotr Luszczek %A Kenneth Roche %K lacsi %K lfc %K sans %B Parallel Computing %V 29 %P 1723-1743 %8 2003-11 %G eng %0 Generic %D 2003 %T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160) %A Zizhong Chen %A Jack Dongarra %A Piotr Luszczek %A Kenneth Roche %K lacsi %B University of Tennessee Computer Science Technical Report, UT-CS-03-499 %8 2003-01 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T Self-Adapting Numerical Software and Automatic Tuning of Heuristics %A Jack Dongarra %A Victor Eijkhout %K salsa %K sans %B Lecture Notes in Computer Science %I Springer Verlag %C Melbourne, Australia %V 2660 %P 759-770 %8 2003-06 %G eng %0 Journal Article %J Statistical Data Mining and Knowledge Discovery %D 2003 %T The Semantic Conference Organizer %A Kevin Heinrich %A Michael Berry %A Jack Dongarra %A Sathish Vadhiyar %E Hamparsum Bozdogan %K netsolve %B Statistical Data Mining and Knowledge Discovery %I CRC Press %8 2003-00 %G eng %0 Conference Proceedings %B ClusterWorld Conference and Expo %D 2003 %T A Simple Installation and Administration Tool for Large-scaled PC Cluster System %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Kenzo Kodama %A Junichi Uekawa %A Jack Dongarra %B ClusterWorld Conference and Expo %C San Jose, CA %8 2003-03 %G eng %0 Journal Article %J Parallel Processing Letters %D 2003 %T SRS - A Framework for Developing Malleable and Migratable Parallel Software %A Sathish Vadhiyar %A Jack Dongarra %K grads %B Parallel Processing Letters %V 13 %P 291-312 %8 2003-06 %G eng %0 Conference Proceedings %B Information Processing Society of Japan Symposium Series %D 2003 %T Static Scheduling for ScaLAPACK on the Grid Using Genetic 
Algorithm %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Hiroki Saito %A Yusuke Tanimura %A Jack Dongarra %B Information Processing Society of Japan Symposium Series %V 2003 %P 3-10 %8 2003-01 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T VisPerf: Monitoring Tool for Grid Computing %A DongWoo Lee %A Jack Dongarra %E R. S. Ramakrishna %K netsolve %B Lecture Notes in Computer Science %I Springer Verlag, Heidelberg %V 2659 %P 233-243 %8 2003-00 %G eng %0 Journal Article %J Journal of Digital Information special issue on Interactivity in Digital Libraries %D 2002 %T Active Netlib: An Active Mathematical Software Collection for Inquiry-based Computational Science and Engineering Education %A Shirley Moore %A A.J. Baker %A Jack Dongarra %A Christian Halloy %A Chung Ng %K activenetlib %K rib %B Journal of Digital Information special issue on Interactivity in Digital Libraries %V 2 %8 2002-00 %G eng %0 Journal Article %J International Journal of Supercomputer Applications and High-Performance Computing %D 2002 %T Adaptive Scheduling for Task Farming with Grid Middleware %A Henri Casanova %A Myung Ho Kim %A James Plank %A Jack Dongarra %B International Journal of Supercomputer Applications and High-Performance Computing %V 13 %P 231-240 %8 2002-10 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Computing %D 2002 %T Algorithmic Redistribution Methods for Block Cyclic Decompositions %A Antoine Petitet %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Computing %V 10 %P 201-220 %8 2002-10 %G eng %0 Journal Article %J EuroPar 2002 %D 2002 %T Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load %A Javier Cuenca %A Domingo Giminez %A José González %A Jack Dongarra %A Kenneth Roche %B EuroPar 2002 %C Paderborn, Germany %8 2002-08 %G eng %0 Journal Article %J SIAM News %D 2002 %T Biannual Top-500 Computer Lists Track Changing Environments for Scientific Computing %A Jack Dongarra 
%A Hans Meuer %A Horst D. Simon %A Erich Strohmaier %K top500 %B SIAM News %V 34 %8 2002-10 %G eng %0 Journal Article %J Parallel and Distributed Computing Practices %D 2002 %T A Comparison of Parallel Solvers for General Narrow Banded Linear Systems %A Peter Arbenz %A Andrew Cleary %A Jack Dongarra %A Markus Hegland %B Parallel and Distributed Computing Practices %V 2 %P 385-400 %8 2002-10 %G eng %0 Conference Proceedings %B Parallel Computing: Advances and Current Issues:Proceedings of the International Conference ParCo2001 %D 2002 %T Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion %A Kenneth Roche %A Jack Dongarra %E Gerhard R. Joubert %E Almerica Murli %E Frans Peters %E Marco Vanneschi %K lfc %K sans %B Parallel Computing: Advances and Current Issues:Proceedings of the International Conference ParCo2001 %I Imperial College Press %C London, England %8 2002-01 %G eng %0 Conference Proceedings %B Grid Computing - GRID 2002, Third International Workshop %D 2002 %T Experiments with Scheduling Using Simulated Annealing in a Grid Environment %A Asim YarKhan %A Jack Dongarra %E Manish Parashar %K grads %B Grid Computing - GRID 2002, Third International Workshop %I Springer %C Baltimore, MD %V 2536 %P 232-242 %8 2002-11 %G eng %0 Generic %D 2002 %T GridRPC: A Remote Procedure Call API for Grid Computing %A Keith Seymour %A Hidemoto Nakada %A Satoshi Matsuoka %A Jack Dongarra %A Craig Lee %A Henri Casanova %B ICL Technical Report %8 2002-11 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2002 %T HARNESS Fault Tolerant MPI Design, Usage and Performance Issues %A Graham Fagg %A Jack Dongarra %B Future Generation Computer Systems %V 18 %P 1127-1142 %8 2002-01 %G eng %0 Journal Article %J Concurrency: Practice and Experience %D 2002 %T Innovations of the NetSolve Grid Computing System %A Dorian Arnold %A Henri Casanova %A Jack Dongarra %K netsolve %B Concurrency: Practice and Experience %V 14 %P 1457-1479 %8 
2002-01 %G eng %0 Journal Article %J Scientific Programming (to appear) %D 2002 %T An Iterative Solver Benchmark %A Jack Dongarra %A Victor Eijkhout %A Henk van der Vorst %B Scientific Programming (to appear) %8 2002-00 %G eng %0 Journal Article %J Scientific Programming %D 2002 %T JLAPACK - Compiling LAPACK Fortran to Java %A David Doolin %A Jack Dongarra %A Keith Seymour %K f2j %B Scientific Programming %V 7 %P 111-138 %8 2002-10 %G eng %0 Journal Article %J Parallel Computing %D 2002 %T The Marketplace for High-Performance Computers %A Erich Strohmaier %A Jack Dongarra %A Hans Meuer %A Horst D. Simon %B Parallel Computing %V 25 %P 1517-1545 %8 2002-10 %G eng %0 Conference Proceedings %B Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002) %D 2002 %T A Metascheduler For The Grid %A Sathish Vadhiyar %A Jack Dongarra %K grads %B Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002) %I IEEE Computer Society %C Edinburgh, Scotland %P 343-351 %8 2002-07 %G eng %0 Journal Article %J Parallel Computing %D 2002 %T Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B Parallel Computing %V 28 %P 1773-1788 %8 2002-08 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience, Special Issue: Grid Computing Environments %D 2002 %T NetBuild: Transparent Cross-Platform Access to Computational Software Libraries %A Keith Moore %A Jack Dongarra %K netbuild %B Concurrency and Computation: Practice and Experience, Special Issue: Grid Computing Environments %V 14 %P 1445-1456 %8 2002-11 %G eng %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2002 %T Numerical Libraries and Tools for 
Scalable Parallel Cluster Computing %A Shirley Browne %A Jack Dongarra %A Anne Trefethen %B International Journal of High Performance Applications and Supercomputing %V 15 %P 175-180 %8 2002-10 %G eng %0 Journal Article %J Meeting of the Japan Society of Mechanical Engineers %D 2002 %T Optimization System Using Grid RPC %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Hisashi Shimosaka %A Yusuke Tanimura %A Jack Dongarra %B Meeting of the Japan Society of Mechanical Engineers %C Kyoto University, Kyoto, Japan %8 2002-10 %G eng %0 Conference Proceedings %B Proceedings of the Third International Workshop on Grid Computing %D 2002 %T Overview of GridRPC: A Remote Procedure Call API for Grid Computing %A Keith Seymour %A Hidemoto Nakada %A Satoshi Matsuoka %A Jack Dongarra %A Craig Lee %A Henri Casanova %E Manish Parashar %B Proceedings of the Third International Workshop on Grid Computing %P 274-278 %8 2002-01 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2002 %T A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures %A Greg Henry %A David Watkins %A Jack Dongarra %B SIAM Journal on Scientific Computing %V 16 %P 284-311 %8 2002-10 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2002 %T Parallelizing the Divide and Conquer Algorithm for the Symmetric Tridiagonal Eigenvalue Problem on Distributed Memory Architectures %A Francoise Tisseur %A Jack Dongarra %B SIAM Journal on Scientific Computing %V 6 %P 2223-2236 %8 2002-10 %G eng %0 Generic %D 2002 %T Self-adapting Numerical Software for Next Generation Applications (LAPACK Working Note 157) %A Jack Dongarra %A Victor Eijkhout %K salsa %K sans %B ICL Technical Report %8 2002-00 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2002 %T Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments %A Henri Casanova %A Michael G. 
Thomason %A Jack Dongarra %B Journal of Parallel and Distributed Computing %V 98 %P 68-91 %8 2002-10 %G eng %0 Conference Proceedings %B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops %D 2002 %T Toward a Framework for Preparing and Executing Adaptive Grid Programs %A Ken Kennedy %A John Mellor-Crummey %A Keith Cooper %A Linda Torczon %A Francine Berman %A Andrew Chien %A Dave Angulo %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A Carl Kesselman %A Jack Dongarra %A Sathish Vadhiyar %K grads %B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops %C Fort Lauderdale, FL %P 0171 %8 2002-04 %G eng %0 Journal Article %J Meeting of the Japan Society of Mechanical Engineers %D 2002 %T Truss Structural Optimization Using NetSolve System %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Hisashi Shimosaka %A Masaki Sano %A Yusuke Tanimura %A Yasunari Mimura %A Shinobu Yoshimura %A Jack Dongarra %K netsolve %B Meeting of the Japan Society of Mechanical Engineers %C Kyoto University, Kyoto, Japan %8 2002-10 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software %D 2002 %T An Updated Set of Basic Linear Algebra Subprograms (BLAS) %A Susan Blackford %A James Demmel %A Jack Dongarra %A Iain Duff %A Sven Hammarling %A Greg Henry %A Michael Heroux %A Linda Kaufman %A Andrew Lumsdaine %A Antoine Petitet %A Roldan Pozo %A Karin Remington %A Clint Whaley %B ACM Transactions on Mathematical Software %V 28 %P 135-151 %8 2002-12 %G eng %R 10.1145/567806.567807 %0 Generic %D 2002 %T Users' Guide to NetSolve v1.4.1 %A Sudesh Agrawal %A Dorian Arnold %A Susan Blackford %A Jack Dongarra %A Michelle Miller %A Kiran Sagi %A Zhiao Shi %A Keith Seymour %A Sathish Vadhiyar %K netsolve %B ICL Technical Report %8 2002-06 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing (submitted) %D 2002 %T The Virtual Instrument: Support for Grid-enabled Scientific Simulations %A Henri Casanova %A Thomas Bartol 
%A Francine Berman %A Adam Birnbaum %A Jack Dongarra %A Mark Ellisman %A Marcio Faerman %A Erhan Gockay %A Michelle Miller %A Graziano Obertelli %A Stuart Pomerantz %A Terry Sejnowski %A Joel Stiles %A Rich Wolski %B Journal of Parallel and Distributed Computing (submitted) %8 2002-10 %G eng %0 Journal Article %J Parallel Computing %D 2001 %T Automated Empirical Optimization of Software and the ATLAS Project %A Clint Whaley %A Antoine Petitet %A Jack Dongarra %K atlas %B Parallel Computing %V 27 %P 3-25 %8 2001-01 %G eng %0 Conference Proceedings %B Joint ACM Java Grande - ISCOPE 2001 Conference (submitted) %D 2001 %T Automatic Translation of Fortran to JVM Bytecode %A Keith Seymour %A Jack Dongarra %K f2j %B Joint ACM Java Grande - ISCOPE 2001 Conference (submitted) %C Stanford University, California %8 2001-06 %G eng %0 Journal Article %J (an update), submitted to ACM TOMS %D 2001 %T Basic Linear Algebra Subprograms (BLAS) %A Susan Blackford %A James Demmel %A Jack Dongarra %A Iain Duff %A Sven Hammarling %A Greg Henry %A Michael Heroux %A Linda Kaufman %A Andrew Lumsdaine %A Antoine Petitet %A Roldan Pozo %A Karin Remington %A Clint Whaley %B (an update), submitted to ACM TOMS %8 2001-02 %G eng %0 Journal Article %J Parallel Processing Letters %D 2001 %T On the Convergence of Computational and Data Grids %A Dorian Arnold %A Sathish Vadhiyar %A Jack Dongarra %K netsolve %B Parallel Processing Letters %V 11 %P 187-202 %8 2001-01 %G eng %0 Conference Paper %B International Conference on Parallel and Distributed Computing Systems %D 2001 %T End-user Tools for Application Performance Analysis, Using Hardware Counters %A Kevin London %A Jack Dongarra %A Shirley Moore %A Phil Mucci %A Keith Seymour %A T. Spencer %K papi %X One purpose of the end-user tools described in this paper is to give users a graphical representation of performance information that has been gathered by instrumenting an application with the PAPI library. 
PAPI is a project that specifies a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events", which are occurrences of specific signals and states related to a processor’s function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. The perfometer tool developed by the PAPI project provides a graphical view of this information, allowing users to quickly see where performance bottlenecks are in their application. Only one function call has to be added by the user to their program to take advantage of perfometer. This makes it quick and simple to add and remove instrumentation from a program. Also, perfometer allows users to change the "event" they are monitoring. Add the ability to monitor parallel applications, set alarms and a Java front-end that can run anywhere, and this gives the user a powerful tool for quickly discovering where and why a bottleneck exists. A number of third-party tools for analyzing performance of message-passing and/or threaded programs have also incorporated support for PAPI so as to be able to display and analyze hardware counter data from their interfaces. %B International Conference on Parallel and Distributed Computing Systems %C Dallas, TX %8 2001-08 %G eng %0 Conference Proceedings %B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science %D 2001 %T Fault Tolerant MPI for the HARNESS Meta-Computing System %A Graham Fagg %A Antonin Bukovsky %A Jack Dongarra %E Benjoe A. Juliano %E R. Renner %E K. 
Tan %K ftmpi %K harness %B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science %I Springer Verlag %C Berlin %V 2073 %P 355-366 %8 2001-00 %G eng %R 10.1007/3-540-45545-0_44 %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2001 %T The GrADS Project: Software Support for High-Level Grid Application Development %A Francine Berman %A Andrew Chien %A Keith Cooper %A Jack Dongarra %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A Ken Kennedy %A Carl Kesselman %A John Mellor-Crummey %A Dan Reed %A Linda Torczon %A Rich Wolski %K grads %B International Journal of High Performance Applications and Supercomputing %V 15 %P 327-344 %8 2001-01 %G eng %0 Conference Proceedings %B Proceedings of the High Performance Computing Symposium (HPC 2001) in 2001 Advanced Simulation Technologies Conference %D 2001 %T Grid-Enabling Problem Solving Environments: A Case Study of SCIRUN and NetSolve %A Michelle Miller %A Christopher Moulding %A Jack Dongarra %A Christopher Johnson %K netsolve %B Proceedings of the High Performance Computing Symposium (HPC 2001) in 2001 Advanced Simulation Technologies Conference %I Society for Modeling and Simulation International %C Seattle, Washington %8 2001-04 %G eng %0 Journal Article %J Parallel Computing %D 2001 %T HARNESS and Fault Tolerant MPI %A Graham Fagg %A Antonin Bukovsky %A Jack Dongarra %B Parallel Computing %V 27 %P 1479-1496 %8 2001-01 %G eng %0 Journal Article %J HERMIS %D 2001 %T High Performance Computing Trends %A Jack Dongarra %A Hans Meuer %A Horst D. 
Simon %A Erich Strohmaier %B HERMIS %V 2 %P 155-163 %8 2001-11 %G eng %0 Journal Article %J Scientific Programming %D 2001 %T Iterative Solver Benchmark (LAPACK Working Note 152) %A Jack Dongarra %A Victor Eijkhout %A Henk van der Vorst %B Scientific Programming %V 9 %P 223-231 %8 2001-00 %G eng %0 Journal Article %J submitted to SC2001 %D 2001 %T Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B submitted to SC2001 %C Denver, Colorado %8 2001-11 %G eng %0 Journal Article %J SIAM Review (book review) %D 2001 %T Measuring Computer Performance: A Practitioner's Guide %A Jack Dongarra %B SIAM Review (book review) %V 43 %P 383-384 %8 2001-00 %G eng %0 Generic %D 2001 %T NetBuild %A Keith Moore %A Jack Dongarra %K netbuild %B University of Tennessee Computer Science Technical Report %8 2001-01 %G eng %0 Conference Proceedings %B 2001 High Performance Computing Symposium (HPC'01), part of the Advance Simulation Technologies Conference %D 2001 %T Network-Enabled Server Systems: Deploying Scientific Simulations on the Grid %A Henri Casanova %A Satoshi Matsuoka %A Jack Dongarra %B 2001 High Performance Computing Symposium (HPC'01), part of the Advance Simulation Technologies Conference %C Seattle, Washington %8 2001-04 %G eng %0 Journal Article %J SIAM News %D 2001 %T Network-Enabled Solvers: A Step Toward Grid-Based Computing %A Jack Dongarra %B SIAM News %V 34 %8 2001-12 %G eng %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2001 %T Numerical Libraries and The Grid %A Antoine Petitet %A Susan Blackford %A Jack Dongarra %A Brett Ellis %A Graham Fagg %A Kenneth Roche %A Sathish Vadhiyar %K grads %B International Journal of High Performance Applications and 
Supercomputing %V 15 %P 359-374 %8 2001-01 %G eng %0 Generic %D 2001 %T Numerical Libraries and The Grid: The Grads Experiments with ScaLAPACK %A Antoine Petitet %A Susan Blackford %A Jack Dongarra %A Brett Ellis %A Graham Fagg %A Kenneth Roche %A Sathish Vadhiyar %K grads %K scalapack %B University of Tennessee Computer Science Technical Report %8 2001-01 %G eng %0 Journal Article %J International Journal of High Performance Applications and Supercomputing %D 2001 %T Numerical Libraries and Tools for Scalable Parallel Cluster Computing %A Jack Dongarra %A Shirley Moore %A Anne Trefethen %B International Journal of High Performance Applications and Supercomputing %V 15 %P 175-180 %8 2001-01 %G eng %0 Journal Article %J Handbook of Massive Data Sets %D 2001 %T Overview of High Performance Computers %A Aad J. van der Steen %A Jack Dongarra %E James Abello %E Panos Pardalos %E Mauricio Resende %B Handbook of Massive Data Sets %I Kluwer Academic Publishers %P 791-852 %8 2001-01 %G eng %0 Conference Proceedings %B LACSI Symposium 2001 %D 2001 %T Performance Modeling for Self Adapting Collective Communications for MPI %A Sathish Vadhiyar %A Graham Fagg %A Jack Dongarra %K ftmpi %B LACSI Symposium 2001 %C Santa Fe, NM %8 2001-10 %G eng %0 Generic %D 2001 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2001-01 %G eng %0 Journal Article %J Computing in Science and Engineering %D 2001 %T The Quest for Petascale Computing %A Jack Dongarra %A David W. 
Walker %B Computing in Science and Engineering %V 3 %P 32-39 %8 2001-05 %G eng %0 Journal Article %J Scientific Programming %D 2001 %T Recursive Approach in Sparse Matrix LU Factorization %A Jack Dongarra %A Victor Eijkhout %A Piotr Luszczek %B Scientific Programming %V 9 %P 51-60 %8 2001-00 %G eng %0 Journal Article %J European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131 %D 2001 %T Review of Performance Analysis Tools for MPI Parallel Programs %A Shirley Moore %A David Cronk %A Kevin London %A Jack Dongarra %K papi %X In order to produce MPI applications that perform well on today’s parallel architectures, programmers need effective tools for collecting and analyzing performance data. A variety of such tools, both commercial and research, are becoming available. This paper reviews and evaluates the available cross-platform MPI performance analysis tools. %B European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131 %I Springer Verlag, Berlin %C Greece %P 241-248 %8 2001-09 %G eng %R https://doi.org/10.1007/3-540-45417-9_34 %0 Generic %D 2001 %T RIBAPI - Repository in a Box Application Programmer's Interface %A Jeremy Millar %A Paul McMahan %A Jack Dongarra %K rib %B University of Tennessee Computer Science Technical Report %8 2001-00 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2001 %T Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries %A Ken Kennedy %A Bradley Broom %A Keith Cooper %A Jack Dongarra %A Rob Fowler %A Dennis Gannon %A Lennart Johnsson %A John Mellor-Crummey %A Linda Torczon %B Journal of Parallel and Distributed Computing %V 61 %P 1803-1826 %8 2001-12 %G eng %0 Conference Paper %B Conference on Linux Clusters: The HPC Revolution %D 2001 %T Using PAPI for Hardware Performance Monitoring on Linux Systems %A Jack 
Dongarra %A Kevin London %A Shirley Moore %A Phil Mucci %A Dan Terpstra %K papi %X PAPI is a specification of a cross-platform interface to hardware performance counters on modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to a processor's function. Monitoring these events has a variety of uses in application performance analysis and tuning. The PAPI specification consists of both a standard set of events deemed most relevant for application performance tuning, as well as both high-level and low-level sets of routines for accessing the counters. The high level interface simply provides the ability to start, stop, and read sets of events, and is intended for the acquisition of simple but accurate measurement by application engineers. The fully programmable low-level interface provides sophisticated options for controlling the counters, such as setting thresholds for interrupt on overflow, as well as access to all native counting modes and events, and is intended for third-party tool writers or users with more sophisticated needs. PAPI has been implemented on a number of platforms, including Linux/x86 and Linux/IA-64. The Linux/x86 implementation requires a kernel patch that provides a driver for the hardware counters. The driver memory maps the counter registers into user space and allows virtualizing the counters on a per-process or per-thread basis. The kernel patch is being proposed for inclusion in the main Linux tree. The PAPI library provides access on Linux platforms not only to the standard set of events mentioned above but also to all the Linux/x86 and Linux/IA-64 native events. 
PAPI has been installed and is in use, either directly or through incorporation into third-party end-user performance analysis tools, on a number of Linux clusters, including the New Mexico LosLobos cluster and Linux clusters at NCSA and the University of Tennessee being used for the GrADS (Grid Application Development Software) project. %B Conference on Linux Clusters: The HPC Revolution %I Linux Clusters Institute %C Urbana, Illinois %8 2001-06 %G eng %0 Generic %D 2000 %T Automated Empirical Optimizations of Software and the ATLAS Project (LAPACK Working Note 147) %A Clint Whaley %A Antoine Petitet %A Jack Dongarra %K atlas %B University of Tennessee Computer Science Department Technical Report, %8 2000-09 %G eng %0 Conference Proceedings %B Proceedings of SuperComputing 2000 (SC'2000) %D 2000 %T Automatically Tuned Collective Communications %A Sathish Vadhiyar %A Graham Fagg %A Jack Dongarra %K ftmpi %B Proceedings of SuperComputing 2000 (SC'2000) %C Dallas, TX %8 2000-11 %G eng %0 Generic %D 2000 %T Design and Implementation of NetSolve using DCOM as the Remoting Layer %A Ganapathy Raman %A Jack Dongarra %K netsolve %B University of Tennessee Computer Science Department Technical Report %8 2000-05 %G eng %0 Journal Article %J Concurrency: Practice and Experience %D 2000 %T The Design and Implementation of the Parallel Out of Core ScaLAPACK LU, QR, and Cholesky Factorization Routines %A Eduardo D'Azevedo %A Jack Dongarra %B Concurrency: Practice and Experience %V 12 %P 1481-1493 %8 2000-01 %G eng %0 Conference Proceedings %B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications %D 2000 %T Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications %A Dorian Arnold %A Jack Dongarra %K netsolve %B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications %C Ottawa, Canada %8 2000-10 %G eng %0 Conference 
Proceedings %B Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000 %D 2000 %T FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World %A Graham Fagg %A Jack Dongarra %K ftmpi %B Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000 %C (Hungary: Springer Verlag, 2000) %P V1908,346-353 %8 2000-01 %G eng %0 Generic %D 2000 %T The GrADS Project: Software Support for High-Level Grid Application Development %A Francine Berman %A Andrew Chien %A Keith Cooper %A Jack Dongarra %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A Ken Kennedy %A Carl Kesselman %A Dan Reed %A Linda Torczon %A Rich Wolski %K grads %B Technical Report %8 2000-02 %G eng %0 Conference Proceedings %B FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear) %D 2000 %T High Performance Computing Today %A Jack Dongarra %A Hans Meuer %A Horst D. Simon %A Erich Strohmaier %B FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear) %8 2000-01 %G eng %0 Journal Article %J Encyclopedia of Electrical and Engineering, Supplement 1 %D 2000 %T Message Passing Software Systems %A Jack Dongarra %A Graham Fagg %A Rolf Hempel %A David W. Walker %E J. Webster %K ftmpi %B Encyclopedia of Electrical and Engineering, Supplement 1 %I John Wiley & Sons, Inc. 
%8 2000-00 %G eng %0 Conference Proceedings %B 2000 International Conference on Parallel Processing (ICPP-2000) %D 2000 %T The NetSolve Environment: Progressing Towards the Seamless Grid %A Dorian Arnold %A Jack Dongarra %K netsolve %B 2000 International Conference on Parallel Processing (ICPP-2000) %C Toronto, Canada %8 2000-08 %G eng %0 Conference Proceedings %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %D 2000 %T A New Recursive Implementation of Sparse Cholesky Factorization %A Jack Dongarra %A Padma Raghavan %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %C Lausanne, Switzerland %8 2000-08 %G eng %0 Generic %D 2000 %T Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report) %A Jack Dongarra %B University of Tennessee Computer Science Department Technical Report %8 2000-01 %G eng %0 Generic %D 2000 %T A Portable Programming Interface for Performance Evaluation on Modern Processors %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A Kevin London %A Phil Mucci %B University of Tennessee Computer Science Technical Report, UT-CS-00-444 %8 2000-07 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2000 %T A Portable Programming Interface for Performance Evaluation on Modern Processors %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A George Ho %A Phil Mucci %K papi %B The International Journal of High Performance Computing Applications %V 14 %P 189-204 %8 2000-09 %G eng %R https://doi.org/10.1177/109434200001400303 %0 Journal Article %J ASTC-HPC 2000 %D 2000 %T Providing Infrastructure and Interface to High Performance Applications in a Distributed Setting %A Dorian Arnold %A Wonsuck Lee %A Jack Dongarra %A Mary Wheeler %B ASTC-HPC 2000 %C Washington, DC %8 2000-04 %G eng %0 Conference Proceedings %B Lecture Notes in Computer 
Science: Proceedings of 7th European PVM/MPI Users' Group Meeting 2000 %D 2000 %T Recent Advances in Parallel Virtual Machine and Message Passing Interface %A Jack Dongarra %A Peter Kacsuk %A N. Podhorszki %K ftmpi %B Lecture Notes in Computer Science: Proceedings of 7th European PVM/MPI Users' Group Meeting 2000 %C (Hungary: Springer Verlag) %P V1908 %8 2000-01 %G eng %0 Conference Proceedings %B Proceedings of 1st SGI Users Conference %D 2000 %T Recursive approach in sparse matrix LU factorization %A Jack Dongarra %A Victor Eijkhout %A Piotr Luszczek %B Proceedings of 1st SGI Users Conference %C Cracow, Poland (ACC Cyfronet UMM, 2000) %P 409-418 %8 2000-01 %G eng %0 Conference Proceedings %B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing %D 2000 %T Request Sequencing: Optimizing Communication for the Grid %A Dorian Arnold %A Dieter Bachmann %A Jack Dongarra %K netsolve %B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing %C (Germany: Springer Verlag 2000) %P V1900,1213-1222 %8 2000-01 %G eng %0 Conference Proceedings %B Proceedings of SuperComputing 2000 (SC'00) %D 2000 %T A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A Kevin London %A Phil Mucci %K papi %B Proceedings of SuperComputing 2000 (SC'00) %C Dallas, TX %8 2000-11 %G eng %0 Conference Proceedings %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %D 2000 %T Seamless Access to Adaptive Solver Algorithms %A Dorian Arnold %A Susan Blackford %A Jack Dongarra %A Victor Eijkhout %A Tinghua Xu %K netsolve %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %C Lausanne, Switzerland %8 2000-08 %G eng %0 Generic %D 2000 %T Secure Remote Access to Numerical 
Software and Computation Hardware %A Dorian Arnold %A Shirley Browne %A Jack Dongarra %A Graham Fagg %A Keith Moore %B University of Tennessee Computer Science Technical Report, UT-CS-00-446 %8 2000-07 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000 %D 2000 %T Secure Remote Access to Numerical Software and Computational Hardware %A Dorian Arnold %A Shirley Browne %A Jack Dongarra %A Graham Fagg %A Keith Moore %K netsolve %B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000 %C Albuquerque, NM %8 2000-06 %G eng %0 Generic %D 2000 %T Top500 Supercomputer Sites (15th edition) %A Jack Dongarra %A Hans Meuer %A Erich Strohmaier %K top500 %B University of Tennessee Computer Science Department Technical Report %8 2000-06 %G eng %0 Journal Article %J Parallel Processing Letters %D 1999 %T Algorithmic Issues on Heterogeneous Computing Platforms %A Pierre Boulet %A Jack Dongarra %A Fabrice Rastello %A Yves Robert %A Frederic Vivien %B Parallel Processing Letters %V 9 %P 197-213 %8 1999-01 %G eng %0 Journal Article %J SIAM News %D 1999 %T Atlanta Organizers Put Mathematics to Work For the Math Sciences Community %A Michael Berry %A Jack Dongarra %B SIAM News %V 32 %8 1999-01 %G eng %0 Generic %D 1999 %T A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow Banded Linear Systems II (LAPACK Working Note 143) %A Peter Arbenz %A Andrew Cleary %A Jack Dongarra %A Markus Hegland %B University of Tennessee Computer Science Department Technical Report %8 1999-01 %G eng %0 Generic %D 1999 %T A Comparison of Parallel Solvers for General Narrow Banded Linear Systems (LAPACK Working Note 142) %A Peter Arbenz %A Andrew Cleary %A Jack Dongarra %A Markus Hegland %B University of Tennessee Computer Science Technical Report %8 1999-01 %G eng %0 Journal Article %J Future Generation Computer Systems %D 1999 %T Deploying Fault-tolerance and Task Migration with NetSolve %A Henri Casanova %A James Plank %A Micah 
Beck %A Jack Dongarra %K netsolve %B Future Generation Computer Systems %I Elsevier %V 15 %P 745-755 %8 1999-10 %G eng %0 Journal Article %J Parallel and Distributed Computing Practices, Special Issue: Cluster Computing %D 1999 %T Experiences with Windows 95/NT as a Cluster Computing Platform for Parallel Computing %A Markus Fischer %A Jack Dongarra %B Parallel and Distributed Computing Practices, Special Issue: Cluster Computing %I Nova Science Publishers, USA %V 2 %P 119-128 %8 1999-02 %G eng %0 Journal Article %J International Journal on Future Generation Computer Systems %D 1999 %T HARNESS: A Next Generation Distributed Virtual Machine %A Micah Beck %A Jack Dongarra %A Graham Fagg %A Al Geist %A Paul Gray %A James Kohl %A Mauro Migliardi %A Keith Moore %A Terry Moore %A Philip Papadopoulous %A Stephen L. Scott %A Vaidy Sunderam %K harness %B International Journal on Future Generation Computer Systems %V 15 %P 571-582 %8 1999-01 %G eng %0 Journal Article %J Philadelphia: Society for Industrial and Applied Mathematics %D 1999 %T LAPACK Users' Guide, 3rd ed. %A Ed Anderson %A Zhaojun Bai %A Christian Bischof %A Susan Blackford %A James Demmel %A Jack Dongarra %A Jeremy Du Croz %A Anne Greenbaum %A Sven Hammarling %A Alan McKenney %A Danny Sorensen %B Philadelphia: Society for Industrial and Applied Mathematics %8 1999-01 %G eng %0 Journal Article %J Computer Communications %D 1999 %T Logistical Quality of Service in NetSolve %A Micah Beck %A Henri Casanova %A Jack Dongarra %A Terry Moore %A James Plank %A Francine Berman %A Rich Wolski %K netsolve %B Computer Communications %V 22 %P 1034-1044 %8 1999-01 %G eng %0 Journal Article %J IEEE Cluster Computing BOF at SC99 %D 1999 %T Numerical Libraries and Tools for Scalable Parallel Cluster Computing %A Shirley Browne %A Jack Dongarra %A Anne Trefethen %B IEEE Cluster Computing BOF at SC99 %C Portland, Oregon %8 1999-01 %G eng %0 Journal Article %J Encyclopedia of Computer Science and Technology, eds. 
Kent, A., Williams, J. %D 1999 %T Numerical Linear Algebra %A Jack Dongarra %A Victor Eijkhout %E Marcel Dekker %B Encyclopedia of Computer Science and Technology, eds. Kent, A., Williams, J. %V 41 %P 207-233 %8 1999-08 %G eng %0 Journal Article %J Journal of Computational and Applied Mathematics %D 1999 %T Numerical Linear Algebra Algorithms and Software %A Jack Dongarra %A Victor Eijkhout %B Journal of Computational and Applied Mathematics %V 123 %P 489-514 %8 1999-10 %G eng %0 Journal Article %J SIAM Annual Meeting %D 1999 %T A Numerical Linear Algebra Problem Solving Environment Designer's Perspective (LAPACK Working Note 139) %A Antoine Petitet %A Henri Casanova %A Clint Whaley %A Jack Dongarra %A Yves Robert %B SIAM Annual Meeting %C Atlanta, GA %8 1999-05 %G eng %0 Journal Article %J Handbook on Parallel and Distributed Processing %D 1999 %T Parallel and Distributed Scientific Computing: A Numerical Linear Algebra Problem Solving Environment Designer's Perspective %A Antoine Petitet %A Henri Casanova %A Jack Dongarra %A Yves Robert %A Clint Whaley %B Handbook on Parallel and Distributed Processing %8 1999-01 %G eng %0 Journal Article %J Journal on Future Generation Computer Systems %D 1999 %T Scalable Networked Information Processing Environment (SNIPE) %A Graham Fagg %A Keith Moore %A Jack Dongarra %K harness %B Journal on Future Generation Computer Systems %V 15 %P 595-605 %8 1999-01 %G eng %0 Journal Article %J Parallel Computing %D 1999 %T Static Tiling for Heterogeneous Computing Platforms %A Pierre Boulet %A Jack Dongarra %A Yves Robert %A Frederic Vivien %B Parallel Computing %V 25 %P 547-568 %8 1999-01 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 1999 %T Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments %A Henri Casanova %A Myung Ho Kim %A James Plank %A Jack Dongarra %B Journal of Parallel and Distributed Computing %V 98 %P 68-91 %8 1999-01 %G eng %0 Journal Article %J 
Concurrency: Practice and Experience %D 1999 %T Tiling on Systems with Communication/Computation Overlap %A Pierre-Yves Calland %A Jack Dongarra %A Yves Robert %B Concurrency: Practice and Experience %V 11 %P 139-153 %8 1999-01 %G eng %0 Generic %D 1999 %T Top500 Supercomputer Sites (13th edition) %A Jack Dongarra %A Hans Meuer %A Erich Strohmaier %K top500 %B University of Tennessee Computer Science Department Technical Report %8 1999-06 %G eng %0 Generic %D 1999 %T Top500 Supercomputer Sites (14th edition) %A Jack Dongarra %A Hans Meuer %A Erich Strohmaier %K top500 %B University of Tennessee Computer Science Department Technical Report %8 1999-11 %G eng %0 Conference Paper %B 1998 ACM/IEEE conference on Supercomputing (SC '98) %D 1998 %T Automatically Tuned Linear Algebra Software %A Clint Whaley %A Jack Dongarra %K BLAS %K code generation %K high performance %K linear algebra %K optimization %K Tuning %X This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However much of the technology and approach developed here can be applied to the other Level 3 BLAS and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations. 
%B 1998 ACM/IEEE conference on Supercomputing (SC '98) %I IEEE Computer Society %C Orlando, FL %8 1998-11 %@ 0-89791-984-X %G eng %0 Book %D 1998 %T MPI - The Complete Reference, Volume 1: The MPI Core %A Marc Snir %A Steve Otto %A Steven Huss-Lederman %A David Walker %A Jack Dongarra %X Since its release in summer 1994, the Message Passing Interface (MPI) specification has become a standard for message-passing libraries for parallel computations. There exist more than a dozen implementations on a variety of computing platforms, from the IBM SP-2 supercomputer to PCs running Windows NT. The initial MPI Standard, known as MPI-1, has been modified over the last two years. This volume, the definitive reference manual for the latest version of MPI-1, contains a complete specification of the MPI Standard. It is annotated with comments that clarify complicated issues, including why certain design choices were made, how users are intended to use the interface, and how they should construct their version of MPI. The volume also provides many detailed, illustrative programming examples. %7 Second %I MIT Press %C Cambridge, MA, USA %P 426 %8 1998-08 %@ 978-0-262-69215-1 %G eng %0 Journal Article %J D-Lib Magazine %D 1998 %T National HPCC Software Exchange (NHSE): Uniting the High Performance Computing and Communications Community %A Shirley Browne %A Jack Dongarra %A Jeff Horner %A Paul McMahan %A Scott Wells %K rib %B D-Lib Magazine %8 1998-01 %G eng %0 Book %B Software, Environments and Tools %D 1998 %T Numerical Linear Algebra for High-Performance Computers %A Jack Dongarra %A Iain Duff %A Danny Sorensen %A Henk van der Vorst %X This book presents a unified treatment of recently developed techniques and current understanding about solving systems of linear equations and large scale eigenvalue problems on high-performance computers. It provides a rapid introduction to the world of vector and parallel processing for these linear algebra applications. 
Topics include major elements of advanced-architecture computers and their performance, recent algorithmic development, and software for direct solution of dense matrix problems, direct solution of sparse systems of equations, iterative solution of sparse systems of equations, and solution of large sparse eigenvalue problems. This book supersedes the SIAM publication Solving Linear Systems on Vector and Shared Memory Computers, which appeared in 1990. The new book includes a considerable amount of new material in addition to incorporating a substantial revision of existing text. %B Software, Environments and Tools %I SIAM %G eng %R https://doi.org/10.1137/1.9780898719611 %0 Journal Article %J Computer Physics Communications %D 1996 %T ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance %A Jaeyoung Choi %A Jim Demmel %A Inderjit Dhillon %A Jack Dongarra %A Susan Ostrouchov %A Antoine Petitet %A Kendall Stanley %A David Walker %A Clint Whaley %X This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK. This paper outlines the difficulties inherent in producing correct codes for networks of heterogeneous processors. We define a theoretical model of parallel computers dedicated to linear algebra applications: the Distributed Linear Algebra Machine (DLAM). This model provides a convenient framework for developing parallel algorithms and investigating their scalability, performance and programmability. Extensive performance results on various platforms are presented and analyzed with the help of the DLAM. 
Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems. %B Computer Physics Communications %V 97 %P 1-15 %8 1996-08 %G eng %N 1-2 %R https://doi.org/10.1016/0010-4655(96)00017-3