Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required.

This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. According to the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8-core each) running Intel MKL library.

%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016 %I IEEE %C Chicago, IL %8 05-2016 %G eng %0 Conference Proceedings %B Software for Exascale Computing - SPPEXA %D 2016 %T Domain Overlap for Iterative Sparse Triangular Solves on GPUs %A Hartwig Anzt %A Edmond Chow %A Daniel Szyld %A Jack Dongarra %E Hans-Joachim Bungartz %E Philipp Neumann %E Wolfgang E. Nagel %X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution. 
%B Software for Exascale Computing - SPPEXA %S Lecture Notes in Computer Science and Engineering %I Springer International Publishing %V 113 %P 527–545 %8 09-2016 %G eng %R 10.1007/978-3-319-40528-5_24 %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %D 2016 %T Efficiency of General Krylov Methods on GPUs – An Experimental Study %A Hartwig Anzt %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K algorithmic bombardment %K BiCGSTAB %K CGS %K gpu %K IDR(s) %K Krylov solver %K QMR %X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods’ performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are “orthogonal” in terms of problem suitability. We propose best practices for choosing methods in a “black box” scenario, where no information about the optimal solver is available. 
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %I IEEE %C Chicago, IL %8 05-2016 %G eng %R 10.1109/IPDPSW.2016.45 %0 Conference Proceedings %B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2016 %T Efficiency of General Krylov Methods on GPUs – An Experimental Study %A Hartwig Anzt %A Jack Dongarra %A Moritz Kreutzer %A Gerhard Wellein %A Martin Kohler %K algorithmic bombardment %K BiCGSTAB %K CGS %K Convergence %K Electric breakdown %K gpu %K graphics processing units %K Hardware %K IDR(s) %K Krylov solver %K Libraries %K linear systems %K QMR %K Sparse matrices %X This paper compares different Krylov methods based on short recurrences with respect to their efficiency when implemented on GPUs. The comparison includes BiCGSTAB, CGS, QMR, and IDR using different shadow space dimensions. These methods are known for their good convergence characteristics. For a large set of test matrices taken from the University of Florida Matrix Collection, we evaluate the methods' performance against different target metrics: convergence, number of sparse matrix-vector multiplications, and execution time. We also analyze whether the methods are "orthogonal" in terms of problem suitability. We propose best practices for choosing methods in a "black box" scenario, where no information about the optimal solver is available. %B 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %P 683-691 %8 05-2016 %G eng %R 10.1109/IPDPSW.2016.45 %0 Journal Article %J Journal of Computational Science %D 2016 %T Fine-grained Bit-Flip Protection for Relaxation Methods %A Hartwig Anzt %A Jack Dongarra %A Enrique S. 
Quintana-Ortí %K Bit flips %K Fault tolerance %K High Performance Computing %K iterative solvers %K Jacobi method %K sparse linear systems %X Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance. %B Journal of Computational Science %8 11-2016 %G eng %R 10.1016/j.jocs.2016.11.013 %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %D 2016 %T Heterogeneous Streaming %A Chris J. Newburn %A Gaurav Bansal %A Michael Wood %A Luis Crivelli %A Judit Planas %A Alejandro Duran %A Paulo Souza %A Leonardo Borges %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %A Hartwig Anzt %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Ichitaro Yamazaki %A Jesus Labarta %X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. 
We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application. %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %I IEEE %C Chicago, IL %8 05-2016 %G eng %0 Conference Paper %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %D 2016 %T High-performance Matrix-matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Joël Falcou %A Jack Dongarra %X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries. 
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %I Springer International Publishing %C Grenoble, France %8 08-2016 %G eng %0 Generic %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 01-2016 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 06-2016 %G eng %0 Conference Paper %B IEEE High Performance Extreme Computing Conference (HPEC'16) %D 2016 %T LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi %A Azzam Haidar %A Stanimire Tomov %A Konstantin Arturov %A Murat Guney %A Shane Story %A Jack Dongarra %X A wide variety of heterogeneous compute resources, ranging from multicore CPUs to GPUs and coprocessors, are available to modern computers, making it challenging to design unified numerical libraries that efficiently and productively use all these varied resources. For example, in order to efficiently use Intel’s Knights Landing (KNL) processor, the next-generation of Xeon Phi architectures, one must design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance. We propose a productive and portable programming model that allows us to write a serial-looking code, which, however, achieves parallelism and scalability by using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and the parallel execution. This is done through multiple techniques ranging from multi-level data partitioning to adaptive task grain sizes, and dynamic task scheduling. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. Finally, we outline the strengths and the effectiveness of this approach – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate current work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. 
%B IEEE High Performance Extreme Computing Conference (HPEC'16) %I IEEE %C Waltham, MA %8 09-2016 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2016 %T On the performance and energy efficiency of sparse linear algebra on GPUs %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers. 
%B International Journal of High Performance Computing Applications %8 10-2016 %G eng %U http://hpc.sagepub.com/content/early/2016/10/05/1094342016672081.abstract %R 10.1177/1094342016672081 %0 Generic %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for a batch of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. 
%B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 02-2016 %G eng %0 Conference Paper %B The International Supercomputing Conference (ISC High Performance 2016) %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B The International Supercomputing Conference (ISC High Performance 2016) %C Frankfurt, Germany %8 06-2016 %G eng %0 Book Section %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %E Julian M. 
Kunkel %E Pavan Balaji %E Jack Dongarra %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %I Springer International Publishing %P 21–38 %@ 978-3-319-41321-1 %G eng %U http://dx.doi.org/10.1007/978-3-319-41321-1_2 %R 10.1007/978-3-319-41321-1_2 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2016 %T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs %A Ahmad Abdelfattah %A Hatem Ltaief %A David Keyes %A Jack Dongarra %X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. 
Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications. %B Concurrency and Computation: Practice and Experience %V 28 %P 3447 - 3465 %8 05-2016 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full %N 12 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.3874 %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batched computation %K Cholesky Factorization %K GPUs %K Tuning %X Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. 
Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually.

This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.

%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 06-2016 %G eng %0 Journal Article %J International Journal of Networking and Computing %D 2016 %T Scheduling Computational Workflows on Failure-prone Platforms %A Guillaume Aupy %A Anne Benoit %A Henri Casanova %A Yves Robert %K checkpointing %K fault-tolerance %K reliability %K scheduling %K workflow %X We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations. %B International Journal of Networking and Computing %V 6 %P 2-26 %G eng %0 Journal Article %J Numerical Algorithms %D 2016 %T Updating Incomplete Factorization Preconditioners for Model Order Reduction %A Hartwig Anzt %A Edmond Chow %A Jens Saak %A Jack Dongarra %K key publication %X When solving a sequence of related linear systems by iterative methods, it is common to reuse the preconditioner for several systems, and then to recompute the preconditioner when the matrix has changed significantly. 
Rather than recomputing the preconditioner from scratch, it is potentially more efficient to update the previous preconditioner. Unfortunately, it is not always known how to update a preconditioner, for example, when the preconditioner is an incomplete factorization. A recently proposed iterative algorithm for computing incomplete factorizations, however, is able to exploit an initial guess, unlike existing algorithms for incomplete factorizations. By treating a previous factorization as an initial guess to this algorithm, an incomplete factorization may thus be updated. We use a sequence of problems from model order reduction. Experimental results using an optimized GPU implementation show that updating a previous factorization can be inexpensive and effective, making solving sequences of linear systems a potential niche problem for the iterative incomplete factorization algorithm. %B Numerical Algorithms %V 73 %P 611–630 %8 02-2016 %G eng %N 3 %R 10.1007/s11075-016-0110-2 %0 Conference Paper %B 2015 IEEE International Conference on Big Data (IEEE BigData 2015) %D 2015 %T Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs %A Mark Gates %A Hartwig Anzt %A Jakub Kurzak %A Jack Dongarra %X In this paper we accelerate the Alternating Least Squares (ALS) algorithm used for generating product recommendations on the basis of implicit feedback datasets. We approach the algorithm with concepts proven to be successful in High Performance Computing. This includes the formulation of the algorithm as a mix of cache-optimized algorithm-specific kernels and standard BLAS routines, acceleration via graphics processing units (GPUs), use of parallel batched kernels, and autotuning to identify performance winners. For benchmark datasets, the multi-threaded CPU implementation we propose achieves more than a 10 times speedup over the implementations available in the GraphLab and Spark MLlib software packages. 
For the GPU implementation, the parameters of an algorithm-specific kernel were optimized using a comprehensive autotuning sweep. This results in an additional 2 times speedup over our CPU implementation. %B 2015 IEEE International Conference on Big Data (IEEE BigData 2015) %I IEEE %C Santa Clara, CA %8 11-2015 %G eng %0 Conference Paper %B Spring Simulation Multi-Conference 2015 (SpringSim'15) %D 2015 %T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver, the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modified sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously accelerates the computation of a set of consecutive SpMVs significantly. Comparing the performance against similar routines from Intel's MKL and NVIDIA's cuSPARSE library, we identify appealing performance improvements. We integrate it into the highly optimized LOBPCG implementation. Compared to the BLOBEX CPU implementation running on two eight-core Intel Xeon E5-2690s, we accelerate the computation of a small set of eigenvectors using NVIDIA's K40 GPU by typically more than an order of magnitude. 
%B Spring Simulation Multi-Conference 2015 (SpringSim'15) %I SCS %C Alexandria, VA %8 04-2015 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction %A Hartwig Anzt %A William Sawyer %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B International Journal of High Performance Computing Applications %G eng %0 Conference Paper %B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15) %D 2015 %T Adaptive Precision Solvers for Sparse Linear Systems %A Hartwig Anzt %A Jack Dongarra %A Enrique S. Quintana-Ortí %B 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15) %I ACM %C Austin, TX %8 11-2015 %G eng %0 Conference Paper %B International Supercomputing Conference (ISC 2015) %D 2015 %T Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs %A Edmond Chow %A Hartwig Anzt %A Jack Dongarra %B International Supercomputing Conference (ISC 2015) %C Frankfurt, Germany %8 07-2015 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Batched Matrix Computations on Hardware Accelerators Based on GPUs %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general. 
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 10-2015 %G eng %0 Conference Paper %B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15) %D 2015 %T Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a SpM-matrix product (SpMM), that achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. 
These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers. %B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15) %I ACM %C San Francisco, CA %8 02-2015 %@ 978-1-4503-3404-4 %G eng %R 10.1145/2712386.2712387 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T Experiences in autotuning matrix multiplication for energy minimization on GPUs %A Hartwig Anzt %A Blake Haugen %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B Concurrency and Computation: Practice and Experience %V 27 %P 5096-5113 %8 12-2015 %G eng %N 17 %R 10.1002/cpe.3516 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs %A Hartwig Anzt %A Blake Haugen %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K Autotuning %K energy efficiency %K hardware accelerators %K matrix multiplication %K power %X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use. %B Concurrency and Computation: Practice and Experience %V 27 %P 5096 - 5113 %8 Oct-12-2015 %G eng %U http://doi.wiley.com/10.1002/cpe.3516 %U https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Fcpe.3516 %N 17 %! 
Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.3516 %0 Conference Paper %B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing %D 2015 %T GPU-accelerated Co-design of Induced Dimension Reduction: Algorithmic Fusion and Kernel Overlap %A Hartwig Anzt %A Eduardo Ponce %A Gregory D. Peterson %A Jack Dongarra %X In this paper we present an optimized GPU co-design of the Induced Dimension Reduction (IDR) algorithm for solving linear systems. Starting from a baseline implementation based on the generic BLAS routines from the MAGMA software library, we apply optimizations that are based on kernel fusion and kernel overlap. Runtime experiments are used to investigate the benefit of the distinct optimization techniques for different variants of the IDR algorithm. A comparison to the reference implementation reveals that the interplay between them can succeed in cutting the overall runtime by up to about one third. %B 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing %I ACM %C Austin, TX %8 11-2015 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2015 %T Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs %A Jakub Kurzak %A Hartwig Anzt %A Mark Gates %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %8 11-2015 %G eng %0 Conference Paper %B EuroPar 2015 %D 2015 %T Iterative Sparse Triangular Solves for Preconditioning %A Hartwig Anzt %A Edmond Chow %A Jack Dongarra %X Sparse triangular solvers are typically parallelized using level scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. 
This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations. %B EuroPar 2015 %I Springer Berlin %C Vienna, Austria %8 08-2015 %G eng %U http://dx.doi.org/10.1007/978-3-662-48096-0_50 %R 10.1007/978-3-662-48096-0_50 %0 Generic %D 2015 %T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %I ISC High Performance (ISC15), Intel Booth Presentation %C Frankfurt, Germany %8 06-2015 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. 
Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B Supercomputing Frontiers and Innovations %V 2 %8 10-2015 %G eng %R 10.14529/jsfi1504 %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Random-Order Alternating Schwarz for Sparse Triangular Solves %A Hartwig Anzt %A Edmond Chow %A Daniel Szyld %A Jack Dongarra %X Block-asynchronous Jacobi is an iteration method where a locally synchronous iteration is embedded in an asynchronous global iteration. The unknowns are partitioned into small subsets, and while the components within the same subset are iterated in Jacobi fashion, no update order between the subsets is enforced. The values of the non-local entries remain constant during the local iterations, which can result in slow inter-subset information propagation and slow convergence. Interpreting the subsets as subdomains allows us to transfer the concept of domain overlap, which typically enhances information propagation, to block-asynchronous solvers. In this talk we explore the impact of overlapping domains on the convergence and performance of block-asynchronous Jacobi iterations, and present results obtained by running this solver class on state-of-the-art HPC systems. 
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 10-2015 %G eng %0 Generic %D 2015 %T Scheduling for fault-tolerance: an introduction %A Guillaume Aupy %A Yves Robert %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 01-2015 %G eng %0 Conference Paper %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15) %D 2015 %T Tuning Stationary Iterative Solvers for Fault Resilience %A Hartwig Anzt %A Jack Dongarra %A Enrique S. Quintana-Ortí %X As the transistor’s feature size decreases following Moore’s Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications, and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error resilient iterative solvers for sparse linear systems. Contrary to most previous approaches, based on Krylov subspace methods, for this purpose we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation. 
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15) %I ACM %C Austin, TX %8 11-2015 %G eng %0 Generic %D 2014 %T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone not only for block-Krylov methods, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM vs. SpMV-based algorithms is up to six on GPUs like NVIDIA’s K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly-optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5x faster on a K40 GPU. For comparison to other computational loads, the same GPU to CPU performance acceleration is observed for the SpMV product, as well as dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5x. 
In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 10-2014 %G eng %0 Conference Paper %B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014 %D 2014 %T Hybrid Multi-Elimination ILU Preconditioners on GPUs %A Dimitar Lukarski %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems. %B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 05-2014 %G eng %0 Generic %D 2014 %T Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X Numerical methods in sparse linear algebra typically rely on a fast and efficient matrix vector product, as this usually is the backbone of iterative algorithms for solving eigenvalue problems or linear systems. 
Against the background of a large diversity in the characteristics of high performance computer architectures, it is a challenge to derive a cross-platform efficient storage format along with fast matrix vector kernels. Recently, attention focused on the SELL-C-σ format, a sliced ELLPACK format enhanced by row-sorting to reduce the fill-in when padding rows with zeros. In this paper we propose an additional modification resulting in the padded sliced ELLPACK (SELLP) format, for which we develop a sparse matrix vector CUDA kernel that is able to efficiently exploit the computing power of NVIDIA GPUs. We show that the kernel we developed outperforms straight-forward implementations for the widespread CSR and ELLPACK formats, and is highly competitive to the implementations in the highly optimized CUSPARSE library. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 04-2014 %G eng %0 Journal Article %J Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences %D 2014 %T Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems %A Hartwig Anzt %A Enrique S. Quintana-Ortí %K energy efficiency %K graphics processing units %K High Performance Computing %K iterative solvers %K multicore processors %K sparse linear systems %X While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draft and energy spent for this purpose is increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draft as well as power- and energy-saving mechanisms available in commodity hardware. 
For the particular domain of sparse linear algebra, we analyze the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance. %B Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences %V 372 %8 07-2014 %G eng %N 2018 %R 10.1098/rsta.2013.0279 %0 Conference Paper %B IPDPS 2014 %D 2014 %T Improving the performance of CA-GMRES on multicores with multiple GPUs %A Ichitaro Yamazaki %A Hartwig Anzt %A Stanimire Tomov %A Mark Hoemmen %A Jack Dongarra %X The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming a crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NVIDIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. 
Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 05-2014 %G eng %0 Conference Paper %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %D 2014 %T Optimizing Krylov Subspace Solvers on Graphics Processing Units %A Stanimire Tomov %A Piotr Luszczek %A Ichitaro Yamazaki %A Jack Dongarra %A Hartwig Anzt %A William Sawyer %X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPU-host communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of high-level sparse linear algebra libraries. 
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 05-2014 %G eng %0 Conference Paper %B VECPAR 2014 %D 2014 %T Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures %A Hartwig Anzt %A Dimitar Lukarski %A Stanimire Tomov %A Jack Dongarra %X Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this two-fold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-colored Gauss-Seidel, ILU(0) and multi-colored ILU(0), shows that the two-fold goal of a numerically stable cross-platform performant algorithm is achieved. %B VECPAR 2014 %C Eugene, OR %8 06-2014 %G eng %0 Conference Paper %B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC) %D 2014 %T Task-Based Programming for Seismic Imaging: Preliminary Results %A Lionel Boillot %A George Bosilca %A Emmanuel Agullo %A Henri Calandra %X The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. 
While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it is yet to be demonstrated that industrial parallel codes—relying on the classical Message Passing Interface (MPI) standard and that accumulate dozens of years of expertise (and countless lines of code)—may be revisited to turn them into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for Seismic Imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node, and can more efficiently exploit complex hardware such as a cache coherent Non Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator. %B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC) %I IEEE %C Paris, France %8 08-2014 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2014 %T Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors %A José I. Aliaga %A Hartwig Anzt %A Maribel Castillo %A Juan C. Fernández %A Germán León %A Joaquín Pérez %A Enrique S. Quintana-Ortí %K CG %K CPUs %K energy efficiency %K GPUs %K low-power architectures %X In this paper, we analyze the interactions occurring in the triangle performance-power-energy for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to build faster processors. 
Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights about the benefits of hands-on optimizations we include runtime and energy efficiency results for both out-of-the-box usage relying exclusively on compiler optimizations, and implementations manually optimized for target architectures, that range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems. %B Concurrency and Computation: Practice and Experience %V 27 %P 885-904 %8 09-2014 %G eng %U http://dx.doi.org/10.1002/cpe.3341 %N 4 %& 885 %R 10.1002/cpe.3341 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T A Block-Asynchronous Relaxation Method for Graphics Processing Units %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %X In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPU's computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. 
Further, enhancing the most basic asynchronous approach with hybrid schemes–using multiple iterations within the ‘‘subdomain’’ handled by a GPU thread block–we manage to not only recover the loss of global convergence but often accelerate convergence by up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing. %B Journal of Parallel and Distributed Computing %V 73 %P 1613–1626 %8 12-2013 %G eng %N 12 %R http://dx.doi.org/10.1016/j.jpdc.2013.05.008 %0 Generic %D 2013 %T On the Combination of Silent Error Detection and Checkpointing %A Guillaume Aupy %A Anne Benoit %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %K checkpointing %K error recovery %K High-performance computing %K silent data corruption %K verification %X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. 
Finally, both models are instantiated using realistic scenarios and application/architecture parameters. %B UT-CS-13-710 %I University of Tennessee Computer Science Technical Report %8 06-2013 %G eng %U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf %0 Generic %D 2013 %T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC %A Guillaume Aupy %A Mathieu Faverge %A Yves Robert %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high-level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures %B Lawn 277 %8 05-2013 %G eng %0 Journal Article %J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %D 2013 %T Multithreading in the PLASMA Library %A Jakub Kurzak %A Piotr Luszczek %A Asim YarKhan %A Mathieu Faverge %A Julien Langou %A Henricus Bouwmeester %A Jack Dongarra %E Mohamed Ahmed %E Reda Ammar %E Sanguthevar Rajasekaran %B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %I Taylor & Francis %8 00-2013 %G eng %0 Generic %D 2013 %T Optimal Checkpointing Period: Time vs. 
Energy %A Guillaume Aupy %A Anne Benoit %A Thomas Herault %A Yves Robert %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 281) %I University of Tennessee %8 10-2013 %G eng %0 Journal Article %J ICCS 2012 %D 2012 %T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Mark Gates %A Jack Dongarra %A Vincent Heuveline %B ICCS 2012 %C Omaha, NE %8 06-2012 %G eng %0 Journal Article %J EuroPar 2012 (also LAWN 260) %D 2012 %T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %A Vincent Heuveline %B EuroPar 2012 (also LAWN 260) %C Rhodes Island, Greece %8 08-2012 %G eng %0 Journal Article %J Supercomputing '12 (poster) %D 2012 %T Matrices Over Runtime Systems at Exascale %A Emmanuel Agullo %A George Bosilca %A Cedric Castagnède %A Jack Dongarra %A Hatem Ltaeif %A Stanimire Tomov %B Supercomputing '12 (poster) %C Salt Lake City, Utah %8 11-2012 %G eng %0 Journal Article %J VECPAR 2012 %D 2012 %T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators %A Ahmad Abdelfattah %A Jack Dongarra %A David Keyes %A Hatem Ltaeif %B VECPAR 2012 %C Kobe, Japan %8 07-2012 %G eng %0 Conference Proceedings %B Euro-Par 2012: Parallel Processing Workshops %D 2012 %T User Level Failure Mitigation in MPI %A Wesley Bland %E Ioannis Caragiannis %E Michael Alexander %E Rosa M. Badia %E Mario Cannataro %E Alexandru Costan %E Marco Danelutto %E Frederic Desprez %E Bettina Krammer %E Sahuquillo, J. %E Stephen L. Scott %E J. 
Weidendorfer %K ftmpi %B Euro-Par 2012: Parallel Processing Workshops %I Springer Berlin Heidelberg %C Rhodes Island, Greece %V 7640 %P 499-504 %8 08-2012 %G eng %0 Conference Proceedings %B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper) %D 2012 %T Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper) %C Rhodes Island, Greece %8 08-2012 %G eng %0 Journal Article %J SIAM Journal on Computing (submitted) %D 2012 %T Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems %A Hartwig Anzt %A Jack Dongarra %A Vincent Heuveline %B SIAM Journal on Computing (submitted) %8 03-2012 %G eng %0 Conference Proceedings %B The Twentieth International Conference on Domain Decomposition Methods %D 2011 %T Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method. 
%A Emmanuel Agullo %A Luc Giraud %A Amina Guermouche %A Azzam Haidar %A Stephane Lanteri %A Jean Roman %B The Twentieth International Conference on Domain Decomposition Methods %C La Jolla, California %8 02-2011 %G eng %U http://hal.inria.fr/inria-00577639 %0 Journal Article %D 2011 %T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Mark Gates %A Jack Dongarra %A Vincent Heuveline %K magma %8 12-2011 %G eng %0 Generic %D 2011 %T A Block-Asynchronous Relaxation Method for Graphics Processing Units %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %K magma %B University of Tennessee Computer Science Technical Report %8 11-2011 %G eng %0 Generic %D 2011 %T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %A Vincent Heuveline %K magma %B University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260) %8 12-2011 %G eng %0 Journal Article %J in GPU Computing Gems, Jade Edition %D 2011 %T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %E Wen-mei W. Hwu %K magma %K morse %B in GPU Computing Gems, Jade Edition %I Elsevier %V 2 %P 473-484 %8 00-2011 %G eng %0 Journal Article %J IEEE/ACS AICCSA 2011 %D 2011 %T LU Factorization for Accelerator-Based Systems %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Julien Langou %A Hatem Ltaeif %A Stanimire Tomov %K magma %K morse %B IEEE/ACS AICCSA 2011 %C Sharm-El-Sheikh, Egypt %8 12-2011 %G eng %0 Journal Article %J Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April %D 2011 %T Parallel algebraic domain decomposition solver for the solution of augmented systems. 
%A Emmanuel Agullo %A Luc Giraud %A Amina Guermouche %A Azzam Haidar %A Jean Roman %B Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April %8 00-2011 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2011 %T QCG-OMPI: MPI Applications on Grids. %A Emmanuel Agullo %A Camille Coti %A Thomas Herault %A Julien Langou %A Sylvain Peyronnet %A A. Rezmerita %A Franck Cappello %A Jack Dongarra %B Future Generation Computer Systems %V 27 %P 357-369 %8 01-2011 %G eng %0 Generic %D 2010 %T Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers %A Stanimire Tomov %A George Bosilca %A Cedric Augonnet %I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Tutorial %8 07-2010 %G eng %0 Generic %D 2010 %T Autotuning Dense Linear Algebra Libraries on GPUs %A Rajib Nath %A Stanimire Tomov %A Emmanuel Agullo %A Jack Dongarra %I Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010) %C Basel, Switzerland %8 06-2010 %G eng %0 Generic %D 2010 %T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %B LAPACK Working Note %8 00-2010 %G eng %0 Journal Article %J Sparse Days 2010 Meeting at CERFACS %D 2010 %T MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project %A Emmanuel Agullo %A Luc Giraud %A Amina Guermouche %A Azzam Haidar %A Jean Roman %A Yohan Lee-Tin-Yien %B Sparse Days 2010 Meeting at CERFACS %C Toulouse, France %8 06-2010 %G eng %0 Journal Article %J ICCS 2010 %D 2010 %T Proceedings of the International Conference on Computational Science %E Peter M. 
Sloot %E Geert Dick van Albada %E Jack Dongarra %B ICCS 2010 %I Elsevier %C Amsterdam %8 05-2010 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2010 %T QCG-OMPI: MPI Applications on Grids %A Emmanuel Agullo %A Camille Coti %A Thomas Herault %A Julien Langou %A Sylvain Peyronnet %A A. Rezmerita %A Franck Cappello %A Jack Dongarra %B Future Generation Computer Systems %V 27 %P 357-369 %8 03-2010 %G eng %0 Conference Proceedings %B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224) %D 2010 %T QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment %A Emmanuel Agullo %A Camille Coti %A Jack Dongarra %A Thomas Herault %A Julien Langou %B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224) %C Atlanta, GA %8 04-2010 %G eng %0 Conference Proceedings %B Proceedings of IPDPS 2011 %D 2010 %T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %K plasma %B Proceedings of IPDPS 2011 %C Anchorage, AK %8 10-2010 %G eng %0 Generic %D 2010 %T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Rajib Nath %A Jean Roman %A Samuel Thibault %A Stanimire Tomov %I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster %C Knoxville, TN %8 07-2010 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2010 %T Self-Healing Network for Scalable Fault-Tolerant Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %B Future Generation Computer Systems %V 26 %P 479-485 %8 03-2010 %G eng %0 Conference Proceedings %B 24th IEEE International Parallel and Distributed Processing Symposium 
(submitted) %D 2010 %T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures %A Bilel Hadri %A Emmanuel Agullo %A Jack Dongarra %B 24th IEEE International Parallel and Distributed Processing Symposium (submitted) %8 00-2010 %G eng %0 Journal Article %J PARA 2010 %D 2010 %T Towards a Complexity Analysis of Sparse Hybrid Linear Solvers %A Emmanuel Agullo %A Luc Giraud %A Amina Guermouche %A Azzam Haidar %A Jean Roman %B PARA 2010 %C Reykjavik, Iceland %8 06-2010 %G eng %0 Conference Proceedings %B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear) %D 2009 %T Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware %A Emmanuel Agullo %A Bilel Hadri %A Hatem Ltaeif %A Jack Dongarra %B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear) %8 00-2009 %G eng %0 Journal Article %J Lecture Notes in Computer Science: Theoretical Computer Science and General Issues %D 2009 %T Computational Science – ICCS 2009, Proceedings of the 9th International Conference %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M. 
Sloot %B Lecture Notes in Computer Science: Theoretical Computer Science and General Issues %C Baton Rouge, LA %V - %8 05-2009 %G eng %0 Journal Article %J Submitted to Transaction on Parallel and Distributed Systems %D 2009 %T Enhancing Parallelism of Tile QR Factorization for Multicore Architectures %A Bilel Hadri %A Hatem Ltaeif %A Emmanuel Agullo %A Jack Dongarra %K plasma %B Submitted to Transaction on Parallel and Distributed Systems %8 12-2009 %G eng %0 Conference Proceedings %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %D 2009 %T A Holistic Approach for Performance Measurement and Analysis for Petascale Applications %A Heike Jagode %A Jack Dongarra %A Sadaf Alam %A Jeffrey Vetter %A W. Spear %A Allen Maloney %E Gabrielle Allen %K point %K test %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %I Springer-Verlag Berlin Heidelberg 2009 %C Baton Rouge, Louisiana %V 2009 %P 686-695 %8 05-2009 %G eng %0 Journal Article %J Euro-Par 2009, Lecture Notes in Computer Science %D 2009 %T Impact of Quad-core Cray XT4 System and Software Stack on Scientific Computation %A Sadaf Alam %A Richard F. Barrett %A Heike Jagode %A J. A. Kuehn %A Steve W. Poole %A R. 
Sankaran %K test %B Euro-Par 2009, Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %C Delft, The Netherlands %V 5704/2009 %P 334-344 %8 08-2009 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2009 %T The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community %A Jack Dongarra %A Pete Beckman %A Patrick Aerts %A Franck Cappello %A Thomas Lippert %A Satoshi Matsuoka %A Paul Messina %A Terry Moore %A Rick Stevens %A Anne Trefethen %A Mateo Valero %B International Journal of High Performance Computing Applications (to appear) %8 07-2009 %G eng %0 Conference Proceedings %B SciDAC 2009, Journal of Physics: Conference Series %D 2009 %T Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team %A Bronis R. de Supinski %A Sadaf Alam %A David Bailey %A Laura Carrington %A Chris Daley %A Anshu Dubey %A Todd Gamblin %A Dan Gunter %A Paul D. Hovland %A Heike Jagode %A Karen Karavanic %A Gabriel Marin %A John Mellor-Crummey %A Shirley Moore %A Boyana Norris %A Leonid Oliker %A Catherine Olschanowsky %A Philip C. Roth %A Martin Schulz %A Sameer Shende %A Allan Snavely %K test %B SciDAC 2009, Journal of Physics: Conference Series %I IOP Publishing %C San Diego, California %V 180(2009)012039 %8 07-2009 %G eng %0 Conference Proceedings %B 9th International Conference on Computational Science (ICCS 2009) %D 2009 %T A Note on Auto-tuning GEMM for GPUs %A Yinan Li %A Jack Dongarra %A Stanimire Tomov %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M. 
Sloot %B 9th International Conference on Computational Science (ICCS 2009) %C Baton Rouge, LA %P 884-892 %8 05-2009 %G eng %R 10.1007/978-3-642-01970-8_89 %0 Generic %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Rajib Nath %A Stanimire Tomov %A Asim YarKhan %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, OR %8 11-2009 %G eng %0 Conference Proceedings %B Journal of Physics: Conference Series %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Stanimire Tomov %K magma %K plasma %B Journal of Physics: Conference Series %V 180 %8 00-2009 %G eng %0 Journal Article %J Parallel Computing %D 2009 %T Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor %A Wesley Alvaro %A Jakub Kurzak %A Jack Dongarra %B Parallel Computing %V 35 %P 138-150 %8 00-2009 %G eng %0 Generic %D 2009 %T Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures %A Bilel Hadri %A Hatem Ltaeif %A Emmanuel Agullo %A Jack Dongarra %K plasma %B Innovative Computing Laboratory Technical Report (also LAPACK Working Note 222 and CS Tech Report UT-CS-09-645) %8 09-2009 %G eng %0 Conference Proceedings %B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010) %D 2009 %T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures %A Bilel Hadri %A Hatem Ltaeif %A Emmanuel Agullo %A Jack Dongarra %K plasma %B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010) %C Atlanta, GA %8 12-2009 %G eng %0 Journal 
Article %J in Cloud Computing and Software Services: Theory and Techniques (to appear) %D 2009 %T Transparent Cross-Platform Access to Software Services using GridSolve and GridRPC %A Keith Seymour %A Asim YarKhan %A Jack Dongarra %E Syed Ahson %E Mohammad Ilyas %K netsolve %B in Cloud Computing and Software Services: Theory and Techniques (to appear) %I CRC Press %8 00-2009 %G eng %0 Conference Proceedings %B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science %D 2008 %E Marian Bubak %E Geert Dick van Albada %E Jack Dongarra %E Peter M. Sloot %B 8th International Conference on Computational Science (ICCS), Proceedings Parts I, II, and III, Lecture Notes in Computer Science %I Springer Berlin %C Krakow, Poland %V 5101 %8 01-2008 %G eng %0 Journal Article %J in Advances in Computers %D 2008 %T DARPA's HPCS Program: History, Models, Tools, Languages %A Jack Dongarra %A Robert Graybill %A William Harrod %A Robert Lucas %A Ewing Lusk %A Piotr Luszczek %A Janice McMahon %A Allan Snavely %A Jeffrey Vetter %A Katherine Yelick %A Sadaf Alam %A Roy Campbell %A Laura Carrington %A Tzu-Yi Chen %A Omid Khalili %A Jeremy Meredith %A Mustafa Tikir %E M. Zelkowitz %B in Advances in Computers %I Elsevier %V 72 %8 01-2008 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPCMP User Group Conference %D 2008 %T Exploring New Architectures in Accelerating CFD for Air Force Applications %A Jack Dongarra %A Shirley Moore %A Gregory D. 
Peterson %A Stanimire Tomov %A Jeff Allred %A Vincent Natoli %A David Richie %K magma %B Proceedings of the DoD HPCMP User Group Conference %C Seattle, Washington %8 01-2008 %G eng %0 Generic %D 2008 %T Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the CELL Processor %A Wesley Alvaro %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report %8 01-2008 %G eng %0 Journal Article %J Recent developments in Grid Technology and Applications %D 2008 %T High Performance GridRPC Middleware %A Yves Caniou %A Eddy Caron %A Frederic Desprez %A Hidemoto Nakada %A Yoshio Tanaka %A Keith Seymour %E George A. Gravvanis %E John P. Morrison %E Hamid R. Arabnia %E D. A. Power %K netsolve %B Recent developments in Grid Technology and Applications %I Nova Science Publishers %8 00-2008 %G eng %0 Conference Proceedings %B Proceedings of the 2nd International Workshop on Tools for High Performance Computing %D 2008 %T Usage of the Scalasca Toolset for Scalable Performance Analysis of Large-scale Parallel Applications %A Felix Wolf %A Brian Wylie %A Erika Abraham %A Wolfgang Frings %A Karl Fürlinger %A Markus Geimer %A Marc-Andre Hermanns %A Bernd Mohr %A Shirley Moore %A Matthias Pfeifer %E Michael Resch %E Rainer Keller %E Valentin Himmler %E Bettina Krammer %E A Schulz %K point %B Proceedings of the 2nd International Workshop on Tools for High Performance Computing %I Springer %C Stuttgart, Germany %P 157-167 %8 01-2008 %G eng %0 Conference Proceedings %B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07) %D 2007 %T Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07) %I Springer %C Niagara Falls, Canada %8 08-2007 %G eng %0 Journal Article %J Euro-Par 2007 %D 2007 
%T Decision Trees and MPI Collective Algorithm Selection Problem %A Jelena Pjesivac–Grbovic %A George Bosilca %A Graham Fagg %A Thara Angskun %A Jack Dongarra %K ftmpi %B Euro-Par 2007 %I Springer %C Rennes, France %P 105–115 %8 08-2007 %G eng %0 Journal Article %J Parallel Computing (Special Edition: EuroPVM/MPI 2006) %D 2007 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A George Bosilca %A Graham Fagg %A Thara Angskun %A Jack Dongarra %K ftmpi %B Parallel Computing (Special Edition: EuroPVM/MPI 2006) %I Elsevier %8 00-2007 %G eng %0 Conference Proceedings %B The International Conference on Parallel and Distributed Computing, applications and Technologies (PDCAT) %D 2007 %T Optimal Routing in Binomial Graph Networks %A Thara Angskun %A George Bosilca %A Brad Vander Zanden %A Jack Dongarra %K ftmpi %B The International Conference on Parallel and Distributed Computing, applications and Technologies (PDCAT) %I IEEE Computer Society %C Adelaide, Australia %8 12-2007 %G eng %0 Journal Article %J Cluster computing %D 2007 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B Cluster computing %I Springer Netherlands %V 10 %P 127-143 %8 06-2007 %G eng %0 Conference Proceedings %B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) %D 2007 %T Reliability Analysis of Self-Healing Network using Discrete-Event Simulation %A Thara Angskun %A George Bosilca %A Graham Fagg %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) %I IEEE Computer Society %P 437-444 %8 05-2007 %G eng %0 Conference Proceedings %B Proceedings of the 2007 International Conference on Computational Science (ICCS 2007) %D 2007 %T Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale 
Shared Memory Multiprocessors %A Karl Fürlinger %A Michael Gerndt %A Jack Dongarra %E Yong Shi %E Jack Dongarra %E Geert Dick van Albada %E Peter M. Sloot %K kojak %B Proceedings of the 2007 International Conference on Computational Science (ICCS 2007) %I Springer LNCS %C Beijing, China %V 4487-4490 %P 815-822 %G eng %R 10.1007/978-3-540-72586-2_115 %0 Conference Proceedings %B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007) %D 2007 %T Self-Healing in Binomial Graph Networks %A Thara Angskun %A George Bosilca %A Jack Dongarra %B 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007) %C Vilamoura, Algarve, Portugal %8 11-2007 %G eng %0 Journal Article %J 2006 Euro PVM/MPI (submitted) %D 2006 %T Flexible collective communication tuning architecture applied to Open MPI %A Graham Fagg %A Jelena Pjesivac–Grbovic %A George Bosilca %A Thara Angskun %A Jack Dongarra %K ftmpi %B 2006 Euro PVM/MPI (submitted) %C Bonn, Germany %8 01-2006 %G eng %0 Generic %D 2006 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B ICL Technical Report %8 00-2006 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2006 %T MPI Collective Algorithm Selection and Quadtree Encoding %A Jelena Pjesivac–Grbovic %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %B Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %V 4192 %P 40-48 %8 09-2006 %G eng %0 Journal Article %J 2006 Euro PVM/MPI %D 2006 %T Scalable Fault Tolerant Protocol for Parallel Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %B 2006 Euro PVM/MPI %C Bonn, Germany %8 00-2006 %G eng %0 Conference Proceedings %B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems %D 2006 %T Self-Healing 
Network for Scalable Fault Tolerant Runtime Environments %A Thara Angskun %A Graham Fagg %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems %C Innsbruck, Austria %8 01-2006 %G eng %0 Conference Proceedings %B Proc. of the 5th International Workshop on Performance Modeling, Evaluation, and Organization of Parallel and Distributed Systems (PMEO-PDS 2006) %D 2006 %T A Systematic Multi-step Methodology for Performance Analysis of Communication Traces of Distributed Applications based on Hierarchical Clustering %A Gabriela Aguilera %A Patricia J. Teller %A Michela Taufer %A Felix Wolf %K kojak %B Proc. of the 5th International Workshop on Performance Modeling, Evaluation, and Organization of Parallel and Distributed Systems (PMEO-PDS 2006) %I IEEE Computer Society %C Rhodes Island, Greece %8 04-2006 %G eng %0 Conference Proceedings %B Proceedings of Parallel Computing 2005 (ParCo) (to appear) %D 2005 %T Analysis and Optimization of Yee_Bench using Hardware Performance Counters %A Ulf Andersson %A Phil Mucci %K papi %B Proceedings of Parallel Computing 2005 (ParCo) (to appear) %C Malaga, Spain %8 01-2005 %G eng %0 Conference Proceedings %B Proceedings of 5th International Conference on Computational Science (ICCS) %D 2005 %T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %E V. S. Sunderman %E Geert Dick van Albada %E Peter M. 
Sloot %E Jack Dongarra %K doe-nano %B Proceedings of 5th International Conference on Computational Science (ICCS) %I Springer's Lecture Notes in Computer Science %C Atlanta, GA, USA %P 317-325 %8 01-2005 %G eng %0 Conference Proceedings %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %D 2005 %T Fault Tolerant High Performance Computing by a Coding Approach %A Zizhong Chen %A Graham Fagg %A Edgar Gabriel %A Julien Langou %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %K grads %K lacsi %K sans %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %C Chicago, Illinois %8 01-2005 %G eng %0 Journal Article %J Grid Computing and New Frontiers of High Performance Processing %D 2005 %T NetSolve: Grid Enabling Scientific Computing Environments %A Keith Seymour %A Asim YarKhan %A Sudesh Agrawal %A Jack Dongarra %E Lucio Grandinetti %K netsolve %B Grid Computing and New Frontiers of High Performance Processing %I Elsevier %8 00-2005 %G eng %0 Conference Proceedings %B Proceedings of 2005 European Conference on Parallel Computers (Euro-Par) (to appear) %D 2005 %T PerfMiner: Cluster-Wide Collection, Storage and Presentation of Application Level Hardware Performance Data %A Phil Mucci %A Daniel Ahlin %A Johan Danielsson %A Per Ekman %A Lars Malinowski %K papi %B Proceedings of 2005 European Conference on Parallel Computers (Euro-Par) (to appear) %C Monte de Caparica, Portugal %8 01-2005 %G eng %0 Journal Article %J Cluster Computing Journal (to appear) %D 2005 %T Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B Cluster Computing Journal (to appear) %8 01-2005 %G eng %0 Conference Proceedings %B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05) %D 2005 %T 
Performance Analysis of MPI Collective Operations %A Jelena Pjesivac–Grbovic %A Thara Angskun %A George Bosilca %A Graham Fagg %A Edgar Gabriel %A Jack Dongarra %K ftmpi %B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05) %C Denver, Colorado %8 04-2005 %G eng %0 Conference Proceedings %B Proceedings of DoD HPCMP UGC 2005 %D 2005 %T Performance Profiling and Analysis of DoD Applications using PAPI and TAU %A Shirley Moore %A David Cronk %A Felix Wolf %A Avi Purkayastha %A Patricia J. Teller %A Robert Araiza %A Gabriela Aguilera %A Jamie Nava %K papi %B Proceedings of DoD HPCMP UGC 2005 %I IEEE %C Nashville, TN %8 06-2005 %G eng %0 Conference Proceedings %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %D 2005 %T Scalable Fault Tolerant MPI: Extending the Recovery Algorithm %A Graham Fagg %A Thara Angskun %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %E Beniamino Di Martino %K ftmpi %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %I Springer-Verlag Berlin %C Sorrento (Naples), Italy %V 3666 %P 67 %8 09-2005 %G eng %0 Journal Article %J Oak Ridge National Laboratory Report %D 2004 %T Cray X1 Evaluation Status Report %A Pratul Agarwal %A R. A. Alexander %A E. Apra %A Satish Balay %A Arthur S. Bland %A James Colgan %A Eduardo D'Azevedo %A Jack Dongarra %A Tom Dunigan %A Mark Fahey %A Al Geist %A M. Gordon %A Robert Harrison %A Dinesh Kaushik %A M. Krishnakumar %A Piotr Luszczek %A Tony Mezzacapa %A Jeff Nichols %A Jarek Nieplocha %A Leonid Oliker %A T. Packwood %A M. Pindzola %A Thomas C. Schulthess %A Jeffrey Vetter %A James B White %A T. Windus %A Patrick H. 
Worley %A Thomas Zacharia %B Oak Ridge National Laboratory Report %V /-2004/13 %8 01-2004 %G eng %0 Conference Proceedings %B International Conference on Computational Science %D 2004 %T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations %A Piotr Luszczek %A Jack Dongarra %E Marian Bubak %E Geert Dick van Albada %E Peter M. Sloot %E Jack Dongarra %K lacsi %K lfc %B International Conference on Computational Science %I Springer Verlag %C Poland %8 06-2004 %G eng %R 10.1007/978-3-540-25944-2_35 %0 Conference Proceedings %B Proceedings of ISC2004 (to appear) %D 2004 %T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems %A Graham Fagg %A Edgar Gabriel %A George Bosilca %A Thara Angskun %A Zizhong Chen %A Jelena Pjesivac–Grbovic %A Kevin London %A Jack Dongarra %K ftmpi %K lacsi %B Proceedings of ISC2004 (to appear) %C Heidelberg, Germany %8 06-2004 %G eng %0 Journal Article %J International Journal for High Performance Applications and Supercomputing (to appear) %D 2004 %T Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing %A Graham Fagg %A Edgar Gabriel %A Zizhong Chen %A Thara Angskun %A George Bosilca %A Jelena Pjesivac–Grbovic %A Jack Dongarra %K ftmpi %K lacsi %B International Journal for High Performance Applications and Supercomputing (to appear) %8 04-2004 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T Computational Science — ICCS 2003 %A Peter M. Sloot %A David Abramson %A Alexander V. Bogdanov %A Jack Dongarra %A Albert Zomaya %A Yuriy Gorbachev %B Lecture Notes in Computer Science %I Springer-Verlag, Berlin %C ICCS 2003, International Conference. 
Melbourne, Australia %V 2657-2660 %8 06-2003 %G eng %0 Journal Article %J Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted) %D 2003 %T Energy Minimization of Protein Tertiary Structure by Parallel Simulated Annealing using Genetic Crossover %A Tomoyuki Hiroyasu %A Mitsunori Miki %A Shinya Ogura %A Keiko Aoi %A Takeshi Yoshida %A Yuko Okamoto %A Jack Dongarra %B Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted) %8 03-2003 %G eng %0 Conference Proceedings %B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented) %D 2003 %T Fault Tolerant Communication Library and Applications for High Performance Computing %A Graham Fagg %A Edgar Gabriel %A Zizhong Chen %A Thara Angskun %A George Bosilca %A Antonin Bukovsky %A Jack Dongarra %K ftmpi %K lacsi %B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented) %C Santa Fe, NM %8 10-2003 %G eng %0 Conference Proceedings %B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science %D 2003 %T A Fault-Tolerant Communication Library for Grid Environments %A Edgar Gabriel %A Graham Fagg %A Antonin Bukovsky %A Thara Angskun %A Jack Dongarra %K ftmpi %K lacsi %B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science %C San Francisco %8 06-2003 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2003 %T High Performance Computing for Computational Science %A Jose Palma %A Jack Dongarra %A Vicente Hernández %E Antonio Augusto Sousa %B Lecture Notes in Computer Science %I Springer-Verlag, Berlin %C VECPAR 2002, 5th International Conference June 26-28, 2002 %V 2565 %8 01-2003 %G eng %0 Journal Article %J Making the Global Infrastructure a Reality %D 2003 %T NetSolve: Past, Present, and Future - A Look at a Grid Enabled Server %A Sudesh Agrawal %A Jack Dongarra %A Keith Seymour 
%A Sathish Vadhiyar %E Francine Berman %E Geoffrey Fox %E Anthony Hey %K netsolve %B Making the Global Infrastructure a Reality %I Wiley Publishing %8 00-2003 %G eng %0 Journal Article %J Resource Management in the Grid %D 2003 %T Scheduling in the Grid Application Development Software Project %A Holly Dail %A Otto Sievert %A Francine Berman %A Henri Casanova %A Asim YarKhan %A Sathish Vadhiyar %A Jack Dongarra %A Chuang Liu %A Lingyun Yang %A Dave Angulo %A Ian Foster %K grads %B Resource Management in the Grid %I Kluwer Publishers %8 03-2003 %G eng %0 Journal Article %J Parallel and Distributed Computing Practices %D 2002 %T A Comparison of Parallel Solvers for General Narrow Banded Linear Systems %A Peter Arbenz %A Andrew Cleary %A Jack Dongarra %A Markus Hegland %B Parallel and Distributed Computing Practices %V 2 %P 385-400 %8 10-2002 %G eng %0 Generic %D 2002 %T Hardware Software Server in NetSolve %A Sudesh Agrawal %K netsolve %B ICL Technical Report %8 01-2002 %G eng %0 Journal Article %J Concurrency: Practice and Experience %D 2002 %T Innovations of the NetSolve Grid Computing System %A Dorian Arnold %A Henri Casanova %A Jack Dongarra %K netsolve %B Concurrency: Practice and Experience %V 14 %P 1457-1479 %8 01-2002 %G eng %0 Journal Article %J Parallel Computing %D 2002 %T Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B Parallel Computing %V 28 %P 1773-1788 %8 08-2002 %G eng %0 Conference Proceedings %B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops %D 2002 %T Toward a Framework for Preparing and Executing Adaptive Grid Programs %A Ken Kennedy %A John Mellor-Crummey %A Keith Cooper %A Linda Torczon %A Francine Berman %A Andrew Chien %A Dave Angulo %A Ian Foster %A Dennis Gannon %A Lennart Johnsson %A 
Carl Kesselman %A Jack Dongarra %A Sathish Vadhiyar %K grads %B International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops %C Fort Lauderdale, FL %P 0171 %8 04-2002 %G eng %0 Generic %D 2002 %T Users' Guide to NetSolve v1.4.1 %A Sudesh Agrawal %A Dorian Arnold %A Susan Blackford %A Jack Dongarra %A Michelle Miller %A Kiran Sagi %A Zhiao Shi %A Keith Seymour %A Sathish Vadhiyar %K netsolve %B ICL Technical Report %8 06-2002 %G eng %0 Journal Article %J Parallel Processing Letters %D 2001 %T On the Convergence of Computational and Data Grids %A Dorian Arnold %A Sathish Vadhiyar %A Jack Dongarra %K netsolve %B Parallel Processing Letters %V 11 %P 187-202 %8 01-2001 %G eng %0 Conference Proceedings %B Tenth International World Wide Web Conference Proceedings (to appear) %D 2001 %T Enabling Full Service Surrogates Using the Portable Channel Representation %A Micah Beck %A Terry Moore %A Leif Abrahamsson %A Christophe Achouiantz %A Patrik Johansson %B Tenth International World Wide Web Conference Proceedings (to appear) %C Hong Kong %8 05-2001 %G eng %0 Journal Article %J submitted to SC2001 %D 2001 %T Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication %A Micah Beck %A Dorian Arnold %A Alessandro Bassi %A Francine Berman %A Henri Casanova %A Jack Dongarra %A Terry Moore %A Graziano Obertelli %A James Plank %A Martin Swany %A Sathish Vadhiyar %A Rich Wolski %K netsolve %B submitted to SC2001 %C Denver, Colorado %8 11-2001 %G eng %0 Conference Proceedings %B Department of Defense Users' Group Conference (to appear) %D 2001 %T Metacomputing Support for the SARA3D Structural Acoustics Application %A Shirley Moore %A Dorian Arnold %A David Cronk %K netsolve %B Department of Defense Users' Group Conference (to appear) %C Biloxi, Mississippi %8 06-2001 %G eng %0 Journal Article %J Handbook of Massive Data Sets %D 2001 %T Overview of High Performance Computers %A Aad J. 
van der Steen %A Jack Dongarra %E James Abello %E Panos Pardalos %E Mauricio Resende %B Handbook of Massive Data Sets %I Kluwer Academic Publishers %P 791-852 %8 01-2001 %G eng %0 Conference Proceedings %B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications %D 2000 %T Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications %A Dorian Arnold %A Jack Dongarra %K netsolve %B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications %C Ottawa, Canada %8 10-2000 %G eng %0 Conference Proceedings %B 2000 International Conference on Parallel Processing (ICPP-2000) %D 2000 %T The NetSolve Environment: Progressing Towards the Seamless Grid %A Dorian Arnold %A Jack Dongarra %K netsolve %B 2000 International Conference on Parallel Processing (ICPP-2000) %C Toronto, Canada %8 08-2000 %G eng %0 Journal Article %J ASTC-HPC 2000 %D 2000 %T Providing Infrastructure and Interface to High Performance Applications in a Distributed Setting %A Dorian Arnold %A Wonsuck Lee %A Jack Dongarra %A Mary Wheeler %B ASTC-HPC 2000 %C Washington, DC %8 04-2000 %G eng %0 Conference Proceedings %B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing %D 2000 %T Request Sequencing: Optimizing Communication for the Grid %A Dorian Arnold %A Dieter Bachmann %A Jack Dongarra %K netsolve %B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing %C (Germany: Springer Verlag 2000) %P V1900,1213-1222 %8 01-2000 %G eng %0 Conference Proceedings %B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %D 2000 %T Seamless Access to Adaptive Solver Algorithms %A Dorian Arnold %A Susan Blackford %A Jack Dongarra %A Victor Eijkhout %A Tinghua Xu %K netsolve %B Proceedings of 16th 
IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation %C Lausanne, Switzerland %8 08-2000 %G eng %0 Generic %D 2000 %T Secure Remote Access to Numerical Software and Computation Hardware %A Dorian Arnold %A Shirley Browne %A Jack Dongarra %A Graham Fagg %A Keith Moore %B University of Tennessee Computer Science Technical Report, UT-CS-00-446 %8 07-2000 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000 %D 2000 %T Secure Remote Access to Numerical Software and Computational Hardware %A Dorian Arnold %A Shirley Browne %A Jack Dongarra %A Graham Fagg %A Keith Moore %K netsolve %B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000 %C Albuquerque, NM %8 06-2000 %G eng %0 Generic %D 1999 %T A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow Banded Linear Systems II (LAPACK Working Note 143) %A Peter Arbenz %A Andrew Cleary %A Jack Dongarra %A Markus Hegland %B University of Tennessee Computer Science Department Technical Report %8 01-1999 %G eng %0 Generic %D 1999 %T A Comparison of Parallel Solvers for General Narrow Banded Linear Systems (LAPACK Working Note 142) %A Peter Arbenz %A Andrew Cleary %A Jack Dongarra %A Markus Hegland %B University of Tennessee Computer Science Technical Report %8 01-1999 %G eng %0 Journal Article %J Philadelphia: Society for Industrial and Applied Mathematics %D 1999 %T LAPACK Users' Guide, 3rd ed. %A Ed Anderson %A Zhaojun Bai %A Christian Bischof %A Susan Blackford %A James Demmel %A Jack Dongarra %A Jeremy Du Croz %A Anne Greenbaum %A Sven Hammarling %A Alan McKenney %A Danny Sorensen %B Philadelphia: Society for Industrial and Applied Mathematics %8 01-1999 %G eng