%0 Generic
%D 2019
%T BDEC2 Platform White Paper
%A Todd Gamblin
%A Pete Beckman
%A Kate Keahey
%A Kento Sato
%A Masaaki Kondo
%A Gerofi Balazs
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-09
%G eng
%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The extensions beyond the original BLAS standard are considered that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng
%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batch computation
%K GPU computing
%K matrix factorization
%X The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8×.
%B Journal of Computational Science
%V 26
%P 226–236
%8 2018-05
%G eng
%R https://doi.org/10.1016/j.jocs.2018.01.005
%0 Generic
%D 2018
%T Bidiagonal SVD Computation via an Associated Tridiagonal Eigenproblem
%A Osni Marques
%A James Demmel
%A Paulo B. Vasconcelos
%X In this paper, we present an algorithm for the singular value decomposition (SVD) of a bidiagonal matrix by means of the eigenpairs of an associated symmetric tridiagonal matrix. The algorithm is particularly suited for the computation of a subset of singular values and corresponding vectors. We focus on a sequential version of the algorithm, and discuss special cases and implementation details. We use a large set of bidiagonal matrices to assess the accuracy of the implementation in single and double precision, as well as to identify potential shortcomings. We show that the algorithm can be up to three orders of magnitude faster than existing algorithms, which are limited to the computation of a full SVD. We also show time comparisons of an implementation that uses the strategy discussed in the paper as a building block for the computation of the SVD of general matrices.
%B LAPACK Working Note
%I University of Tennessee
%8 2018-04
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the networks edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1177/1094342018778123
%0 Conference Paper
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2018
%T Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms
%A Yves Caniou
%A Eddy Caron
%A Aurélie Kong Win Chang
%A Yves Robert
%K budget aware algorithm
%K multi criteria scheduling
%K workflow
%X This paper introduces several budget-aware algorithms to deploy scientific workflows on IaaS cloud platforms, where users can request Virtual Machines (VMs) of different types, each with specific cost and speed parameters. We use a realistic application/platform model with stochastic task weights, and VMs communicating through a datacenter. We extend two well-known algorithms, MinMin and HEFT, and make scheduling decisions based upon machine availability and available budget. During the mapping process, the budget-aware algorithms make conservative assumptions to avoid exceeding the initial budget; we further improve our results with refined versions that aim at re-scheduling some tasks onto faster VMs, thereby spending any budget fraction leftover by the first allocation. These refined variants are much more time-consuming than the former algorithms, so there is a trade-off to find in terms of scalability. We report an extensive set of simulations with workflows from the Pegasus benchmark suite. Most of the time our budget-aware algorithms succeed in achieving efficient makespans while enforcing the given budget, despite (i) the uncertainty in task weights and (ii) the heterogeneity of VMs in both cost and speed values.
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%R 10.1109/IPDPSW.2018.00014
%0 Conference Proceedings
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%D 2017
%T Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K block-Jacobi preconditioner
%K Gauss-Jordan elimination
%K graphics processing units (GPUs)
%K iterative methods
%K matrix inversion
%K sparse linear systems
%X In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%S PMAM'17
%I ACM
%C New York, NY, USA
%P 1–10
%8 2017-02
%@ 978-1-4503-4883-6
%G eng
%U http://doi.acm.org/10.1145/3026937.3026940
%R 10.1145/3026937.3026940
%0 Generic
%D 2017
%T BDEC Pathways to Convergence: Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%E Terry Moore
%E Mark Asch
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-11
%G eng
%U http://www.exascale.org/bdec/report
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2017
%T Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
%A Mathieu Faverge
%A Julien Langou
%A Yves Robert
%A Jack Dongarra
%K Algorithm design and analysis
%K Approximation algorithms
%K Kernel
%K Multicore processing
%K Shape
%K Software algorithms
%K Transforms
%X We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthog-onal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiago-nalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R- factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared- memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Orlando, FL
%8 2017-05
%G eng
%R 10.1109/IPDPS.2017.46
%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4
%0 Conference Proceedings
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2016
%T Batched Generation of Incomplete Sparse Approximate Inverses on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Thomas Huckle
%A Jack Dongarra
%X Incomplete Sparse Approximate Inverses (ISAI) have recently been shown to be an attractive alternative to exact sparse triangular solves in the context of incomplete factorization preconditioning. In this paper we propose a batched GPU-kernel for the efficient generation of ISAI matrices. Utilizing only thread-local memory allows for computing the ISAI matrix with very small memory footprint. We demonstrate that this strategy is faster than the existing strategy for generating ISAI matrices, and use a large number of test matrices to assess the algorithm's efficiency in an iterative solver setting.
%B Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%S ScalA '16
%P 49–56
%8 2016-11
%@ 978-1-5090-5222-6
%G eng
%R 10.1109/ScalA.2016.11
%0 Generic
%D 2016
%T On block-asynchronous execution on GPUs
%A Hartwig Anzt
%A Edmond Chow
%A Jack Dongarra
%X This paper experimentally investigates how GPUs execute instructions when used for general purpose computing (GPGPU). We use a light-weight realizing a vector operation to analyze which vector entries are updated subsequently, and identify regions where parallel execution can be expected. The results help us to understand how GPUs operate, and map this operation mode to the mathematical concept of asynchronism. In particular it helps to understand the effects that can occur when implementing a fixed-point method using in-place updates on GPU hardware.
%B LAPACK Working Note
%8 2016-11
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn291.pdf
%0 Conference Paper
%B EuroMPI/Asia 2015 Workshop
%D 2015
%T Batched Matrix Computations on Hardware Accelerators
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for effective approach to develop energy efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybridMAGMAfactorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient for in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared to a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU.
%B EuroMPI/Asia 2015 Workshop
%C Bordeaux, France
%8 2015-09
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Batched matrix computations on hardware accelerators based on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared with a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU.
%B International Journal of High Performance Computing Applications
%8 2015-02
%G eng
%R 10.1177/1094342014567546
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Batched Matrix Computations on Hardware Accelerators Based on GPUs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng
%0 Conference Paper
%B International Supercomputing Conference 2013 (ISC'13)
%D 2013
%T Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q
%A Heike McCraw
%A Dan Terpstra
%A Jack Dongarra
%A Kris Davis
%A Roy Musselman
%B International Supercomputing Conference 2013 (ISC'13)
%I Springer
%C Leipzig, Germany
%8 2013-06
%G eng
%0 Journal Article
%J The Computer Journal
%D 2013
%T BlackjackBench: Portable Hardware Characterization with Automated Results Analysis
%A Anthony Danalis
%A Piotr Luszczek
%A Gabriel Marin
%A Jeffrey Vetter
%A Jack Dongarra
%K hardware characterization
%K micro-benchmarks
%K statistical analysis
%X DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes the application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, plus statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We aim at hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and non-uniform memory access characteristics of the system, properties of the processing cores affecting the instruction execution speed and the length of the operating system scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features could potentially interfere with each other resulting in incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation of results. We show how effective hardware metrics from our probes allow guided tuning of computational kernels that outperform an autotuning library further tuned by the hardware vendor.
%B The Computer Journal
%8 2013-03
%G eng
%R 10.1093/comjnl/bxt057
%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2013
%T A Block-Asynchronous Relaxation Method for Graphics Processing Units
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%X In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPUs computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs- or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes–using multiple iterations within the ‘‘subdomain’’ handled by a GPU thread block–we manage to not only recover the loss of global convergence but often accelerate convergence of up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing.
%B Journal of Parallel and Distributed Computing
%V 73
%P 1613–1626
%8 2013-12
%G eng
%N 12
%R http://dx.doi.org/10.1016/j.jpdc.2013.05.008
%0 Journal Article
%J ICCS 2012
%D 2012
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng
%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results
%A Anthony Danalis
%A Piotr Luszczek
%A Gabriel Marin
%A Jeffrey Vetter
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 2011-05
%G eng
%0 Journal Article
%D 2011
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%8 2011-12
%G eng
%0 Generic
%D 2011
%T A Block-Asynchronous Relaxation Method for Graphics Processing Units
%A Hartwig Anzt
%A Stanimire Tomov
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%B University of Tennessee Computer Science Technical Report
%8 2011-11
%G eng
%0 Book Section
%B Scientific Computing with Multicore and Accelerators
%D 2010
%T Blas for GPUs
%A Rajib Nath
%A Stanimire Tomov
%A Jack Dongarra
%B Scientific Computing with Multicore and Accelerators
%S Chapman & Hall/CRC Computational Science
%I CRC Press
%C Boca Raton, Florida
%@ 9781439825365
%G eng
%& 4
%0 Conference Proceedings
%B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07)
%D 2007
%T Binomial Graph: A Scalable and Fault- Tolerant Logical Network Topology
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07)
%I Springer
%C Niagara Falls, Canada
%8 2007-08
%G eng
%0 Conference Proceedings
%B 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted)
%D 2007
%T Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems
%A Jack Dongarra
%A Emmanuel Jeannot
%A Erik Saule
%A Zhiao Shi
%B 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted)
%C San Diego, CA
%8 2007-06
%G eng
%0 Journal Article
%J Future Generation Computing Systems
%D 2005
%T Biological Sequence Alignment on the Computational Grid Using the GrADS Framework
%A Asim YarKhan
%A Jack Dongarra
%K grads
%B Future Generation Computing Systems
%I Elsevier
%V 21
%P 980-986
%8 2005-06
%G eng
%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing (to appear)
%D 2004
%T Building and using a Fault Tolerant MPI implementation
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%K lacsi
%K sans
%B International Journal of High Performance Applications and Supercomputing (to appear)
%8 2004-00
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications: Special Issue - Part I & II
%D 2002
%T Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard
%B International Journal of High Performance Computing Applications: Special Issue - Part I & II
%V 16
%P 1-199
%8 2002-01
%G eng
%0 Journal Article
%J SIAM News
%D 2002
%T Biannual Top-500 Computer Lists Track Changing Environments for Scientific Computing
%A Jack Dongarra
%A Hans Meuer
%A Horst D. Simon
%A Erich Strohmaier
%K top500
%B SIAM News
%V 34
%8 2002-10
%G eng
%0 Journal Article
%J (an update), submitted to ACM TOMS
%D 2001
%T Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B (an update), submitted to ACM TOMS
%8 2001-02
%G eng
%0 Journal Article
%D 2001
%T Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard
%8 2001-01
%G eng