%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. We consider extensions beyond the original BLAS standard that specify a programming interface not only for routines with uniformly sized matrices and/or vectors, but also for situations where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facilities.
%8 07-2018
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics.
We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 07-2018
%G eng
%N 4
%R https://doi.org/10.1177/1094342018778123
%0 Journal Article
%J Journal of Advances in Modeling Earth Systems
%D 2018
%T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling
%A Jian Sun
%A Joshua Fu
%A John Drake
%A Qingzhao Zhu
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K compiler
%K CUDA
%K data transfer
%K gpu
%K hybrid
%K memory layout
%X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulations with regional detail possible, but few studies have explored the potential benefit for atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to a ~11.02× difference in computational time. In contrast, the GPU version performs best when using a combination of a fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the implementation of the hybrid CPU/GPU version, but rescheduling the workload among the CPU cores is required before the practical CAM4‐Chem simulation.
%B Journal of Advances in Modeling Earth Systems
%V 10
%P 1952–1969
%8 08-2018
%G eng
%N 8
%R https://doi.org/10.1029/2018MS001276
%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2018)
%D 2018
%T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Panruo Wu
%A Srikara Pranesh
%A Stanimire Tomov
%A Jack Dongarra
%X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both hardware features and algorithms is an effective solution to achieve power efficiency, and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most energy efficiency approaches aim to reduce consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency. Compared to highly optimized linear system solvers, our kernels deliver a solution of the same accuracy up to 2× faster and reduce the energy consumption by up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to 4×, with more than an 80% reduction in energy consumption.
%B International Conference on Computational Science (ICCS 2018)
%I Springer
%C Wuxi, China
%V 10860
%P 586–600
%8 06-2018
%G eng
%U https://rdcu.be/bcKSC
%R https://doi.org/10.1007/978-3-319-93698-7_45
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Stanimire Tomov
%A Jack Dongarra
%X We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve speedups versus cuBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 973–984
%8 05-2018
%G eng
%N 5
%R 10.1109/TPDS.2017.2783929
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Solving Linear Diophantine Systems on Parallel Architectures
%A Dmitry Zaitsev
%A Stanimire Tomov
%A Jack Dongarra
%K Mathematical model
%K Matrix decomposition
%K Parallel architectures
%K Petri nets
%K Software algorithms
%K Sparse matrices
%K Task analysis
%X Solving linear Diophantine systems of equations is applied in discrete-event systems, model checking, formal languages and automata, logic programming, cryptography, networking, signal processing, and chemistry. For modeling discrete systems with Petri nets, a solution in non-negative integer numbers is required, which represents an intractable problem. For this reason, solving such tasks with significant speedup is highly desirable. In this paper we design a new solver of linear Diophantine systems based on the parallel-sequential composition of the system clans. The solver is studied and implemented to run on parallel architectures using a two-level parallelization concept based on MPI and OpenMP. A decomposable system is usually represented by a sparse matrix; a minimal clan size of the decomposition restricts the granulation of the technique. MPI is applied for solving systems for clans using a parallel-sequential composition on distributed-memory computing nodes, while OpenMP is applied in solving a single indecomposable system on a single node using multiple cores. A dynamic task-dispatching subsystem is developed for distributing systems on nodes in the process of compositional solution. Computational speedups are obtained on a series of test examples; the best case achieves a speedup of up to 45 times on 5 nodes with 20 cores each.
%B IEEE Transactions on Parallel and Distributed Systems
%8 10-2018
%G eng
%U https://ieeexplore.ieee.org/document/8482295
%R 10.1109/TPDS.2018.2873354
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Panruo Wu
%A Mawussi Zounon
%A Jack Dongarra
%K linear algebra
%K multithreading
%K runtime
%K symmetric indefinite matrices
%X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention, not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures (including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8) demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 1879–1892
%8 08-2018
%G eng
%N 8
%R 10.1109/TPDS.2018.2808964
%0 Conference Paper
%B ISC High Performance (ISC'18), Best Poster
%D 2018
%T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Jack Dongarra
%B ISC High Performance (ISC'18), Best Poster
%C Frankfurt, Germany
%8 06-2018
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
%A Jack Dongarra
%A Sven Hammarling
%A Nick Higham
%A Samuel Relton
%A Pedro Valero-Lara
%A Mawussi Zounon
%K Batched BLAS
%K BLAS
%K High-performance computing
%K Memory management
%K Parallel processing
%K Scientific computing
%X A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems, that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
%B International Conference on Computational Science (ICCS 2017)
%I Elsevier
%C Zürich, Switzerland
%8 06-2017
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.138
%0 Conference Paper
%B ACM MultiMedia Workshop 2017
%D 2017
%T Efficient Communications in Training Large Scale Neural Networks
%A Yiyang Zhao
%A Linnan Wang
%A Wei Wu
%A George Bosilca
%A Richard Vuduc
%A Jinmian Ye
%A Wenqi Tang
%A Zenglin Xu
%X We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominate overall execution time and limit parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) scales like O(log P). LP also demonstrates up to 2× higher bandwidth than the Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
%B ACM MultiMedia Workshop 2017
%I ACM
%C Mountain View, CA
%8 10-2017
%G eng
%0 Conference Paper
%B Euro-Par 2017
%D 2017
%T Optimized Batched Linear Algebra for Modern Architectures
%A Jack Dongarra
%A Sven Hammarling
%A Nick Higham
%A Samuel Relton
%A Mawussi Zounon
%X Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, 2 × 2 for example, this design exhibits two main issues: it fails to fully utilize the vector units and the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from 2 × 2 to 32 × 32. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
%B Euro-Par 2017
%I Springer
%C Santiago de Compostela, Spain
%8 08-2017
%G eng
%R https://doi.org/10.1007/978-3-319-64203-1_37
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 06-2017
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 06-2017
%G eng
%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 06-2013
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf
%0 Journal Article
%J Scalable Computing and Communications: Theory and Practice
%D 2013
%T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Piotr Luszczek
%A Jack Dongarra
%E Samee Khan
%E Lin-Wang Wang
%E Albert Zomaya
%B Scalable Computing and Communications: Theory and Practice
%I John Wiley & Sons
%P 699-735
%8 03-2013
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
%B Concurrency and Computation: Practice and Experience
%8 11-2013
%G eng
%R 10.1002/cpe.3173
%0 Generic
%D 2012
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%B University of Tennessee Computer Science Technical Report (also LAWN 269)
%8 06-2012
%G eng
%0 Journal Article
%J Numerical Mathematics: Theory, Methods and Applications
%D 2010
%T Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D
%A Luc Giraud
%A Azzam Haidar
%A Yousef Saad
%E C. Zhiming
%B Numerical Mathematics: Theory, Methods and Applications
%I Global Science Press
%C Beijing
%V 3
%P 64-82
%8 00-2010
%G eng
%0 Journal Article
%J Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%D 2009
%T Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Dmitrii Zagorodnov
%A Jack Dongarra
%A Rich Wolski
%B Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%I Springer Netherlands
%V 12
%P 101-122
%8 00-2009
%G eng
%0 Journal Article
%J in Handbook of Research on Scalable Computing Technologies (to appear)
%D 2009
%T Reliability and Performance Modeling and Analysis for Grid Computing
%A Yuan-Shun Dai
%A Jack Dongarra
%E Kuan-Ching Li
%E Ching-Hsien Hsu
%E Laurence Yang
%E Jack Dongarra
%E Hans Zima
%B in Handbook of Research on Scalable Computing Technologies (to appear)
%I IGI Global
%P 219-245
%8 00-2009
%G eng
%0 Conference Proceedings
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%D 2009
%T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
%A Lavanya Ramakrishnan
%A Daniel Nurmi
%A Anirban Mandal
%A Charles Koelbel
%A Dennis Gannon
%A Mark Huang
%A Yang-Suk Kee
%A Graziano Obertelli
%A Kiran Thyagaraja
%A Rich Wolski
%A Asim YarKhan
%A Dmitrii Zagorodnov
%K grads
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%C Portland, OR
%8 00-2009
%G eng
%0 Journal Article
%J in Advances in Computers
%D 2008
%T DARPA's HPCS Program: History, Models, Tools, Languages
%A Jack Dongarra
%A Robert Graybill
%A William Harrod
%A Robert Lucas
%A Ewing Lusk
%A Piotr Luszczek
%A Janice McMahon
%A Allan Snavely
%A Jeffrey Vetter
%A Katherine Yelick
%A Sadaf Alam
%A Roy Campbell
%A Laura Carrington
%A Tzu-Yi Chen
%A Omid Khalili
%A Jeremy Meredith
%A Mustafa Tikir
%E M. Zelkowitz
%B in Advances in Computers
%I Elsevier
%V 72
%8 01-2008
%G eng
%0 Journal Article
%J Computing and Informatics
%D 2008
%T Interactive Grid-Access Using Gridsolve and Giggle
%A Marcus Hardt
%A Keith Seymour
%A Jack Dongarra
%A Michael Zapf
%A Nicole Ruiter
%K netsolve
%B Computing and Informatics
%V 27
%P 233-248
%@ 1335-9150
%8 00-2008
%G eng
%0 Journal Article
%J J. Phys.: Conf. Ser. 46
%D 2006
%T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures
%A Alex Zunger
%A Alberto Franceschetti
%A Gabriel Bester
%A Wesley B. Jones
%A Kwiseon Kim
%A Peter A. Graf
%A Lin-Wang Wang
%A Andrew Canning
%A Osni Marques
%A Christof Voemel
%A Jack Dongarra
%A Julien Langou
%A Stanimire Tomov
%K DOE_NANO
%B J. Phys.: Conf. Ser. 46
%V 46
%R https://doi.org/10.1088/1742-6596/46/1/040
%P 292-298
%8 01-2006
%G eng
%0 Conference Proceedings
%B In Proceedings of the 2005 SciDAC Conference
%D 2005
%T Performance Analysis of GYRO: A Tool Evaluation
%A Patrick H. Worley
%A Jeff Candy
%A Laura Carrington
%A Kevin Huck
%A Timothy Kaiser
%A Kumar Mahinthakumar
%A Allen Malony
%A Shirley Moore
%A Dan Reed
%A Philip C. Roth
%A H. Shan
%A Sameer Shende
%A Allan Snavely
%A S. Sreepathi
%A Felix Wolf
%A Y. Zhang
%K kojak
%B In Proceedings of the 2005 SciDAC Conference
%C San Francisco, CA
%8 06-2005
%G eng
%0 Journal Article
%J Oak Ridge National Laboratory Report
%D 2004
%T Cray X1 Evaluation Status Report
%A Pratul Agarwal
%A R. A. Alexander
%A E. Apra
%A Satish Balay
%A Arthur S. Bland
%A James Colgan
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Tom Dunigan
%A Mark Fahey
%A Al Geist
%A M. Gordon
%A Robert Harrison
%A Dinesh Kaushik
%A M. Krishnakumar
%A Piotr Luszczek
%A Tony Mezzacapa
%A Jeff Nichols
%A Jarek Nieplocha
%A Leonid Oliker
%A T. Packwood
%A M. Pindzola
%A Thomas C. Schulthess
%A Jeffrey Vetter
%A James B White
%A T. Windus
%A Patrick H. Worley
%A Thomas Zacharia
%B Oak Ridge National Laboratory Report
%V ORNL/TM-2004/13
%8 01-2004
%G eng
%0 Journal Article
%J Engineering the Grid (to appear)
%D 2004
%T An Overview of Heterogeneous High Performance and Grid Computing
%A Jack Dongarra
%A Alexey Lastovetsky
%E Beniamino Di Martino
%E Jack Dongarra
%E Adolfy Hoisie
%E Laurence Yang
%E Hans Zima
%B Engineering the Grid (to appear)
%I Nova Science Publishers, Inc.
%8 00-2004
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Computational Science — ICCS 2003
%A Peter M. Sloot
%A David Abramson
%A Alexander V. Bogdanov
%A Jack Dongarra
%A Albert Zomaya
%A Yuriy Gorbachev
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%C Melbourne, Australia
%V 2657-2660
%8 06-2003
%G eng
%0 Conference Proceedings
%B PADTAD Workshop, IPDPS 2003
%D 2003
%T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%A Haihang You
%A Min Zhou
%K lacsi
%K papi
%B PADTAD Workshop, IPDPS 2003
%C Nice, France
%8 04-2003
%G eng