%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 707
%8 2020-06
%@ 978-3-030-50371-0
%G eng
%R https://doi.org/10.1007/978-3-030-50371-0
%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part III
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 648
%8 2020-06
%@ 978-3-030-50420-5
%G eng
%R https://doi.org/10.1007/978-3-030-50420-5
%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VII
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 775
%8 2020-06
%@ 978-3-030-50436-6
%G eng
%R https://doi.org/10.1007/978-3-030-50436-6
%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VI
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 667
%8 2020-06
%@ 978-3-030-50433-5
%G eng
%R https://doi.org/10.1007/978-3-030-50433-5
%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part V
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 618
%8 2020-06
%@ 978-3-030-50426-7
%G eng
%R https://doi.org/10.1007/978-3-030-50426-7
%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part II
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 697
%8 2020-06
%@ 978-3-030-50417-5
%G eng
%R https://doi.org/10.1007/978-3-030-50417-5
%0 Book
%B Lecture Notes in Computer Science
%D 2020
%T Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part IV
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%7 1
%I Springer International Publishing
%P 668
%8 2020-06
%@ 978-3-030-50423-6
%G eng
%R https://doi.org/10.1007/978-3-030-50423-6
%0 Conference Paper
%B Computer Modeling and Intelligent Systems CMIS-2020
%D 2020
%T Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK
%A Dmitry Zaitsev
%A Piotr Luszczek
%K docker containers
%K software containers
%X The Platform as a Service (PaaS) cloud computing model is now widely implemented within Docker containers. Docker uses operating-system-level virtualization to deliver software in packages called containers. Containers are isolated from one another and comprise all the required software, including the operating system API, libraries, and configuration files. Given such advantageous integrity, one may question Docker's performance. The present paper applies the LAPACK package, which is widely used for performance benchmarking of supercomputers, to collect and compare benchmarks of Docker on Linux Ubuntu and MS Windows platforms. After a brief overview of Docker and LAPACK, a series of Docker images containing LAPACK is created and run, and abundant benchmarks are obtained and presented in tabular and graphical form. From the final discussion, we conclude that Docker runs with nearly the same performance on both Linux and Windows platforms; the slowdown does not exceed some ten percent. However, Docker performance on Windows is essentially limited by the amount of RAM allocated to the Docker Engine.
%C Zaporizhzhia
%8 2020-03
%G eng
%0 Conference Paper
%B 9th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20)
%D 2020
%T FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks
%A Linnan Wang
%A Wei Wu
%A Junyu Zhang
%A Hang Liu
%A George Bosilca
%A Maurice Herlihy
%A Rodrigo Fonseca
%K FFT
%K Gradient Compression
%K Lossy Gradients
%K Machine Learning
%K Neural Networks
%X The performance and efficiency of distributed training of Deep Neural Networks (DNN) highly depend on the performance of gradient averaging among participating processes, a step bound by communication costs. There are two major approaches to reducing communication overhead: overlapping communication with computation (lossless), or reducing the communication volume (lossy). The lossless solution works well for linear neural architectures, e.g., VGG and AlexNet, but more recent networks such as ResNet and Inception limit the opportunity for such overlapping. Therefore, approaches that reduce the amount of data (lossy) become more suitable. In this paper, we present a novel, explainable lossy method that sparsifies gradients in the frequency domain, in addition to a new range-based floating-point representation to quantize and further compress gradients. These dynamic techniques strike a balance between compression ratio, accuracy, and computational overhead, and are optimized to maximize performance in heterogeneous environments. Unlike existing works that strive for a higher compression ratio, we stress the robustness of our methods and provide guidance to recover accuracy from failures. To achieve this, we prove how the FFT sparsification affects convergence and accuracy, and show that our method is guaranteed to converge using a diminishing θ in training. Reducing θ can also be used to recover accuracy from failure. Compared to state-of-the-art lossy methods, e.g., QSGD, TernGrad, and Top-k sparsification, our approach incurs less approximation error and thereby performs better in both wall-time and accuracy. On an 8-GPU, InfiniBand-interconnected cluster, our techniques effectively accelerate AlexNet training by up to 2.26x over the no-compression baseline, 1.31x over QSGD, 1.25x over TernGrad, and 1.47x over Top-k sparsification.
%I ACM
%C Stockholm, Sweden
%8 2020-06
%G eng
%R https://doi.org/10.1145/3369583.3392681
%0 Conference Paper
%B IEEE International Conference on Cluster Computing (Cluster 2020)
%D 2020
%T Flexible Data Redistribution in a Task-Based Runtime System
%A Qinglei Cao
%A George Bosilca
%A Wei Wu
%A Dong Zhong
%A Aurelien Bouteiller
%A Jack Dongarra
%X Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal to increase the efficiency and therefore decrease the time-to-solution for the algorithm. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Recently, task-based runtime systems have gained popularity as a potential candidate to address the programming complexity on the way to exascale. In addition to an increase in portability against complex hardware and software systems, task-based runtime systems have the potential to be able to more easily cope with less-regular data distribution, providing a more balanced computational load during the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems, which could support all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm, capable of dealing with redistribution problems without constraints of data distribution and data size and implement it in a task-based runtime system, PaRSEC. Performance results show great capability compared to ScaLAPACK, and applications highlight an increased efficiency with little overhead in terms of data distribution and data size.
%I IEEE
%C Kobe, Japan
%8 2020-09
%G eng
%R https://doi.org/10.1109/CLUSTER49012.2020.00032
%0 Conference Paper
%B IEEE Cluster Conference
%D 2020
%T HAN: A Hierarchical AutotuNed Collective Communication Framework
%A Xi Luo
%A Wei Wu
%A George Bosilca
%A Yu Pei
%A Qinglei Cao
%A Thananon Patinyasakdikul
%A Dong Zhong
%A Jack Dongarra
%X High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy the increasing computational need, and this brings new challenges to the design of MPI libraries, especially with regard to collective operations. To address these challenges, we present "HAN," a new hierarchical autotuned collective communication framework in Open MPI, which selects suitable homogeneous collective communication modules as submodules for each hardware level, uses collective operations from the submodules as tasks, and organizes these tasks to perform efficient hierarchical collective operations. With a task-based design, HAN can easily swap out submodules, while keeping tasks intact, to adapt to new hardware. This makes HAN suitable for the current platform and provides a strong and flexible support for future HPC systems. To provide a fast and accurate autotuning mechanism, we present a novel cost model based on benchmarking the tasks instead of a whole collective operation. This method drastically reduces tuning time, as the cost of tasks can be reused across different message sizes, and is more accurate than existing cost models. Our cost analysis suggests the autotuning component can find the optimal configuration in most cases. The evaluation of the HAN framework suggests our design significantly improves the default Open MPI and achieves decent speedups against state-of-the-art MPI implementations on tested applications.
%I Best Paper Award, IEEE Computer Society Press
%C Kobe, Japan
%8 2020-09
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2020
%T A Set of Batched Basic Linear Algebra Subprograms
%A Ahmad Abdelfattah
%A Timothy Costa
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Mawussi Zounon
%X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning.
%8 2020-10
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2020)
%D 2020
%T Twenty Years of Computational Science
%A Valeria Krzhizhanovskaya
%A Gábor Závodszky
%A Michael Lees
%A Jack Dongarra
%A Peter Sloot
%A Sérgio Brissos
%A João Teixeira
%C Amsterdam, Netherlands
%8 2020-06
%G eng
%0 Conference Paper
%B EuroMPI/USA '20: 27th European MPI Users' Group Meeting
%D 2020
%T Using Advanced Vector Extensions AVX-512 for MPI Reduction
%A Dong Zhong
%A Qinglei Cao
%A George Bosilca
%A Jack Dongarra
%K Instruction level parallelism
%K Intel AVX2/AVX-512
%K Long vector extension
%K MPI reduction operation
%K Single instruction multiple data
%K Vector operation
%X As the scale of high-performance computing (HPC) systems continues to grow, researchers have devoted themselves to exploring increasing levels of parallelism to achieve optimal performance. The modern CPU's design, including its hierarchical memory and SIMD/vectorization capabilities, governs algorithms' efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2, and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefits the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512-optimized reduction operations achieve up to a 10X performance benefit over the Open MPI default for local reduction.
%C Austin, TX
%8 2020-09
%G eng
%R https://doi.org/10.1145/3416315.3416316
%0 Generic
%D 2020
%T Using Advanced Vector Extensions AVX-512 for MPI Reduction (Poster)
%A Dong Zhong
%A George Bosilca
%A Qinglei Cao
%A Jack Dongarra
%I EuroMPI/USA '20: 27th European MPI Users' Group Meeting
%C Austin, TX
%8 2020-09
%G eng
%0 Conference Paper
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020)
%D 2020
%T Using Arm Scalable Vector Extension to Optimize Open MPI
%A Dong Zhong
%A Pavel Shamis
%A Qinglei Cao
%A George Bosilca
%A Jack Dongarra
%K ARMIE
%K datatype pack and unpack
%K local reduction
%K non-contiguous accesses
%K SVE
%K Vector Length Agnostic
%X As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be exploited to achieve optimal performance. As recent processors support wide vector extensions, vectorization becomes much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE) - an optional separate architectural extension with a new set of A64 instruction encodings that enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in the Arm SVE vector Instruction Set Architecture (ISA) and utilize those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node but also achieve a more efficient communication scheme for message exchange. The resulting efforts have been implemented in the context of Open MPI, providing efficient and scalable SVE support and extending the possible uses of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios with both a simulator and Fujitsu's A64FX processor demonstrates that the solution is at the same time generic and efficient.
%B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020)
%I IEEE/ACM
%C Melbourne, Australia
%8 2020-05
%G eng
%R https://doi.org/10.1109/CCGrid49817.2020.00-71
%0 Conference Paper
%B European MPI Users' Group Meeting (EuroMPI '19)
%D 2019
%T Runtime Level Failure Detection and Propagation in HPC Systems
%A Dong Zhong
%A Aurelien Bouteiller
%A Xi Luo
%A George Bosilca
%X As the scale of high-performance computing (HPC) systems continues to grow, the mean time to failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present here the design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault-management capabilities to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is at the same time generic and efficient.
%I ACM
%C Zürich, Switzerland
%8 2019-09
%@ 978-1-4503-7175-9
%G eng
%R https://doi.org/10.1145/3343211.3343225
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2019
%T Solving Linear Diophantine Systems on Parallel Architectures
%A Dmitry Zaitsev
%A Stanimire Tomov
%A Jack Dongarra
%K Mathematical model
%K Matrix decomposition
%K Parallel architectures
%K Petri nets
%K Software algorithms
%K Sparse matrices
%K Task analysis
%X Solving linear Diophantine systems of equations is applied in discrete-event systems, model checking, formal languages and automata, logic programming, cryptography, networking, signal processing, and chemistry. For modeling discrete systems with Petri nets, a solution in non-negative integer numbers is required, which represents an intractable problem. For this reason, solving such kinds of tasks with significant speedup is highly appreciated. In this paper we design a new solver of linear Diophantine systems based on the parallel-sequential composition of the system clans. The solver is studied and implemented to run on parallel architectures using a two-level parallelization concept based on MPI and OpenMP. A decomposable system is usually represented by a sparse matrix; a minimal clan size of the decomposition restricts the granulation of the technique. MPI is applied for solving systems for clans using a parallel-sequential composition on distributed-memory computing nodes, while OpenMP is applied in solving a single indecomposable system on a single node using multiple cores. A dynamic task-dispatching subsystem is developed for distributing systems on nodes in the process of compositional solution. Computational speedups are obtained on a series of test examples, e.g., illustrating that the best value constitutes up to 45 times speedup obtained on 5 nodes with 20 cores each.
%V 30
%P 1158-1169
%8 2019-05
%G eng
%U https://ieeexplore.ieee.org/document/8482295
%N 5
%R https://doi.org/10.1109/TPDS.2018.2873354
%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. We consider extensions beyond the original BLAS standard that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for situations where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1177/1094342018778123
%0 Generic
%D 2018
%T A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN
%A James Ahrens
%A Christopher M. Biwer
%A Alexandru Costan
%A Gabriel Antoniu
%A Maria S. Pérez
%A Nenad Stojanovic
%A Rosa Badia
%A Oliver Beckstein
%A Geoffrey Fox
%A Shantenu Jha
%A Micah Beck
%A Terry Moore
%A Sunita Chandrasekaran
%A Carlos Costa
%A Thierry Deutsch
%A Luigi Genovese
%A Tarek El-Ghazawi
%A Ian Foster
%A Dennis Gannon
%A Toshihiro Hanawa
%A Tevfik Kosar
%A William Kramer
%A Madhav V. Marathe
%A Christopher L. Barrett
%A Takemasa Miyoshi
%A Alex Pothen
%A Ariful Azad
%A Judy Qiu
%A Bo Peng
%A Ravi Teja
%A Sahil Tyagi
%A Chathura Widanage
%A Jon Koskey
%A Maryam Rahnemoonfar
%A Umakishore Ramachandran
%A Miles Deegan
%A William Tang
%A Osamu Tatebe
%A Michela Taufer
%A Michel Cuende
%A Ewa Deelman
%A Trilce Estrada
%A Rafael Ferreira Da Silva
%A Harrel Weinstein
%A Rodrigo Vargas
%A Miwako Tsuji
%A Kevin G. Yager
%A Wanling Gao
%A Jianfeng Zhan
%A Lei Wang
%A Chunjie Luo
%A Daoyi Zheng
%A Xu Wen
%A Rui Ren
%A Chen Zheng
%A Xiwen He
%A Hainan Ye
%A Haoning Tang
%A Zheng Cao
%A Shujie Zhang
%A Jiahui Dai
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2018-11
%G eng
%0 Journal Article
%J Journal of Advances in Modeling Earth Systems
%D 2018
%T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling
%A Jian Sun
%A Joshua Fu
%A John Drake
%A Qingzhao Zhu
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K compiler
%K CUDA
%K data transfer
%K gpu
%K hybrid
%K memory layout
%X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulation with regional detail possible, but limited study has been done to explore the potential benefit for the atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to ~11.02× difference in the computational time. In contrast, the GPU version performs the best when using a combination of fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the implementation of the hybrid CPU/GPU version, but rescheduling the workload among the CPU cores is required before the practical CAM4‐Chem simulation.
%V 10
%P 1952–1969
%8 2018-08
%G eng
%N 8
%R https://doi.org/10.1029/2018MS001276
%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2018)
%D 2018
%T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Panruo Wu
%A Srikara Pranesh
%A Stanimire Tomov
%A Jack Dongarra
%X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both the hardware features and algorithms is an effective solution to achieve power efficiency, and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most of the energy efficiency approaches aim to reduce the consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency. Compared to highly-optimized linear system solvers, our kernels deliver the same accuracy solution up to 2× faster and reduce the energy consumption up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to 4× , with more than 80% reduction in the energy consumption.
%I Springer
%C Wuxi, China
%V 10860
%P 586–600
%8 2018-06
%G eng
%U https://rdcu.be/bcKSC
%R https://doi.org/10.1007/978-3-319-93698-7_45
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Stanimire Tomov
%A Jack Dongarra
%X We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6x for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
%V 29
%P 973–984
%8 2018-05
%G eng
%N 5
%R https://doi.org/10.1109/TPDS.2017.2783929
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Panruo Wu
%A Mawussi Zounon
%A Jack Dongarra
%K linear algebra
%K multithreading
%K runtime
%K symmetric indefinite matrices
%X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention-not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures-including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8-demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 1879–1892
%8 2018-08
%G eng
%N 8
%R 10.1109/TPDS.2018.2808964
%0 Generic
%D 2018
%T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Jack Dongarra
%I ISC High Performance (ISC'18), Best Poster Award
%C Frankfurt, Germany
%8 2018-06
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Pedro Valero-Lara
%A Mawussi Zounon
%K Batched BLAS
%K BLAS
%K High-performance computing
%K Memory management
%K Parallel processing
%K Scientific computing
%X A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double-precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
%B International Conference on Computational Science (ICCS 2017)
%I Elsevier
%C Zürich, Switzerland
%8 2017-06
%G eng
%R 10.1016/j.procs.2017.05.138
%0 Conference Paper
%B ACM MultiMedia Workshop 2017
%D 2017
%T Efficient Communications in Training Large Scale Neural Networks
%A Yiyang Zhao
%A Linnan Wang
%A Wei Wu
%A George Bosilca
%A Richard Vuduc
%A Jinmian Ye
%A Wenqi Tang
%A Zenglin Xu
%X We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominates overall execution time and limits parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) approach scales as O(log P). LP also demonstrates up to 2x the bandwidth of the Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
%B ACM MultiMedia Workshop 2017
%I ACM
%C Mountain View, CA
%8 2017-10
%G eng
%0 Conference Paper
%B Euro-Par 2017
%D 2017
%T Optimized Batched Linear Algebra for Modern Architectures
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Mawussi Zounon
%X Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, 2 × 2 for example, this design exhibits two main issues: it fails to fully utilize the vector units and the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from 2 × 2 to 32 × 32. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
%B Euro-Par 2017
%I Springer
%C Santiago de Compostela, Spain
%8 2017-08
%G eng
%R 10.1007/978-3-319-64203-1_37
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least squares problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2016
%T A Standard for Batched BLAS Routines
%A Pedro Valero-Lara
%A Jack Dongarra
%A Azzam Haidar
%A Samuel D. Relton
%A Stanimire Tomov
%A Mawussi Zounon
%I 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16)
%C Paris, France
%8 2016-04
%G eng
%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf
%0 Journal Article
%J Scalable Computing and Communications: Theory and Practice
%D 2013
%T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Piotr Luszczek
%A Jack Dongarra
%E Samee Khan
%E Lin-Wang Wang
%E Albert Zomaya
%B Scalable Computing and Communications: Theory and Practice
%I John Wiley & Sons
%P 699-735
%8 2013-03
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
%B Concurrency and Computation: Practice and Experience
%8 2013-11
%G eng
%R 10.1002/cpe.3173
%0 Generic
%D 2012
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%B University of Tennessee Computer Science Technical Report (also LAWN 269)
%8 2012-06
%G eng
%0 Journal Article
%J Numerical Mathematics: Theory, Methods and Applications
%D 2010
%T Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D
%A Luc Giraud
%A Azzam Haidar
%A Yousef Saad
%E C. Zhiming
%B Numerical Mathematics: Theory, Methods and Applications
%I Global Science Press
%C Beijing
%V 3
%P 64-82
%8 2010-00
%G eng
%0 Journal Article
%J Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%D 2009
%T Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Dmitrii Zagorodnov
%A Jack Dongarra
%A Rich Wolski
%B Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%I Springer Netherlands
%V 12
%P 101-122
%8 2009-00
%G eng
%0 Journal Article
%J Handbook of Research on Scalable Computing Technologies (to appear)
%D 2009
%T Reliability and Performance Modeling and Analysis for Grid Computing
%A Yuan-Shun Dai
%A Jack Dongarra
%E Kuan-Ching Li
%E Ching-Hsien Hsu
%E Laurence Yang
%E Jack Dongarra
%E Hans Zima
%B Handbook of Research on Scalable Computing Technologies (to appear)
%I IGI Global
%P 219-245
%8 2009-00
%G eng
%0 Conference Proceedings
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%D 2009
%T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
%A Lavanya Ramakrishnan
%A Daniel Nurmi
%A Anirban Mandal
%A Charles Koelbel
%A Dennis Gannon
%A Mark Huang
%A Yang-Suk Kee
%A Graziano Obertelli
%A Kiran Thyagaraja
%A Rich Wolski
%A Asim YarKhan
%A Dmitrii Zagorodnov
%K grads
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%C Portland, OR
%8 2009-00
%G eng
%0 Journal Article
%J Advances in Computers
%D 2008
%T DARPA's HPCS Program: History, Models, Tools, Languages
%A Jack Dongarra
%A Robert Graybill
%A William Harrod
%A Robert Lucas
%A Ewing Lusk
%A Piotr Luszczek
%A Janice McMahon
%A Allan Snavely
%A Jeffrey Vetter
%A Katherine Yelick
%A Sadaf Alam
%A Roy Campbell
%A Laura Carrington
%A Tzu-Yi Chen
%A Omid Khalili
%A Jeremy Meredith
%A Mustafa Tikir
%E M. Zelkowitz
%B Advances in Computers
%I Elsevier
%V 72
%8 2008-01
%G eng
%0 Journal Article
%J Computing and Informatics
%D 2008
%T Interactive Grid-Access Using Gridsolve and Giggle
%A Marcus Hardt
%A Keith Seymour
%A Jack Dongarra
%A Michael Zapf
%A Nicole Ruiter
%K netsolve
%B Computing and Informatics
%V 27
%P 233-248
%@ 1335-9150
%8 2008-00
%G eng
%0 Journal Article
%J Journal of Physics: Conference Series
%D 2006
%T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures
%A Alex Zunger
%A Alberto Franceschetti
%A Gabriel Bester
%A Wesley B. Jones
%A Kwiseon Kim
%A Peter A. Graf
%A Lin-Wang Wang
%A Andrew Canning
%A Osni Marques
%A Christof Voemel
%A Jack Dongarra
%A Julien Langou
%A Stanimire Tomov
%K DOE_NANO
%B Journal of Physics: Conference Series
%V 46
%P 292-298
%8 2006-01
%G eng
%R 10.1088/1742-6596/46/1/040
%0 Journal Article
%J Journal of Physics: Conference Series
%D 2005
%T NanoPSE: A Nanoscience Problem Solving Environment for Atomistic Electronic Structure of Semiconductor Nanostructures
%A Wesley B. Jones
%A Gabriel Bester
%A Andrew Canning
%A Alberto Franceschetti
%A Peter A. Graf
%A Kwiseon Kim
%A Julien Langou
%A Lin-Wang Wang
%A Jack Dongarra
%A Alex Zunger
%X Researchers at the National Renewable Energy Laboratory and their collaborators have developed over the past ~10 years a set of algorithms for an atomistic description of the electronic structure of nanostructures, based on plane-wave pseudopotentials and configuration interaction. The present contribution describes the first step in assembling these various codes into a single, portable, integrated set of software packages. This package is part of an ongoing research project in the development stage. Components of NanoPSE include codes for atomistic nanostructure generation and passivation, valence force field model for atomic relaxation, code for potential field generation, empirical pseudopotential method solver, strained linear combination of bulk bands method solver, configuration interaction solver for excited states, selection of linear algebra methods, and several inverse band structure solvers. Although not available for general distribution at this time as it is being developed and tested, the design goal of the NanoPSE software is to provide a software context for collaboration. The software package is enabled by fcdev, an integrated collection of best practice GNU software for open source development and distribution augmented to better support FORTRAN.
%B Journal of Physics: Conference Series
%P 277-282
%8 2005-06
%G eng
%U https://iopscience.iop.org/article/10.1088/1742-6596/16/1/038/meta
%V 16
%N 1
%R 10.1088/1742-6596/16/1/038
%0 Conference Proceedings
%B In Proceedings of the 2005 SciDAC Conference
%D 2005
%T Performance Analysis of GYRO: A Tool Evaluation
%A Patrick H. Worley
%A Jeff Candy
%A Laura Carrington
%A Kevin Huck
%A Timothy Kaiser
%A Kumar Mahinthakumar
%A Allen D. Malony
%A Shirley Moore
%A Dan Reed
%A Philip C. Roth
%A H. Shan
%A Sameer Shende
%A Allan Snavely
%A S. Sreepathi
%A Felix Wolf
%A Y. Zhang
%K kojak
%B In Proceedings of the 2005 SciDAC Conference
%C San Francisco, CA
%8 2005-06
%G eng
%0 Journal Article
%J Oak Ridge National Laboratory Report
%D 2004
%T Cray X1 Evaluation Status Report
%A Pratul Agarwal
%A R. A. Alexander
%A E. Apra
%A Satish Balay
%A Arthur S. Bland
%A James Colgan
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Tom Dunigan
%A Mark Fahey
%A Al Geist
%A M. Gordon
%A Robert Harrison
%A Dinesh Kaushik
%A M. Krishnakumar
%A Piotr Luszczek
%A Tony Mezzacappa
%A Jeff Nichols
%A Jarek Nieplocha
%A Leonid Oliker
%A T. Packwood
%A M. Pindzola
%A Thomas C. Schulthess
%A Jeffrey Vetter
%A James B. White
%A T. Windus
%A Patrick H. Worley
%A Thomas Zacharia
%B Oak Ridge National Laboratory Report
%V ORNL/TM-2004/13
%8 2004-01
%G eng
%0 Journal Article
%J Engineering the Grid (to appear)
%D 2004
%T An Overview of Heterogeneous High Performance and Grid Computing
%A Jack Dongarra
%A Alexey Lastovetsky
%E Beniamino Di Martino
%E Jack Dongarra
%E Adolfy Hoisie
%E Laurence Yang
%E Hans Zima
%B Engineering the Grid (to appear)
%I Nova Science Publishers, Inc.
%8 2004-00
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2003
%T Computational Science — ICCS 2003
%A Peter M. Sloot
%A David Abramson
%A Alexander V. Bogdanov
%A Jack Dongarra
%A Albert Zomaya
%A Yuriy Gorbachev
%B Lecture Notes in Computer Science
%I Springer-Verlag, Berlin
%C Melbourne, Australia
%V 2657-2660
%8 2003-06
%G eng
%0 Conference Paper
%B PADTAD Workshop, IPDPS 2003
%D 2003
%T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%A Haihang You
%A Min Zhou
%K lacsi
%K papi
%X The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
%B PADTAD Workshop, IPDPS 2003
%I IEEE
%C Nice, France
%8 2003-04
%@ 0-7695-1926-1
%G eng