%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2020
%T Overhead of Using Spare Nodes
%A Atsushi Hori
%A Kazumi Yoshinaga
%A Thomas Herault
%A Aurelien Bouteiller
%A George Bosilca
%A Yutaka Ishikawa
%K communication performance
%K fault mitigation
%K Fault tolerance
%K sliding method
%K spare node
%X With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed around 10 times depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an Infiniband network connected with a FatTree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.
%B The International Journal of High Performance Computing Applications
%8 2020-02
%G eng
%U https://journals.sagepub.com/doi/10.1177/1094342020901885
%! The International Journal of High Performance Computing Applications
%R https://doi.org/10.1177%2F1094342020901885
%0 Generic
%D 2020
%T Performance Tuning SLATE
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-01
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2020
%T Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD
%A Yuechao Lu
%A Ichitaro Yamazaki
%A Fumihiko Ino
%A Yasuyuki Matsushita
%A Stanimire Tomov
%A Jack Dongarra
%K Divide and conquer
%K gpu
%K out-of-core computation
%K Singular value decomposition
%B Concurrency and Computation: Practice and Experience
%8 2020-04
%G eng
%R https://doi.org/10.1002/cpe.5754
%0 Generic
%D 2020
%T SLATE Tutorial
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Ali Charara
%A Jamie Finney
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Jack Dongarra
%I 2020 ECP Annual Meeting
%C Houston, TX
%8 2020-02
%G eng
%0 Generic
%D 2019
%T A Collection of White Papers from the BDEC2 Workshop in San Diego, CA
%A Ilkay Altintas
%A Kyle Marcus
%A Volkan Vural
%A Shweta Purawat
%A Daniel Crawl
%A Gabriel Antoniu
%A Alexandru Costan
%A Ovidiu Marcu
%A Prasanna Balaprakash
%A Rongqiang Cao
%A Yangang Wang
%A Franck Cappello
%A Robert Underwood
%A Sheng Di
%A Justin M. Wozniak
%A Jon C. Calhoun
%A Cong Xu
%A Antonio Lain
%A Paolo Faraboschi
%A Nic Dube
%A Dejan Milojicic
%A Balazs Gerofi
%A Maria Girone
%A Viktor Khristenko
%A Tony Hey
%A Erza Kissel
%A Yu Liu
%A Richard Loft
%A Pekka Manninen
%A Sebastian von Alfthan
%A Takemasa Miyoshi
%A Bruno Raffin
%A Olivier Richard
%A Denis Trystram
%A Maryam Rahnemoonfar
%A Robin Murphy
%A Joel Saltz
%A Kentaro Sano
%A Rupak Roy
%A Kento Sato
%A Jian Guo
%A Jen s Domke
%A Weikuan Yu
%A Takaki Hatsui
%A Yasumasa Joti
%A Alex Szalay
%A William M. Tang
%A Michael R. Wyatt II
%A Michela Taufer
%A Todd Gamblin
%A Stephen Herbein
%A Adam Moody
%A Dong H. Ahn
%A Rich Wolski
%A Chandra Krintz
%A Fatih Bakir
%A Wei-tsung Lin
%A Gareth George
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-10
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Distributed-Memory Lattice H-Matrix Factorization
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Rio Yokota
%A Jack Dongarra
%X We parallelize the LU factorization of a hierarchical low-rank matrix (ℋ-matrix) on a distributed-memory computer. This is much more difficult than the ℋ-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the ℋ-matrix factorization. In this work, we propose to factorize the matrix using a “lattice ℋ-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the ℋ-matrix format. These blocks stored in the ℋ-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of ℋ-matrix factorization. We first compare factorization performances using the ℋ-matrix, BLR, and lattice ℋ-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the ℋ-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice ℋ-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1046–1063
%8 2019-08
%G eng
%N 5
%R https://doi.org/10.1177/1094342019861139
%0 Generic
%D 2019
%T An Empirical View of SLATE Algorithms on Scalable Hybrid System
%A Asim YarKhan
%A Jakub Kurzak
%A Ahmad Abdelfattah
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2019-09
%G eng
%0 Conference Paper
%B PAW-ATM Workshop at SC19
%D 2019
%T Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization
%A Yu Pei
%A George Bosilca
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Jack Dongarra
%B PAW-ATM Workshop at SC19
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist
%D 2019
%T Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%X The emergence of deep learning as a leading computational workload for machine learning tasks on large-scale cloud infrastructure installations has led to plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers on these new platforms makes it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. In order to address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats implemented in modern hardware accelerators and are not necessarily compliant with the IEEE half-precision specification. We include the explanation of a broader context and connections to established IEEE floating-point standards and existing high-performance computing (HPC) benchmarks. We also present a new formulation of LU factorization that we call signed square root LU which produces more numerically balanced L and U factors which directly address the problems of limited range of the low-precision storage formats. The experimental results indicate that it is possible to recover substantial amounts of the accuracy in the system solution that would otherwise be lost. Previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion will also explore the numerical stability issues that are important for robust linear solvers on these new hardware platforms.
%B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist
%I IEEE
%C Waltham, MA
%8 2019-09
%G eng
%1 Best Paper Finalist
%0 Conference Proceedings
%B ACM International Conference on Supercomputing (ICS '19)
%D 2019
%T Least Squares Solvers for Distributed-Memory Machines with GPU Accelerators
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%Y Rudolf Eigenmann
%Y Chen Ding
%Y Sally A. McKee
%B ACM International Conference on Supercomputing (ICS '19)
%I ACM
%C Phoenix, Arizona
%P 117–126
%8 2019-06
%@ 9781450360791
%G eng
%U http://dl.acm.org/citation.cfm?doid=3330345
%R https://doi.org/10.1145/3324989.3325719
%0 Conference Proceedings
%B Euro-Par 2019: Parallel Processing
%D 2019
%T Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators
%A Kurzak, Jakub
%A Gates, Mark
%A Charara, Ali
%A YarKhan, Asim
%A Yamazaki, Ichitaro
%A Jack Dongarra
%E Yahyapour, Ramin
%B Euro-Par 2019: Parallel Processing
%I Springer
%V 11725
%P 495–506
%8 2019-08
%@ 978-3-030-29399-4
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-030-29400-7_35
%R https://doi.org/10.1007/978-3-030-29400-7_35
%0 Conference Paper
%B International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
%A Zhaojun Bai
%A Jack Dongarra
%A Ding Lu
%A Ichitaro Yamazaki
%X Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that reduces both the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. The speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.
%B International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%0 Journal Article
%J Parallel Computing
%D 2019
%T Performance of Asynchronous Optimized Schwarz with One-sided Communication
%A Ichitaro Yamazaki
%A Edmond Chow
%A Aurelien Bouteiller
%A Jack Dongarra
%X In asynchronous iterative methods on distributed-memory computers, processes update their local solutions using data from other processes without an implicit or explicit global synchronization that corresponds to advancing the global iteration counter. In this work, we test the asynchronous optimized Schwarz domain-decomposition iterative method using various one-sided (remote direct memory access) communication schemes with passive target completion. The results show that when one-sided communication is well-supported, the asynchronous version of optimized Schwarz can outperform the synchronous version even for perfectly balanced partitionings of the problem on a supercomputer with uniform nodes.
%B Parallel Computing
%V 86
%P 66-81
%8 2019-08
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819118301261
%R https://doi.org/10.1016/j.parco.2019.05.004
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1145/3264491
%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%X The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components.
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%R https://doi.org/10.1145/3295500.3356223
%0 Generic
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%I International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%C Denver, CO
%8 2019-11
%G eng
%0 Generic
%D 2019
%T SLATE Developers' Guide
%A Ali Charara
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-12
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2019
%T SLATE Mixed Precision Performance Report
%A Ali Charara
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng
%0 Generic
%D 2019
%T SLATE Working Note 12: Implementing Matrix Inversions
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-06
%G eng
%0 Generic
%D 2019
%T SLATE Working Note 13: Implementing Singular Value and Symmetric/Hermitian Eigenvalue Solvers
%A Mark Gates
%A Mohammed Al Farhan
%A Ali Charara
%A Jakub Kurzak
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-09
%G eng
%9 SLATE Working Notes
%0 Book Section
%B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project
%D 2019
%T System Software for Many-Core and Multi-Core Architectures
%A Atsushi Hori
%A Tsujita, Yuichi
%A Shimada, Akio
%A Yoshinaga, Kazumi
%A Mitaro, Namiki
%A Fukazawa, Go
%A Sato, Mikiko
%A George Bosilca
%A Aurelien Bouteiller
%A Thomas Herault
%E Sato, Mitsuhisa
%X In this project, the software technologies for the post-peta scale computing were explored. More specifically, OS technologies for heterogeneous architectures, lightweight thread, scalable I/O, and fault mitigation were investigated. As for the OS technologies, a new parallel execution model, Partitioned Virtual Address Space (PVAS), for the many-core CPU was proposed. For the heterogeneous architectures, where multi-core CPU and many-core CPU are connected with an I/O bus, an extension of PVAS, Multiple-PVAS, to have a unified virtual address space of multi-core and many-core CPUs was proposed. The proposed PVAS was also enhanced to have multiple processes where process context switch can take place at the user level (named User-Level Process: ULP). As for the scalable I/O, EARTH, optimization techniques for MPI collective I/O, was proposed. Lastly, for the fault mitigation, User Level Fault Mitigation, ULFM was improved to have faster agreement process, and sliding methods to substitute failed nodes with spare nodes was proposed. The funding of this project was ended in 2016; however, many proposed technologies are still being propelled.
%B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project
%I Springer Singapore
%C Singapore
%P 59–75
%@ 978-981-13-1924-2
%G eng
%U https://doi.org/10.1007/978-981-13-1924-2_4
%R 10.1007/978-981-13-1924-2_4
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2018
%T Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters
%A Ichitaro Yamazaki
%A Ahmad Abdelfattah
%A Akihiro Ida
%A Satoshi Ohshima
%A Stanimire Tomov
%A Rio Yokota
%A Jack Dongarra
%X ppohBEM is an open-source software package im- plementing the boundary element method. One of its main software tasks is the solution of the dense linear system of equations, for which, ppohBEM relies on another software package called HACApK. To reduce the cost of solving the linear system, HACApK hierarchically compresses the coefficient matrix using adaptive cross approximation. This hierarchical compression greatly reduces the storage and time complexities of the solver and enables the solution of large-scale boundary value problems. To extend the capability of ppohBEM, in this paper, we carefully port the HACApK’s linear solver onto GPU clusters. Though the potential of the GPUs has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUs for a solver, like HACApK’s, that requires fine-grained computation and global communication. First, to utilize the GPUs, we integrate the batched GPU kernel that was recently released in the MAGMA software package. We discuss several techniques to improve the performance of the batched kernel. We then study various techniques to address the inter-GPU communication and study their effects on state-of- the-art GPU clusters. We believe that the techniques studied in this paper are of interest to a wide range of software packages running on GPUs, especially with the increasingly complex node architectures and the growing costs of the communication. We also hope that our efforts to integrate the GPU kernel or to setup the inter-GPU communication will influence the design of the future-generation batched kernels or the communication layer within a software stack.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2018
%T Autotuning Techniques for Performance-Portable Point Set Registration in 3D
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Vasileios Maroulas
%A Jack Dongarra
%X We present an autotuning approach applied to exhaustive performance engineering of the EM-ICP algorithm for the point set registration problem with a known reference. We were able to achieve progressively higher performance levels through a variety of code transformations and an automated procedure of generating a large number of implementation variants. Furthermore, we managed to exploit code patterns that are not common when only attempting manual optimization but which yielded in our tests better performance for the chosen registration algorithm. Finally, we also show how we maintained high levels of the performance rate in a portable fashion across a wide range of hardware platforms including multicore, manycore coprocessors, and accelerators. Each of these hardware classes is much different from the others and, consequently, cannot reliably be mastered by a single developer in a short time required to deliver a close-to-optimal implementation. We assert in our concluding remarks that our methodology as well as the presented tools provide a valid automation system for software optimization tasks on modern HPC hardware.
%B Supercomputing Frontiers and Innovations
%V 5
%8 2018-12
%G eng
%& 42
%R 10.14529/jsfi180404
%0 Generic
%D 2018
%T A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN
%A James Ahrens
%A Christopher M. Biwer
%A Alexandru Costan
%A Gabriel Antoniu
%A Maria S. Pérez
%A Nenad Stojanovic
%A Rosa Badia
%A Oliver Beckstein
%A Geoffrey Fox
%A Shantenu Jha
%A Micah Beck
%A Terry Moore
%A Sunita Chandrasekaran
%A Carlos Costa
%A Thierry Deutsch
%A Luigi Genovese
%A Tarek El-Ghazawi
%A Ian Foster
%A Dennis Gannon
%A Toshihiro Hanawa
%A Tevfik Kosar
%A William Kramer
%A Madhav V. Marathe
%A Christopher L. Barrett
%A Takemasa Miyoshi
%A Alex Pothen
%A Ariful Azad
%A Judy Qiu
%A Bo Peng
%A Ravi Teja
%A Sahil Tyagi
%A Chathura Widanage
%A Jon Koskey
%A Maryam Rahnemoonfar
%A Umakishore Ramachandran
%A Miles Deegan
%A William Tang
%A Osamu Tatebe
%A Michela Taufer
%A Michel Cuende
%A Ewa Deelman
%A Trilce Estrada
%A Rafael Ferreira Da Silva
%A Harrel Weinstein
%A Rodrigo Vargas
%A Miwako Tsuji
%A Kevin G. Yager
%A Wanling Gao
%A Jianfeng Zhan
%A Lei Wang
%A Chunjie Luo
%A Daoyi Zheng
%A Xu Wen
%A Rui Ren
%A Chen Zheng
%A Xiwen He
%A Hainan Ye
%A Haoning Tang
%A Zheng Cao
%A Shujie Zhang
%A Jiahui Dai
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2018-11
%G eng
%0 Generic
%D 2018
%T Initial Integration and Evaluation of SLATE and STRUMPACK
%A Pieter Ghysels
%A Sherry Li
%A Asim YarKhan
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng
%0 Generic
%D 2018
%T Initial Integration and Evaluation of SLATE Parallel BLAS in LATTE
%A Asim YarKhan
%A Gerald Ragghianti
%A Jack Dongarra
%A Marc Cawkwell
%A Danny Perez
%A Arthur Voter
%B Innovative Computing Laboratory Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%0 Journal Article
%J Concurrency Computation: Practice and Experience
%D 2018
%T Investigating Power Capping toward Energy-Efficient Scientific Applications
%A Azzam Haidar
%A Heike Jagode
%A Phil Vaccaro
%A Asim YarKhan
%A Stanimire Tomov
%A Jack Dongarra
%K energy efficiency
%K High Performance Computing
%K Intel Xeon Phi
%K Knights landing
%K papi
%K performance analysis
%K Performance Counters
%K power efficiency
%X The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. In this paper, we study and investigate the practice of controlling a compute system's power usage, and we explore how different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes. Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms.
%B Concurrency Computation: Practice and Experience
%V 2018
%P 1-14
%8 2018-04
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4485
%N e4485
%R https://doi.org/10.1002/cpe.4485
%0 Generic
%D 2018
%T Least Squares Performance Report
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-12
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2018
%T MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster
%C Dallas, TX
%8 2018-11
%G eng
%0 Generic
%D 2018
%T MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR)
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Ichitaro Yamazaki
%A Jack Dongarra
%I NSF PI Meeting, Poster
%C Washington, DC
%8 2018-04
%G eng
%R https://doi.org/10.6084/m9.figshare.6174143.v3
%0 Generic
%D 2018
%T Parallel BLAS Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I University of Tennessee
%8 2018-04
%G eng
%0 Generic
%D 2018
%T Parallel Norms Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%0 Generic
%D 2018
%T Production Implementations of Pipelined & Communication-Avoiding Iterative Linear Solvers
%A Mark Hoemmen
%A Ichitaro Yamazaki
%I SIAM Conference on Parallel Processing for Scientific Computing
%C Tokyo, Japan
%8 2018-03
%G eng
%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 2018-11
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732
%0 Generic
%D 2018
%T Software-Defined Events (SDEs) in MAGMA-Sparse
%A Heike Jagode
%A Anthony Danalis
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Stanimire Tomov
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng
%0 Generic
%D 2018
%T Solver Interface & Performance on Cori
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Panruo Wu
%A Mawussi Zounon
%A Jack Dongarra
%K linear algebra
%K multithreading
%K runtime
%K symmetric indefinite matrices
%X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention-not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures-including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8-demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 1879–1892
%8 2018-08
%G eng
%N 8
%R 10.1109/TPDS.2018.2808964
%0 Journal Article
%J International Journal of Computational Science and Engineering (IJCSE)
%D 2018
%T Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP
%A Joseph Dorris
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X The increasing number of computational cores in modern many-core processors, as represented by the Intel Xeon Phi architectures, has created the need for an open-source, high performance and scalable task-based dense linear algebra package that can efficiently use this type of many-core hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra library, run effectively on two generations of Intel's Xeon Phi architecture, known as knights corner (KNC) and knights landing (KNL). First, we modified PLASMA's tiled Cholesky decomposition to use OpenMP tasks for its scheduling mechanism to enable Xeon Phi compatibility. We then compared the performance of our modified code to that of the original dynamic scheduler running on an Intel Xeon Sandy Bridge CPU. Finally, we looked at the performance of the OpenMP tiled Cholesky decomposition on knights corner and knights landing processors. We detail the optimisations required to obtain performance on these platforms and compare with the highly tuned Intel MKL math library.
%B International Journal of Computational Science and Engineering (IJCSE)
%V 17
%8 2018-10
%G eng
%R http://dx.doi.org/10.1504/IJCSE.2018.095851
%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4
%0 Generic
%D 2017
%T Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%I SIAM Annual Meeting
%C Pittsburgh, Pennsylvania
%8 2017-07
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101
%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%0 Conference Paper
%B ACM MultiMedia Workshop 2017
%D 2017
%T Efficient Communications in Training Large Scale Neural Networks
%A Yiyang Zhao
%A Linnan Wan
%A Wei Wu
%A George Bosilca
%A Richard Vuduc
%A Jinmian Ye
%A Wenqi Tang
%A Zenglin Xu
%X We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominates overall execution time and limits parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, where P is the number of GPUs, while the cost of more conventional Minimum Spanning Tree (MST) scales like O(logP). LP also demonstrate up to 2x faster bandwidth than Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
%B ACM MultiMedia Workshop 2017
%I ACM
%C Mountain View, CA
%8 2017-10
%G eng
%0 Conference Proceedings
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%D 2017
%T Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%X We compare the performance of pipelined and s-step GMRES, respectively referred to as l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of two algorithms, we propose another variant, (l, t)-GMRES, that not only does fewer global all-reduces than standard GMRES, but also overlaps those all-reduces with other work. We implemented the thread-parallelism and communication-overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, though we only used 50 nodes, when the latency cost became significant, our variant performed up to 1.22× better than s-GMRES by hiding all-reduces.
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%C Orlando, FL
%8 2017-06
%G eng
%R https://doi.org/10.1109/IPDPSW.2017.65
%0 Generic
%D 2017
%T LAWN 294: Aasen's Symmetric Indenite Linear Solvers in LAPACK
%A Ichitaro Yamazaki
%A Jack Dongarra
%X Recently, we released two LAPACK subroutines that implement Aasen's algorithms for solving a symmetric indefinite linear system of equations. The first implementation is based on a partitioned right-looking variant of Aasen's algorithm (the column-wise left-looking panel factorization, followed by the right-looking trailing submatrix update using the panel). The second implements the two-stage left-looking variant of the algorithm (the block-wise left- looking algorithm that reduces the matrix to the symmetric band form, followed by the band LU factorization). In this report, we discuss our implementations and present our experimental results to compare the stability and performance of these two new solvers with those of the other two symmetric indefinite solvers in LAPACK (i.e., the Bunch-Kaufman and rook pivoting algorithms).
%B LAPACK Working Note
%I University of Tennessee
%8 2017-12
%G eng
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist
%D 2017
%T Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi
%A Azzam Haidar
%A Heike Jagode
%A Asim YarKhan
%A Phil Vaccaro
%A Stanimire Tomov
%A Jack Dongarra
%X The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers in particular for peta- and exa- scale systems. Understanding and improving the energy efficiency of numerical simulation becomes very crucial. We present a detailed study and investigation toward control- ling power usage and exploring how different power caps affect the performance of numerical algorithms with different computa- tional intensities, and determine the impact and correlation with performance of scientific applications. Our analyses is performed using a set of representatives kernels, as well as many highly used scientific benchmarks. We quantify a number of power and performance measurements, and draw observations and conclusions that can be viewed as a roadmap toward achieving energy efficiency computing algorithms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091085
%0 Generic
%D 2017
%T Power-Aware HPC on Intel Xeon Phi KNL Processors
%A Azzam Haidar
%A Heike Jagode
%A Asim YarKhan
%A Phil Vaccaro
%A Stanimire Tomov
%A Jack Dongarra
%I ISC High Performance (ISC17), Intel Booth Presentation
%C Frankfurt, Germany
%8 2017-06
%G eng
%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%0 Conference Paper
%B IEEE International Conference on Big Data
%D 2017
%T Sampling Algorithms to Update Truncated SVD
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B IEEE International Conference on Big Data
%I IEEE
%C Boston, MA
%8 2017-12
%G eng
%0 Conference Paper
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%D 2017
%T Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Jack Dongarra
%X In this article, we present an autotuning approach applied to systematic performance engineering of the EM-ICP (Expectation-Maximization Iterative Closest Point) algorithm for the point set registration problem. We show how we were able to exceed the performance achieved by the reference code through multiple dependence transformations and automated procedure of generating and evaluating numerous implementation variants. Furthermore, we also managed to exploit code transformations that are not that common during manual optimization but yielded better performance in our tests for the EM-ICP algorithm. Finally, we maintained high levels of performance rate in a portable fashion across a wide range of HPC hardware platforms including multicore, many-core, and GPU-based accelerators. More importantly, the results indicate consistently high performance level and ability to move the task of data analysis through point-set registration to any modern compute platform without the concern of inferior asymptotic efficiency.
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%I IEEE
%C Boston, MA
%8 2017-12
%G eng
%R https://doi.org/10.1109/BigData.2017.8258258
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2017
%T Solving Dense Symmetric Indefinite Systems using GPUs
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%X This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent synchronizations and irregular data accesses. Until recently, there has not been any implementation of these algorithms on a hybrid CPU/GPU architecture. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive data transfer and synchronization between the CPU and GPU, or on the GPU (e.g., factorizing the matrix entirely on the GPU or in a communication-avoiding fashion). We also study the performance of the solver using iterative refinements along with the factorization without pivoting combined with the preprocessing technique based on random butterfly transformations, or with the mixed-precision algorithm where the matrix is factorized in single precision. This randomization algorithm only has a probabilistic proof on the numerical stability, and for this paper, we only focused on the mixed-precision algorithm without pivoting. However, they demonstrate that we can obtain good performance on the GPU by avoiding the pivoting and using the lower precision arithmetics, respectively. As illustrated with the application in acoustics studied in this paper, in many practical cases, the matrices can be factorized without pivoting. Because the componentwise backward error computed in the iterative refinement signals when the algorithm failed to obtain the desired accuracy, the user can use these potentially unstable but efficient algorithms in most of the cases and fall back to a more stable algorithm with pivoting only in the case of the failure.
%B Concurrency and Computation: Practice and Experience
%V 29
%8 2017-03
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.4055/full
%N 9
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.4055
%0 Journal Article
%J IEEE Embedded Systems Letters
%D 2017
%T Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K Karush Kuhn Tucker (KKT)
%K Realtime embedded convex optimization solver
%X With the increasing sophistication in the use of optimization algorithms such as deep learning on embedded systems, the convex optimization solvers on embedded systems have found widespread use. This letter presents a novel linear solver technique to reduce the run-time of convex optimization solver by using the property that some parameters are fixed during the solution iterations of a solve instance. Our experimental results show that the run-time can be reduced by two orders of magnitude.
%B IEEE Embedded Systems Letters
%V 9
%P 61–64
%8 2017-05
%G eng
%U http://ieeexplore.ieee.org/document/7917357/
%N 3
%R 10.1109/LES.2017.2700401
%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2017
%T Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A Jack Dongarra
%X With NVIDA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091031
%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48
%0 Book Section
%B Lecture Notes in Computer Science
%D 2016
%T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%E Roman Wyrzykowski
%E Ewa Deelman
%E Konrad Karczewski
%E Jacek Kitowski
%E Kazimierz Wiatr
%K Communication-avoiding
%K Dense symmetric indefinite factorization
%K gpu computation
%K randomization
%X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen’s algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with the preprocessing technique based on Random Butterfly Transformations. Though such transformations only have probabilistic results on the numerical stability, they avoid the pivoting and obtain a great performance on the GPU.
%B Lecture Notes in Computer Science
%S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I
%I Springer International Publishing
%V 9573
%P 86-95
%8 2015-09
%@ 978-3-319-32149-3
%G eng
%& Parallel Processing and Applied Mathematics
%R 10.1007/978-3-319-32149-3_9
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams1 and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Generic
%D 2016
%T High Performance Realtime Convex Solver for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K KKT
%K Realtime embedded convex optimization solver
%X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique to reduce the run-time of decomposition of KKT matrix for the convex optimization solver for an embedded system, by two orders of magnitude. We use the property that although the KKT matrix changes, some of its block sub-matrices are fixed during the solution iterations and the associated solving instances.
%B University of Tennessee Computer Science Technical Report
%8 2016-10
%G eng
%0 Journal Article
%J Acta Numerica
%D 2016
%T Linear Algebra Software for Large-Scale Accelerated Multicore Computing
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A undefined
%A Asim YarKhan
%X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
%B Acta Numerica
%V 25
%P 1-160
%8 2016-05
%G eng
%R 10.1017/S0962492916000015
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Non-GPU-resident Dense Symmetric Indefinite Factorization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of the data movement into the memory: one needed for selecting and applying pivots and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance of such an algorithm when the pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry, which ensures the stability, may need to be selected as a pivot among the remaining diagonals, and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, the data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would require to update each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization compared with an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires to swap the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization by the symmetric factorization). To reduce the amount of the data transfer, in this paper we uses the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require to load the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization.Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms.
%B Concurrency and Computation: Practice and Experience
%8 2016-11
%G eng
%R 10.1002/cpe.4012
%0 Journal Article
%J International Journal of Parallel Programming
%D 2016
%T Porting the PLASMA Numerical Library to the OpenMP Standard
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which allows to transition PLASMA from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such transition.
%B International Journal of Parallel Programming
%8 2016-06
%G eng
%U http://link.springer.com/10.1007/s10766-016-0441-6http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6.pdfhttp://link.springer.com/article/10.1007/s10766-016-0441-6/fulltext.html
%! Int J Parallel Prog
%R 10.1007/s10766-016-0441-6
%0 Conference Proceedings
%B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany
%D 2016
%T Power Management and Event Verification in PAPI
%A Heike Jagode
%A Asim YarKhan
%A Anthony Danalis
%A Jack Dongarra
%X For more than a decade, the PAPI performance monitoring library has helped to implement the familiar maxim attributed to Lord Kelvin: “If you cannot measure it, you cannot improve it.” Widely deployed and widely used, PAPI provides a generic, portable interface for the hardware performance counters available on all modern CPUs and some other components of interest that are scattered across the chip and system. Recent and radical changes in processor and system design—systems that combine multicore CPUs and accelerators, shared and distributed memory, PCI- express and other interconnects—as well as the emergence of power efficiency as a primary design constraint, and reduced data movement as a primary programming goal, pose new challenges and bring new opportunities to PAPI. We discuss new developments of PAPI that allow for multiple sources of performance data to be measured simultaneously via a common software interface. Specifically, a new PAPI component that controls power is discussed. We explore the challenges of shared hardware counters that include system-wide measurements in existing multicore architectures. We conclude with an exploration of future directions for the PAPI interface.
%B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany
%I Springer International Publishing
%C Dresden, Germany
%P pp. 41-51
%@ 978-3-319-39589-0
%G eng
%R https://doi.org/10.1007/978-3-319-39589-0_4
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2016
%T Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X To orthonormalize a set of dense vectors, Singular Value QR (SVQR) requires only one global reduction between the parallel processing units, and uses BLAS-3 kernels to perform most of its local computation. As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many of the current computers. In this paper, we study the stability and performance of various SVQR implementations on multicore CPUs with a GPU, focusing on the dense triangular solve, which performs half of the total floating-point operations in SVQR. As a part of this study, we examine its adaptive mixed-precision variant that decides if a lower-precision arithmetic can be used for the triangular solution at runtime without increasing the order of its orthogonality error. Since the backward error of this adaptive mixed-precision variant is significantly greater than that of the standard SVQR, we study its effects on the solution convergence of several subspace projection methods for solving a linear system of equations and for computing singular values or eigenvalues of a sparse matrix. Our experimental results indicate that in some cases, the convergence rate of the solver may not be affected by the larger backward errors, while reducing the time to solution.
%B ACM Transactions on Mathematical Software (TOMS)
%V 43
%8 2016-10
%G eng
%N 2
%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%D 2015
%T Cholesky Across Accelerators
%A Asim YarKhan
%A Azzam Haidar
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%I IEEE
%C Elizabeth, NJ
%8 2015-08
%G eng
%0 Journal Article
%J Scientific Programming
%D 2015
%T Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X Low-rank matrices arise in many scientific and engineering computation. Both computational and storage costs of manipulating such matrices may be reduced by taking advantages of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently-developed software StruMF which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%-50% using the GPU.
%B Scientific Programming
%G eng
%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications
%D 2015
%T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization
%A Azzam Haidar
%A Asim YarKhan
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with degree of parallelism between GPUs and multicore-CPUs. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriate sized workload-grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.
%B 17th IEEE International Conference on High Performance Computing and Communications
%C Newark, NJ
%8 2015-08
%G eng
%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng
%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2015
%T Mixed-precision Block Gram Schmidt Orthogonalization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%A Jesse Barlow
%X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly to the condition number of the input matrix. However, when the desired higher-precision is not supported by the hardware, the software-emulated arithmetics are needed, which could significantly increase its computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetics, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as a hybrid CPU/GPU cluster, demonstrate that compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7:1 while maintaining about the same order of the numerical errors.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2015
%T Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many of the current computer architectures, where the communication is becoming increasingly expensive compared to the arithmetic operations. This is especially true when the input matrix is tall-skinny. Unfortunately, the orthogonality error of CholQR depends quadratically on the condition number of the input matrix, and it is numerically unstable when the matrix is ill-conditioned. To enhance the stability of CholQR, we recently used mixed-precision arithmetic; the input and output matrices are in the working precision, but some of its intermediate results are accumulated in the doubled precision. In this paper, we analyze the numerical properties of this mixed-precision CholQR. Our analysis shows that by selectively using the doubled precision, the orthogonality error of the mixed-precision CholQR only depends linearly on the condition number of the input matrix. We provide numerical results to demonstrate the improved numerical stability of the mixed-precision CholQR in practice. We then study its performance. When the target hardware does not support the desired higher precision, software emulation is needed. For example, using software-emulated double-double precision for the working 64-bit double precision, the mixed-precision CholQR requires about 8.5x more floating-point instructions than that required by the standard CholQR. On the other hand, the increase in the communication cost using the double-double precision is less significant, and our performance results on multicore CPU with a different graphics processing unit (GPU) demonstrate that the overhead of using the double-double arithmetic is decreasing on a newer architecture, where the computation is becoming less expensive compared to the communication. As a result, with a latest NVIDIA GPU, the mixed-precision CholQR was only 1.4x slower than the standard CholQR. Finally, we present case studies of using the mixed-precision CholQR within communication-avoiding variants of Krylov subspace projection methods for solving a nonsymmetric linear system of equations and for solving a symmetric eigenvalue problem, on a multicore CPU with multiple GPUs. These case studies demonstrate that by using the higher precision for this small but critical segment of the Krylov methods, we can improve not only the overall numerical stability of the solvers but also, in some cases, their performance.
%B SIAM Journal on Scientific Computing
%V 37
%P C203-C330
%8 2015-05
%G eng
%R DOI:10.1137/14M0973773
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs
%A Ichitaro Yamazaki
%A Jesse Barlow
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 2015-10
%G eng
%R 10.14529/jsfi1504
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
%A Theo Mary
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 2015-04
%G eng
%N 5
%R 10.1002/cpe.3306
%0 Conference Proceedings
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%D 2015
%T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
%A Azzam Haidar
%A Yulu Jia
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%A Jack Dongarra
%K dataflow scheduling
%K hardware accelerators
%K multi-grain parallelism
%X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%I ACM
%C Austin, TX
%V No. 5
%8 2015-11
%G eng
%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1
%0 Conference Paper
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%D 2014
%T Access-averse Framework for Computing Low-rank Matrix Approximations
%A Ichitaro Yamazaki
%A Theo Mary
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%C Washington, DC
%8 2014-10
%G eng
%0 Journal Article
%J SIAM Journal on Matrix Analysis and Application
%D 2014
%T Communication-Avoiding Symmetric-Indefinite Factorization
%A Grey Ballard
%A Dulceneia Becker
%A James Demmel
%A Jack Dongarra
%A Alex Druinsky
%A I Peled
%A Oded Schwartz
%A Sivan Toledo
%A Ichitaro Yamazaki
%X We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen’s triangular tridiagonalization. It factors a dense symmetric matrix A as the product A = P LT L T P T where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.
%B SIAM Journal on Matrix Analysis and Application
%V 35
%P 1364-1406
%8 2014-07
%G eng
%N 4
%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2014
%T Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C New Orleans, LA
%8 2014-11
%G eng
%0 Conference Paper
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%D 2014
%T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K dataflow
%K message-passing
%K multithreading
%K QR decomposition
%K runtime
%K systolic array
%X A systolic array provides an alternative computing paradigm to the von Neuman architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%D 2014
%T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Sivasankaran Rajamanickam
%A Eric G. Boman
%A Mark Hoemmen
%A Michael A. Heroux
%A Stanimire Tomov
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Improving the performance of CA-GMRES on multicores with multiple GPUs
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Hoemmen
%A Jack Dongarra
%X Abstract—The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B VECPAR 2014 (Best Paper)
%D 2014
%T Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Tingxing Dong
%A Jack Dongarra
%X We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32 or 64-bit floating-point precision, but uses higher-precision arithmetics to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetics, and requires about 20x more computation but about the same amount of communication as the standard orthogonalization scheme. Since the computation is becoming less expensive compared to the communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed-precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance.
%B VECPAR 2014 (Best Paper)
%C Eugene, OR
%8 2014-06
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2014
%T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems
%A Jack Dongarra
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%K dense linear algebra
%K hardware accelerators
%K task superscalar scheduling
%X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead have become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment have been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware.
%B Supercomputing Frontiers and Innovations
%V 1
%G eng
%N 1
%R http://dx.doi.org/10.14529/jsfi1401
%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Optimizing Krylov Subspace Solvers on Graphics Processing Units
%A Stanimire Tomov
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Hartwig Anzt
%A William Sawyer
%X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPUhost communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of highlevel sparse linear algebra libraries.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%D 2014
%T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors
%A Azzam Haidar
%A Chongxiao Cao
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%R 10.1109/ScalA.2014.8
%0 Generic
%D 2014
%T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson.
%B University of Tennessee EECS Technical Report
%I University of Tennessee
%8 2014-11
%G eng
%0 Journal Article
%J Journal of Supercomputing
%D 2013
%T Enabling Workflows in GridSolve: Request Sequencing and Service Trading
%A Yinan Li
%A Asim YarKhan
%A Jack Dongarra
%A Keith Seymour
%A Aurlie Hurault
%K grid computing
%K gridpac
%K netsolve
%K service trading
%K workflow applications
%X GridSolve employs a RPC-based client-agent-server model for solving computational problems. There are two deficiencies associated with GridSolve when a computational problem essentially forms a workflow consisting of a sequence of tasks with data dependencies between them. First, intermediate results are always passed through the client, resulting in unnecessary data transport. Second, since the execution of each individual task is a separate RPC session, it is difficult to enable any potential parallelism among tasks. This paper presents a request sequencing technique that addresses these deficiencies and enables workflow executions. Building on the request sequencing work, one way to generate workflows is by taking higher level service requests and decomposing them into a sequence of simpler service requests using a technique called service trading. A service trading component is added to GridSolve to take advantage of the new dynamic request sequencing. The features described here include automatic DAG construction and data dependency analysis, direct interserver data transfer, parallel task execution capabilities, and a service trading component.
%B Journal of Supercomputing
%V 64
%P 1133-1152
%8 2013-06
%G eng
%N 3
%& 1133
%R 10.1007/s11227-010-0549-1
%0 Journal Article
%J IPDPS 2013 (submitted)
%D 2013
%T Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures
%A Ichitaro Yamazaki
%A Dulceneia Becker
%A Jack Dongarra
%A Alex Druinsky
%A I. Peled
%A Sivan Toledo
%A Grey Ballard
%A James Demmel
%A Oded Schwartz
%X Factorization of a dense symmetric indeﬁnite matrix is a key computational kernel in many scientiﬁc and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen’s algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
%B IPDPS 2013 (submitted)
%C Boston, MA
%8 2013-00
%G eng
%0 Book Section
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%D 2013
%T Keeneland: Computational Science Using Heterogeneous GPU Computing
%A Jeffrey Vetter
%A Richard Glassbrook
%A Karsten Schwan
%A Sudha Yalamanchili
%A Mitch Horton
%A Ada Gavrilovska
%A Magda Slawinska
%A Jack Dongarra
%A Jeremy Meredith
%A Philip Roth
%A Kyle Spafford
%A Stanimire Tomov
%A John Wynkoop
%X The Keeneland Project is a five year Track 2D grant awarded by the National Science Foundation (NSF) under solicitation NSF 08-573 in August 2009 for the development and deployment of an innovative high performance computing system. The Keeneland project is led by the Georgia Institute of Technology (Georgia Tech) in collaboration with the University of Tennessee at Knoxville, National Institute of Computational Sciences, and Oak Ridge National Laboratory.
%B Contemporary High Performance Computing: From Petascale Toward Exascale
%S CRC Computational Science Series
%I Taylor and Francis
%C Boca Raton, FL
%G eng
%& 7
%0 Journal Article
%J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%D 2013
%T Multithreading in the PLASMA Library
%A Jakub Kurzak
%A Piotr Luszczek
%A Asim YarKhan
%A Mathieu Faverge
%A Julien Langou
%A Henricus Bouwmeester
%A Jack Dongarra
%E Mohamed Ahmed
%E Reda Ammar
%E Sanguthevar Rajasekaran
%B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%I Taylor & Francis
%8 2013-00
%G eng
%0 Conference Paper
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%D 2013
%T Standards for Graph Algorithm Primitives
%A Tim Mattson
%A David Bader
%A Jon Berry
%A Aydin Buluc
%A Jack Dongarra
%A Christos Faloutsos
%A John Feo
%A John Gilbert
%A Joseph Gonzalez
%A Bruce Hendrickson
%A Jeremy Kepner
%A Charles Lieserson
%A Andrew Lumsdaine
%A David Padua
%A Steve W. Poole
%A Steve Reinhardt
%A Mike Stonebraker
%A Steve Wallach
%A Andrew Yoo
%K algorithms
%K graphs
%K linear algebra
%K software standards
%X It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
%B 17th IEEE High Performance Extreme Computing Conference (HPEC '13)
%I IEEE
%C Waltham, MA
%8 2013-09
%G eng
%R 10.1109/HPEC.2013.6670338
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Raffaele Solcà
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%X For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher-levels of software stacks, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques.
%B Concurrency and Computation: Practice and Experience
%8 2013-10
%G eng
%0 Conference Paper
%B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%D 2013
%T Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Stanimire Tomov
%A Jack Dongarra
%B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%8 2013-05
%G eng
%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 2013-05
%G eng
%R 10.1109/IPDPS.2013.119
%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of the Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng
%0 Generic
%D 2012
%T Dynamic Task Execution on Shared and Distributed Memory Architectures
%A Asim YarKhan
%X Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. But, standard programming techniques are not well adapted to this change in computer architecture design. In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines. This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution. We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-ofthe-art commercial and research libraries. We develop a runtime environment for distributed memory multicore architectures extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment.
%9 Dissertation
%0 Generic
%D 2012
%T MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
%A Stanimire Tomov
%A Jack Dongarra
%A Azzam Haidar
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Thomas Schulthess
%A Raffaele Solcà
%I GPU Technology Conference (GTC12), Presentation
%C San Jose, CA
%8 2012-05
%G eng
%0 Generic
%D 2012
%T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
%A Jack Dongarra
%A Tingxing Dong
%A Mark Gates
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation
%C Salt Lake City, UT
%8 2012-11
%G eng
%0 Conference Proceedings
%B The International Conference on Computational Science (ICCS)
%D 2012
%T One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B The International Conference on Computational Science (ICCS)
%8 2012-06
%G eng
%0 Generic
%D 2011
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaeif
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243)
%8 2011-00
%G eng
%0 Journal Article
%J TeraGrid'11
%D 2011
%T Autotuned Parallel I/O for Highly Scalable Biosequence Analysis
%A Haihang You
%A Bhanu Rekapalli
%A Qing Liu
%A Shirley Moore
%B TeraGrid'11
%C Salt Lake City, Utah
%8 2011-07
%G eng
%0 Conference Proceedings
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%D 2011
%T The Design of an Auto-tuning I/O Framework on Cray XT5 System
%A Haihang You
%A Qing Liu
%A Zhiqiang Li
%A Shirley Moore
%K gco
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%C Fairbanks, Alaska
%8 2011-05
%G eng
%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 2011-05
%G eng
%0 Journal Article
%J International Journal of High Performance Computing
%D 2011
%T The International Exascale Software Project Roadmap
%A Jack Dongarra
%A Pete Beckman
%A Terry Moore
%A Patrick Aerts
%A Giovanni Aloisio
%A Jean-Claude Andre
%A David Barkai
%A Jean-Yves Berthou
%A Taisuke Boku
%A Bertrand Braunschweig
%A Franck Cappello
%A Barbara Chapman
%A Xuebin Chi
%A Alok Choudhary
%A Sudip Dosanjh
%A Thom Dunning
%A Sandro Fiore
%A Al Geist
%A Bill Gropp
%A Robert Harrison
%A Mark Hereld
%A Michael Heroux
%A Adolfy Hoisie
%A Koh Hotta
%A Zhong Jin
%A Yutaka Ishikawa
%A Fred Johnson
%A Sanjay Kale
%A Richard Kenway
%A David Keyes
%A Bill Kramer
%A Jesus Labarta
%A Alain Lichnewsky
%A Thomas Lippert
%A Bob Lucas
%A Barney MacCabe
%A Satoshi Matsuoka
%A Paul Messina
%A Peter Michielse
%A Bernd Mohr
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%A Hiroshi Nakashima
%A Michael E. Papka
%A Dan Reed
%A Mitsuhisa Sato
%A Ed Seidel
%A John Shalf
%A David Skinner
%A Marc Snir
%A Thomas Sterling
%A Rick Stevens
%A Fred Streitz
%A Bob Sugar
%A Shinji Sumimoto
%A William Tang
%A John Taylor
%A Rajeev Thakur
%A Anne Trefethen
%A Mateo Valero
%A Aad van der Steen
%A Jeffrey Vetter
%A Peg Williams
%A Robert Wisniewski
%A Kathy Yelick
%X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/ exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combing their efforts in a coordinated International Exascale Software Project.
%B International Journal of High Performance Computing
%V 25
%P 3-60
%8 2011-01
%G eng
%R https://doi.org/10.1177/1094342010391989
%0 Generic
%D 2011
%T Power-aware Computing on GPGPUs
%A Kiran Kasichayanula
%A Haihang You
%A Shirley Moore
%A Stanimire Tomov
%A Heike Jagode
%A Matt Johnson
%I Fall Creek Falls Conference, Poster
%C Gatlinburg, TN
%8 2011-09
%G eng
%0 Generic
%D 2011
%T QUARK Users' Guide: QUeueing And Runtime for Kernels
%A Asim YarKhan
%A Jakub Kurzak
%A Jack Dongarra
%K magma
%K plasma
%K quark
%B University of Tennessee Innovative Computing Laboratory Technical Report
%8 2011-00
%G eng
%0 Journal Article
%J Submitted to Concurrency and Computations: Practice and Experience
%D 2010
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaeif
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B Submitted to Concurrency and Computations: Practice and Experience
%8 2010-11
%G eng
%0 Journal Article
%J Tools for High Performance Computing 2009
%D 2010
%T Collecting Performance Data with PAPI-C
%A Dan Terpstra
%A Heike Jagode
%A Haihang You
%A Jack Dongarra
%K mumi
%K papi
%X Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface.
%B Tools for High Performance Computing 2009
%I Springer Berlin / Heidelberg
%C 3rd Parallel Tools Workshop, Dresden, Germany
%P 157-173
%8 2010-05
%G eng
%R https://doi.org/10.1007/978-3-642-11261-4_11
%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 2010-09
%G eng
%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 2010-00
%G eng
%0 Journal Article
%J VECPAR 2010, 9th International Meeting on High Performance Computing for Computational Science
%D 2010
%T Intelligent Service Trading and Brokering for Distributed Network Services in GridSolve
%A Aurlie Hurault
%A Asim YarKhan
%K gridpac
%K netsolve
%B VECPAR 2010, 9th International Meeting on High Performance Computing for Computational Science
%C Berkeley, CA
%8 2010-06
%G eng
%0 Journal Article
%J Parallel Computing
%D 2010
%T Using multiple levels of parallelism to enhance the performance of domain decomposition solvers
%A Luc Giraud
%A Azzam Haidar
%A Stephane Pralet
%E Costas Bekas
%E Pascua D’Ambra
%E Ananth Grama
%E Yousef Saad
%E Petko Yanev
%B Parallel Computing
%I Elsevier journals
%V 36
%P 285-296
%8 2010-00
%G eng
%0 Journal Article
%J SciDAC Review
%D 2009
%T Accelerating Time-To-Solution for Computational Science and Engineering
%A James Demmel
%A Jack Dongarra
%A Armando Fox
%A Sam Williams
%A Vasily Volkov
%A Katherine Yelick
%B SciDAC Review
%8 2009-00
%G eng
%0 Conference Proceedings
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%D 2009
%T Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems
%A Fengguang Song
%A Asim YarKhan
%A Jack Dongarra
%K mumi
%K plasma
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%C Portland, OR
%8 2009-11
%G eng
%0 Conference Proceedings
%B Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering
%D 2009
%T Grid Computing applied to the Boundary Element Method
%A Manoel Cunha
%A Jose Telles
%A Asim YarKhan
%A Jack Dongarra
%E B. H. V. Topping
%E Peter Iványi
%K netsolve
%B Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering
%I Civil-Comp Press
%C Stirlingshire, UK
%V 27
%8 2009-00
%G eng
%0 Generic
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Rajib Nath
%A Stanimire Tomov
%A Asim YarKhan
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, OR
%8 2009-11
%G eng
%0 Journal Article
%J Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%D 2009
%T Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Dmitrii Zagorodnov
%A Jack Dongarra
%A Rich Wolski
%B Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%I Springer Netherlands
%V 12
%P 101-122
%8 2009-00
%G eng
%0 Journal Article
%J in Handbook of Research on Scalable Computing Technologies (to appear)
%D 2009
%T Reliability and Performance Modeling and Analysis for Grid Computing
%A Yuan-Shun Dai
%A Jack Dongarra
%E Kuan-Ching Li
%E Ching-Hsien Hsu
%E Laurence Yang
%E Jack Dongarra
%E Hans Zima
%B in Handbook of Research on Scalable Computing Technologies (to appear)
%I IGI Global
%P 219-245
%8 2009-00
%G eng
%0 Journal Article
%J in Cloud Computing and Software Services: Theory and Techniques (to appear)
%D 2009
%T Transparent Cross-Platform Access to Software Services using GridSolve and GridRPC
%A Keith Seymour
%A Asim YarKhan
%A Jack Dongarra
%E Syed Ahson
%E Mohammad Ilyas
%K netsolve
%B in Cloud Computing and Software Services: Theory and Techniques (to appear)
%I CRC Press
%8 2009-00
%G eng
%0 Conference Proceedings
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%D 2009
%T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
%A Lavanya Ramakrishan
%A Daniel Nurmi
%A Anirban Mandal
%A Charles Koelbel
%A Dennis Gannon
%A Mark Huang
%A Yang-Suk Kee
%A Graziano Obertelli
%A Kiran Thyagaraja
%A Rich Wolski
%A Asim YarKhan
%A Dmitrii Zagorodnov
%K grads
%B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear)
%C Portland, OR
%8 2009-00
%G eng
%0 Conference Proceedings
%B The 3rd international Workshop on Automatic Performance Tuning
%D 2008
%T A Comparison of Search Heuristics for Empirical Code Optimization
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%K gco
%B The 3rd international Workshop on Automatic Performance Tuning
%C Tsukuba, Japan
%8 2008-10
%G eng
%0 Journal Article
%J in Advances in Computers
%D 2008
%T DARPA's HPCS Program: History, Models, Tools, Languages
%A Jack Dongarra
%A Robert Graybill
%A William Harrod
%A Robert Lucas
%A Ewing Lusk
%A Piotr Luszczek
%A Janice McMahon
%A Allan Snavely
%A Jeffrey Vetter
%A Katherine Yelick
%A Sadaf Alam
%A Roy Campbell
%A Laura Carrington
%A Tzu-Yi Chen
%A Omid Khalili
%A Jeremy Meredith
%A Mustafa Tikir
%E M. Zelkowitz
%B in Advances in Computers
%I Elsevier
%V 72
%8 2008-01
%G eng
%0 Conference Proceedings
%B ACM/IEEE International Symposium on High Performance Distributed Computing
%D 2008
%T The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%A Rich Wolski
%K gco
%K netsolve
%B ACM/IEEE International Symposium on High Performance Distributed Computing
%C Boston, MA.
%8 2008-06
%G eng
%0 Journal Article
%J Proc. SciDAC 2008
%D 2008
%T PERI Auto-tuning
%A David Bailey
%A Jacqueline Chame
%A Chun Chen
%A Jack Dongarra
%A Mary Hall
%A Jeffrey K. Hollingsworth
%A Paul D. Hovland
%A Shirley Moore
%A Keith Seymour
%A Jaewook Shin
%A Ananta Tiwari
%A Sam Williams
%A Haihang You
%K gco
%B Proc. SciDAC 2008
%I Journal of Physics
%C Seatlle, Washington
%V 125
%8 2008-01
%G eng
%0 Conference Proceedings
%B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted)
%D 2008
%T Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve
%A Yinan Li
%A Jack Dongarra
%A Keith Seymour
%A Asim YarKhan
%B International Conference on Grid and Cooperative Computing (GCC 2008) (submitted)
%C Shenzhen, China
%8 2008-10
%G eng
%0 Generic
%D 2007
%T Automated Empirical Tuning of a Multiresolution Analysis Kernel
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%A Shirley Moore
%K gco
%B ICL Technical Report
%P 10
%8 2007-01
%G eng
%0 Journal Article
%J DOE SciDAC Review (to appear)
%D 2007
%T Creating Software Technology to Harness the Power of Leadership-class Computing Systems
%A John Mellor-Crummey
%A Pete Beckman
%A Jack Dongarra
%A Barton Miller
%A Katherine Yelick
%B DOE SciDAC Review (to appear)
%8 2007-06
%G eng
%0 Generic
%D 2007
%T Empirical Tuning of a Multiresolution Analysis Kernel using a Specialized Code Generator
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%A Shirley Moore
%K gco
%B ICL Technical Report
%8 2007-01
%G eng
%0 Conference Proceedings
%B Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006)
%D 2007
%T GridSolve: The Evolution of Network Enabled Solver
%A Asim YarKhan
%A Jack Dongarra
%A Keith Seymour
%E Patrick Gaffney
%K netsolve
%B Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006)
%I Springer
%P 215-226
%8 2007-00
%G eng
%0 Journal Article
%J Parallel Processing Letters
%D 2007
%T Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware
%A Emmanuel Jeannot
%A Keith Seymour
%A Asim YarKhan
%A Jack Dongarra
%B Parallel Processing Letters
%V 17
%P 47-59
%8 2007-03
%G eng
%0 Conference Proceedings
%B Journal of Physics: Conference Series, SciDAC 2007
%D 2007
%T Multithreading for synchronization tolerance in matrix factorization
%A Alfredo Buttari
%A Jack Dongarra
%A Parry Husbands
%A Jakub Kurzak
%A Katherine Yelick
%B Journal of Physics: Conference Series, SciDAC 2007
%V 78
%8 2007-01
%G eng
%0 Conference Proceedings
%B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS
%D 2007
%T Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing
%A Zizhong Chen
%A Ming Yang
%A Guillermo Francia III
%A Jack Dongarra
%B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS
%P 1-8
%8 2007-03
%G eng
%0 Generic
%D 2006
%T ATLAS on the BlueGene/L – Preliminary Results
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%K gco
%B ICL Technical Report
%8 2006-01
%G eng
%0 Journal Article
%J Parallel Processing Letters
%D 2006
%T Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Server
%A Emmanuel Jeannot
%A Keith Seymour
%A Asim YarKhan
%A Jack Dongarra
%K netsolve
%B Parallel Processing Letters
%V 17
%P 47-59
%8 2006-03
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms)
%D 2006
%T Recent Developments in GridSolve
%A Asim YarKhan
%A Keith Seymour
%A Kiran Sagi
%A Zhiao Shi
%A Jack Dongarra
%E Yves Robert
%K netsolve
%B International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms)
%I Sage Science Press
%V 20
%8 2006-00
%G eng
%0 Journal Article
%J IBM Journal of Research and Development
%D 2006
%T Self Adapting Numerical Software SANS Effort
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Victor Eijkhout
%A Graham Fagg
%A Erika Fuentes
%A Julien Langou
%A Piotr Luszczek
%A Jelena Pjesivac–Grbovic
%A Keith Seymour
%A Haihang You
%A Sathish Vadhiyar
%K gco
%B IBM Journal of Research and Development
%V 50
%P 223-238
%8 2006-01
%G eng
%0 Journal Article
%J Future Generation Computing Systems
%D 2005
%T Biological Sequence Alignment on the Computational Grid Using the GrADS Framework
%A Asim YarKhan
%A Jack Dongarra
%K grads
%B Future Generation Computing Systems
%I Elsevier
%V 21
%P 980-986
%8 2005-06
%G eng
%0 Generic
%D 2005
%T An Effective Empirical Search Method for Automatic Software Tuning
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%K gco
%B ICL Technical Report
%8 2005-01
%G eng
%0 Journal Article
%J Grid Computing and New Frontiers of High Performance Processing
%D 2005
%T NetSolve: Grid Enabling Scientific Computing Environments
%A Keith Seymour
%A Asim YarKhan
%A Sudesh Agrawal
%A Jack Dongarra
%E Lucio Grandinetti
%K netsolve
%B Grid Computing and New Frontiers of High Performance Processing
%I Elsevier
%8 2005-00
%G eng
%0 Journal Article
%J International Journal of Parallel Programming
%D 2005
%T New Grid Scheduling and Rescheduling Methods in the GrADS Project
%A Francine Berman
%A Henri Casanova
%A Andrew Chien
%A Keith Cooper
%A Holly Dail
%A Anshuman Dasgupta
%A Wei Deng
%A Jack Dongarra
%A Lennart Johnsson
%A Ken Kennedy
%A Charles Koelbel
%A Bo Liu
%A Xu Liu
%A Anirban Mandal
%A Gabriel Marin
%A Mark Mazina
%A John Mellor-Crummey
%A Celso Mendes
%A A. Olugbile
%A Jignesh M. Patel
%A Dan Reed
%A Zhiao Shi
%A Otto Sievert
%A H. Xia
%A Asim YarKhan
%K grads
%B International Journal of Parallel Programming
%I Springer
%V 33
%P 209-229
%8 2005-06
%G eng
%0 Conference Paper
%B International Conference on Computational Science (ICCS 2004)
%D 2004
%T Accurate Cache and TLB Characterization Using Hardware Counters
%A Jack Dongarra
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A Haihang You
%K gco
%K lacsi
%K papi
%X We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity. Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions.
%B International Conference on Computational Science (ICCS 2004)
%I Springer
%C Krakow, Poland
%8 2004-06
%G eng
%R https://doi.org/10.1007/978-3-540-24688-6_57
%0 Conference Paper
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004)
%D 2004
%T Automatic Blocking of QR and LU Factorizations for Locality
%A Qing Yi
%A Ken Kennedy
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%K gco
%K papi
%K sans
%X QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provides manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004)
%I ACM
%C Washington, DC
%8 2004-06
%G eng
%R 10.1145/1065895.1065898
%0 Journal Article
%J Engineering the Grid (to appear)
%D 2004
%T An Overview of Heterogeneous High Performance and Grid Computing
%A Jack Dongarra
%A Alexey Lastovetsky
%E Beniamino Di Martino
%E Jack Dongarra
%E Adolfy Hoisie
%E Laurence Yang
%E Hans Zima
%B Engineering the Grid (to appear)
%I Nova Science Publishers, Inc.
%8 2004-00
%G eng
%0 Conference Proceedings
%B IEEE Proceedings (to appear)
%D 2004
%T Self Adapting Linear Algebra Algorithms and Software
%A James Demmel
%A Jack Dongarra
%A Victor Eijkhout
%A Erika Fuentes
%A Antoine Petitet
%A Rich Vuduc
%A Clint Whaley
%A Katherine Yelick
%K salsa
%K sans
%B IEEE Proceedings (to appear)
%8 2004-00
%G eng
%0 Journal Article
%J Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted)
%D 2003
%T Energy Minimization of Protein Tertiary Structure by Parallel Simulated Annealing using Genetic Crossover
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Shinya Ogura
%A Keiko Aoi
%A Takeshi Yoshida
%A Yuko Okamoto
%A Jack Dongarra
%B Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted)
%8 2003-03
%G eng
%0 Conference Paper
%B PADTAD Workshop, IPDPS 2003
%D 2003
%T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%A Haihang You
%A Min Zhou
%K lacsi
%K papi
%X The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
%B PADTAD Workshop, IPDPS 2003
%I IEEE
%C Nice, France
%8 2003-04
%@ 0-7695-1926-1
%G eng
%0 Conference Proceedings
%B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference
%D 2003
%T GrADSolve - RPC for High Performance Computing on the Grid
%A Sathish Vadhiyar
%A Jack Dongarra
%A Asim YarKhan
%E Harald Kosch
%E Laszlo Boszormenyi
%E Hermann Hellwagner
%K netsolve
%B Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference
%I Springer-Verlag, Berlin
%C Klagenfurt, Austria
%V 2790
%P 394-403
%8 2003-01
%G eng
%R 10.1007/978-3-540-45209-6_58
%0 Journal Article
%J Resource Management in the Grid
%D 2003
%T Scheduling in the Grid Application Development Software Project
%A Holly Dail
%A Otto Sievert
%A Francine Berman
%A Henri Casanova
%A Asim YarKhan
%A Sathish Vadhiyar
%A Jack Dongarra
%A Chuang Liu
%A Lingyun Yang
%A Dave Angulo
%A Ian Foster
%K grads
%B Resource Management in the Grid
%I Kluwer Publishers
%8 2003-03
%G eng
%0 Conference Proceedings
%B Grid Computing - GRID 2002, Third International Workshop
%D 2002
%T Experiments with Scheduling Using Simulated Annealing in a Grid Environment
%A Asim YarKhan
%A Jack Dongarra
%E Manish Parashar
%K grads
%B Grid Computing - GRID 2002, Third International Workshop
%I Springer
%C Baltimore, MD
%V 2536
%P 232-242
%8 2002-11
%G eng
%0 Journal Article
%J Meeting of the Japan Society of Mechanical Engineers
%D 2002
%T Truss Structural Optimization Using NetSolve System
%A Tomoyuki Hiroyasu
%A Mitsunori Miki
%A Hisashi Shimosaka
%A Masaki Sano
%A Yusuke Tanimura
%A Yasunari Mimura
%A Shinobu Yoshimura
%A Jack Dongarra
%K netsolve
%B Meeting of the Japan Society of Mechanical Engineers
%C Kyoto University, Kyoto, Japan
%8 2002-10
%G eng