%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2020
%T Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD
%A Yuechao Lu
%A Ichitaro Yamazaki
%A Fumihiko Ino
%A Yasuyuki Matsushita
%A Stanimire Tomov
%A Jack Dongarra
%K Divide and conquer
%K gpu
%K out-of-core computation
%K Singular value decomposition
%B Concurrency and Computation: Practice and Experience
%8 2020-04
%G eng
%R https://doi.org/10.1002/cpe.5754
%0 Generic
%D 2020
%T SLATE Tutorial
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Ali Charara
%A Jamie Finney
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Jack Dongarra
%I 2020 ECP Annual Meeting
%C Houston, TX
%8 2020-02
%G eng
%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Urike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Distributed-Memory Lattice H-Matrix Factorization
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Rio Yokota
%A Jack Dongarra
%X We parallelize the LU factorization of a hierarchical low-rank matrix (ℋ-matrix) on a distributed-memory computer. This is much more difficult than the ℋ-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the ℋ-matrix factorization. In this work, we propose to factorize the matrix using a “lattice ℋ-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the ℋ-matrix format. These blocks stored in the ℋ-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of ℋ-matrix factorization. We first compare factorization performances using the ℋ-matrix, BLR, and lattice ℋ-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the ℋ-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice ℋ-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1046–1063
%8 2019-08
%G eng
%N 5
%R https://doi.org/10.1177/1094342019861139
%0 Conference Paper
%B PAW-ATM Workshop at SC19
%D 2019
%T Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization
%A Yu Pei
%A George Bosilca
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Jack Dongarra
%B PAW-ATM Workshop at SC19
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist
%D 2019
%T Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%X The emergence of deep learning as a leading computational workload for machine learning tasks on large-scale cloud infrastructure installations has led to plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers on these new platforms makes it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. In order to address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats implemented in modern hardware accelerators and are not necessarily compliant with the IEEE half-precision specification. We include the explanation of a broader context and connections to established IEEE floating-point standards and existing high-performance computing (HPC) benchmarks. We also present a new formulation of LU factorization that we call signed square root LU which produces more numerically balanced L and U factors which directly address the problems of limited range of the low-precision storage formats. The experimental results indicate that it is possible to recover substantial amounts of the accuracy in the system solution that would otherwise be lost. Previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion will also explore the numerical stability issues that are important for robust linear solvers on these new hardware platforms.
%B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist
%I IEEE
%C Waltham, MA
%8 2019-09
%G eng
%1 Best Paper Finalist
%0 Conference Paper
%B International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation
%A Zhaojun Bai
%A Jack Dongarra
%A Ding Lu
%A Ichitaro Yamazaki
%X Some scientific and engineering applications need to compute a large number of eigenpairs of a large Hermitian matrix. Though the Lanczos method is effective for computing a few eigenvalues, it can be expensive for computing a large number of eigenpairs (e.g., in terms of computation and communication). To improve the performance of the method, in this paper, we study an s-step variant of thick-restart Lanczos (TRLan) combined with an explicit external deflation (EED). The s-step method generates a set of s basis vectors at a time and reduces the communication costs of generating the basis vectors. We then design a specialized matrix powers kernel (MPK) that reduces both the communication and computational costs by taking advantage of the special properties of the deflation matrix. We conducted numerical experiments of the new TRLan eigensolver using synthetic matrices and matrices from electronic structure calculations. The performance results on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) demonstrate the potential of the specialized MPK to significantly reduce the execution time of the TRLan eigensolver. The speedups of up to 3.1× and 5.3× were obtained in our sequential and parallel runs, respectively.
%B International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%0 Journal Article
%J Parallel Computing
%D 2019
%T Performance of Asynchronous Optimized Schwarz with One-sided Communication
%A Ichitaro Yamazaki
%A Edmond Chow
%A Aurelien Bouteiller
%A Jack Dongarra
%X In asynchronous iterative methods on distributed-memory computers, processes update their local solutions using data from other processes without an implicit or explicit global synchronization that corresponds to advancing the global iteration counter. In this work, we test the asynchronous optimized Schwarz domain-decomposition iterative method using various one-sided (remote direct memory access) communication schemes with passive target completion. The results show that when one-sided communication is well-supported, the asynchronous version of optimized Schwarz can outperform the synchronous version even for perfectly balanced partitionings of the problem on a supercomputer with uniform nodes.
%B Parallel Computing
%V 86
%P 66-81
%8 2019-08
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819118301261
%R https://doi.org/10.1016/j.parco.2019.05.004
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1145/3264491
%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2018
%T Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters
%A Ichitaro Yamazaki
%A Ahmad Abdelfattah
%A Akihiro Ida
%A Satoshi Ohshima
%A Stanimire Tomov
%A Rio Yokota
%A Jack Dongarra
%X ppohBEM is an open-source software package im- plementing the boundary element method. One of its main software tasks is the solution of the dense linear system of equations, for which, ppohBEM relies on another software package called HACApK. To reduce the cost of solving the linear system, HACApK hierarchically compresses the coefficient matrix using adaptive cross approximation. This hierarchical compression greatly reduces the storage and time complexities of the solver and enables the solution of large-scale boundary value problems. To extend the capability of ppohBEM, in this paper, we carefully port the HACApK’s linear solver onto GPU clusters. Though the potential of the GPUs has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUs for a solver, like HACApK’s, that requires fine-grained computation and global communication. First, to utilize the GPUs, we integrate the batched GPU kernel that was recently released in the MAGMA software package. We discuss several techniques to improve the performance of the batched kernel. We then study various techniques to address the inter-GPU communication and study their effects on state-of- the-art GPU clusters. We believe that the techniques studied in this paper are of interest to a wide range of software packages running on GPUs, especially with the increasingly complex node architectures and the growing costs of the communication. We also hope that our efforts to integrate the GPU kernel or to setup the inter-GPU communication will influence the design of the future-generation batched kernels or the communication layer within a software stack.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2018
%T Autotuning Techniques for Performance-Portable Point Set Registration in 3D
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Vasileios Maroulas
%A Jack Dongarra
%X We present an autotuning approach applied to exhaustive performance engineering of the EM-ICP algorithm for the point set registration problem with a known reference. We were able to achieve progressively higher performance levels through a variety of code transformations and an automated procedure of generating a large number of implementation variants. Furthermore, we managed to exploit code patterns that are not common when only attempting manual optimization but which yielded in our tests better performance for the chosen registration algorithm. Finally, we also show how we maintained high levels of the performance rate in a portable fashion across a wide range of hardware platforms including multicore, manycore coprocessors, and accelerators. Each of these hardware classes is much different from the others and, consequently, cannot reliably be mastered by a single developer in a short time required to deliver a close-to-optimal implementation. We assert in our concluding remarks that our methodology as well as the presented tools provide a valid automation system for software optimization tasks on modern HPC hardware.
%B Supercomputing Frontiers and Innovations
%V 5
%8 2018-12
%G eng
%& 42
%R 10.14529/jsfi180404
%0 Generic
%D 2018
%T Least Squares Performance Report
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-12
%G eng
%9 SLATE Working Notes
%1 09
%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng
%9 SLATE Working Notes
%1 08
%0 Generic
%D 2018
%T MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster
%C Dallas, TX
%8 2018-11
%G eng
%0 Generic
%D 2018
%T MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR)
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Ichitaro Yamazaki
%A Jack Dongarra
%I NSF PI Meeting, Poster
%C Washington, DC
%8 2018-04
%G eng
%R https://doi.org/10.6084/m9.figshare.6174143.v3
%0 Generic
%D 2018
%T Parallel BLAS Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I University of Tennessee
%8 2018-04
%G eng
%1 05
%0 Generic
%D 2018
%T Parallel Norms Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 06
%0 Generic
%D 2018
%T Production Implementations of Pipelined & Communication-Avoiding Iterative Linear Solvers
%A Mark Hoemmen
%A Ichitaro Yamazaki
%I SIAM Conference on Parallel Processing for Scientific Computing
%C Tokyo, Japan
%8 2018-03
%G eng
%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 2018-11
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732
%0 Generic
%D 2018
%T Software-Defined Events (SDEs) in MAGMA-Sparse
%A Heike Jagode
%A Anthony Danalis
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Stanimire Tomov
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng
%0 Generic
%D 2018
%T Solver Interface & Performance on Cori
%A Hartwig Anzt
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Erik Boman
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-06
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Panruo Wu
%A Mawussi Zounon
%A Jack Dongarra
%K linear algebra
%K multithreading
%K runtime
%K symmetric indefinite matrices
%X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention-not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures-including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8-demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 1879–1892
%8 2018-08
%G eng
%N 8
%R 10.1109/TPDS.2018.2808964
%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4
%0 Generic
%D 2017
%T Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%I SIAM Annual Meeting
%C Pittsburgh, Pennsylvania
%8 2017-07
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101
%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%1 03
%0 Conference Proceedings
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%D 2017
%T Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
%A Ichitaro Yamazaki
%A Mark Hoemmen
%A Piotr Luszczek
%A Jack Dongarra
%X We compare the performance of pipelined and s-step GMRES, respectively referred to as l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of two algorithms, we propose another variant, (l, t)-GMRES, that not only does fewer global all-reduces than standard GMRES, but also overlaps those all-reduces with other work. We implemented the thread-parallelism and communication-overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, though we only used 50 nodes, when the latency cost became significant, our variant performed up to 1.22× better than s-GMRES by hiding all-reduces.
%B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
%C Orlando, FL
%8 2017-06
%G eng
%R https://doi.org/10.1109/IPDPSW.2017.65
%0 Generic
%D 2017
%T LAWN 294: Aasen's Symmetric Indenite Linear Solvers in LAPACK
%A Ichitaro Yamazaki
%A Jack Dongarra
%X Recently, we released two LAPACK subroutines that implement Aasen's algorithms for solving a symmetric indefinite linear system of equations. The first implementation is based on a partitioned right-looking variant of Aasen's algorithm (the column-wise left-looking panel factorization, followed by the right-looking trailing submatrix update using the panel). The second implements the two-stage left-looking variant of the algorithm (the block-wise left- looking algorithm that reduces the matrix to the symmetric band form, followed by the band LU factorization). In this report, we discuss our implementations and present our experimental results to compare the stability and performance of these two new solvers with those of the other two symmetric indefinite solvers in LAPACK (i.e., the Bunch-Kaufman and rook pivoting algorithms).
%B LAPACK Working Note
%I University of Tennessee
%8 2017-12
%G eng
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%1 01
%0 Conference Paper
%B IEEE International Conference on Big Data
%D 2017
%T Sampling Algorithms to Update Truncated SVD
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B IEEE International Conference on Big Data
%I IEEE
%C Boston, MA
%8 2017-12
%G eng
%0 Conference Paper
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%D 2017
%T Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Jack Dongarra
%X In this article, we present an autotuning approach applied to systematic performance engineering of the EM-ICP (Expectation-Maximization Iterative Closest Point) algorithm for the point set registration problem. We show how we were able to exceed the performance achieved by the reference code through multiple dependence transformations and automated procedure of generating and evaluating numerous implementation variants. Furthermore, we also managed to exploit code transformations that are not that common during manual optimization but yielded better performance in our tests for the EM-ICP algorithm. Finally, we maintained high levels of performance rate in a portable fashion across a wide range of HPC hardware platforms including multicore, many-core, and GPU-based accelerators. More importantly, the results indicate consistently high performance level and ability to move the task of data analysis through point-set registration to any modern compute platform without the concern of inferior asymptotic efficiency.
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%I IEEE
%C Boston, MA
%8 2017-12
%G eng
%R https://doi.org/10.1109/BigData.2017.8258258
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2017
%T Solving Dense Symmetric Indefinite Systems using GPUs
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%X This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent synchronizations and irregular data accesses. Until recently, there has not been any implementation of these algorithms on a hybrid CPU/GPU architecture. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive data transfer and synchronization between the CPU and GPU, or on the GPU (e.g., factorizing the matrix entirely on the GPU or in a communication-avoiding fashion). We also study the performance of the solver using iterative refinements along with the factorization without pivoting combined with the preprocessing technique based on random butterfly transformations, or with the mixed-precision algorithm where the matrix is factorized in single precision. This randomization algorithm only has a probabilistic proof on the numerical stability, and for this paper, we only focused on the mixed-precision algorithm without pivoting. However, they demonstrate that we can obtain good performance on the GPU by avoiding the pivoting and using the lower precision arithmetics, respectively. As illustrated with the application in acoustics studied in this paper, in many practical cases, the matrices can be factorized without pivoting. Because the componentwise backward error computed in the iterative refinement signals when the algorithm failed to obtain the desired accuracy, the user can use these potentially unstable but efficient algorithms in most of the cases and fall back to a more stable algorithm with pivoting only in the case of the failure.
%B Concurrency and Computation: Practice and Experience
%V 29
%8 2017-03
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.4055/full
%N 9
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.4055
%0 Journal Article
%J IEEE Embedded Systems Letters
%D 2017
%T Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K Karush Kuhn Tucker (KKT)
%K Realtime embedded convex optimization solver
%X With the increasing sophistication in the use of optimization algorithms such as deep learning on embedded systems, the convex optimization solvers on embedded systems have found widespread use. This letter presents a novel linear solver technique to reduce the run-time of convex optimization solver by using the property that some parameters are fixed during the solution iterations of a solve instance. Our experimental results show that the run-time can be reduced by two orders of magnitude.
%B IEEE Embedded Systems Letters
%V 9
%P 61–64
%8 2017-05
%G eng
%U http://ieeexplore.ieee.org/document/7917357/
%N 3
%R 10.1109/LES.2017.2700401
%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2017
%T Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A Jack Dongarra
%X With NVIDA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091031
%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48
%0 Book Section
%B Lecture Notes in Computer Science
%D 2016
%T Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
%A Marc Baboulin
%A Jack Dongarra
%A Adrien Remy
%A Stanimire Tomov
%A Ichitaro Yamazaki
%E Roman Wyrzykowski
%E Ewa Deelman
%E Konrad Karczewski
%E Jacek Kitowski
%E Kazimierz Wiatr
%K Communication-avoiding
%K Dense symmetric indefinite factorization
%K gpu computation
%K randomization
%X We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen’s algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an LDL^T factorization with no pivoting combined with the preprocessing technique based on Random Butterfly Transformations. Though such transformations only have probabilistic results on the numerical stability, they avoid the pivoting and obtain a great performance on the GPU.
%B Lecture Notes in Computer Science
%S 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I
%I Springer International Publishing
%V 9573
%P 86-95
%8 2015-09
%@ 978-3-319-32149-3
%G eng
%& Parallel Processing and Applied Mathematics
%R 10.1007/978-3-319-32149-3_9
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams1 and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Generic
%D 2016
%T High Performance Realtime Convex Solver for Embedded Systems
%A Ichitaro Yamazaki
%A Saeid Nooshabadi
%A Stanimire Tomov
%A Jack Dongarra
%K KKT
%K Realtime embedded convex optimization solver
%X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique to reduce the run-time of decomposition of KKT matrix for the convex optimization solver for an embedded system, by two orders of magnitude. We use the property that although the KKT matrix changes, some of its block sub-matrices are fixed during the solution iterations and the associated solving instances.
%B University of Tennessee Computer Science Technical Report
%8 2016-10
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Non-GPU-resident Dense Symmetric Indefinite Factorization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of the data movement into the memory: one needed for selecting and applying pivots and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance of such an algorithm when the pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry, which ensures the stability, may need to be selected as a pivot among the remaining diagonals, and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, the data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would require to update each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization compared with an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires to swap the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization by the symmetric factorization). To reduce the amount of the data transfer, in this paper we uses the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require to load the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization.Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms.
%B Concurrency and Computation: Practice and Experience
%8 2016-11
%G eng
%R 10.1002/cpe.4012
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2016
%T Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X To orthonormalize a set of dense vectors, Singular Value QR (SVQR) requires only one global reduction between the parallel processing units, and uses BLAS-3 kernels to perform most of its local computation. As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many of the current computers. In this paper, we study the stability and performance of various SVQR implementations on multicore CPUs with a GPU, focusing on the dense triangular solve, which performs half of the total floating-point operations in SVQR. As a part of this study, we examine its adaptive mixed-precision variant that decides if a lower-precision arithmetic can be used for the triangular solution at runtime without increasing the order of its orthogonality error. Since the backward error of this adaptive mixed-precision variant is significantly greater than that of the standard SVQR, we study its effects on the solution convergence of several subspace projection methods for solving a linear system of equations and for computing singular values or eigenvalues of a sparse matrix. Our experimental results indicate that in some cases, the convergence rate of the solver may not be affected by the larger backward errors, while reducing the time to solution.
%B ACM Transactions on Mathematical Software (TOMS)
%V 43
%8 2016-10
%G eng
%N 2
%0 Journal Article
%J Scientific Programming
%D 2015
%T Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X Low-rank matrices arise in many scientific and engineering computation. Both computational and storage costs of manipulating such matrices may be reduced by taking advantages of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently-developed software StruMF which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%-50% using the GPU.
%B Scientific Programming
%G eng
%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng
%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2015
%T Mixed-precision Block Gram Schmidt Orthogonalization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%A Jesse Barlow
%X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly to the condition number of the input matrix. However, when the desired higher-precision is not supported by the hardware, the software-emulated arithmetics are needed, which could significantly increase its computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetics, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as a hybrid CPU/GPU cluster, demonstrate that compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7:1 while maintaining about the same order of the numerical errors.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Journal Article
%J SIAM Journal on Scientific Computing
%D 2015
%T Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%X To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many of the current computer architectures, where the communication is becoming increasingly expensive compared to the arithmetic operations. This is especially true when the input matrix is tall-skinny. Unfortunately, the orthogonality error of CholQR depends quadratically on the condition number of the input matrix, and it is numerically unstable when the matrix is ill-conditioned. To enhance the stability of CholQR, we recently used mixed-precision arithmetic; the input and output matrices are in the working precision, but some of its intermediate results are accumulated in the doubled precision. In this paper, we analyze the numerical properties of this mixed-precision CholQR. Our analysis shows that by selectively using the doubled precision, the orthogonality error of the mixed-precision CholQR only depends linearly on the condition number of the input matrix. We provide numerical results to demonstrate the improved numerical stability of the mixed-precision CholQR in practice. We then study its performance. When the target hardware does not support the desired higher precision, software emulation is needed. For example, using software-emulated double-double precision for the working 64-bit double precision, the mixed-precision CholQR requires about 8.5x more floating-point instructions than that required by the standard CholQR. On the other hand, the increase in the communication cost using the double-double precision is less significant, and our performance results on multicore CPU with a different graphics processing unit (GPU) demonstrate that the overhead of using the double-double arithmetic is decreasing on a newer architecture, where the computation is becoming less expensive compared to the communication. As a result, with a latest NVIDIA GPU, the mixed-precision CholQR was only 1.4x slower than the standard CholQR. Finally, we present case studies of using the mixed-precision CholQR within communication-avoiding variants of Krylov subspace projection methods for solving a nonsymmetric linear system of equations and for solving a symmetric eigenvalue problem, on a multicore CPU with multiple GPUs. These case studies demonstrate that by using the higher precision for this small but critical segment of the Krylov methods, we can improve not only the overall numerical stability of the solvers but also, in some cases, their performance.
%B SIAM Journal on Scientific Computing
%V 37
%P C203-C330
%8 2015-05
%G eng
%R DOI:10.1137/14M0973773
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs
%A Ichitaro Yamazaki
%A Jesse Barlow
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 2015-10
%G eng
%R 10.14529/jsfi1504
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
%A Theo Mary
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 2015-04
%G eng
%N 5
%R 10.1002/cpe.3306
%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1
%0 Conference Paper
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%D 2014
%T Access-averse Framework for Computing Low-rank Matrix Approximations
%A Ichitaro Yamazaki
%A Theo Mary
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%C Washington, DC
%8 2014-10
%G eng
%0 Journal Article
%J SIAM Journal on Matrix Analysis and Application
%D 2014
%T Communication-Avoiding Symmetric-Indefinite Factorization
%A Grey Ballard
%A Dulceneia Becker
%A James Demmel
%A Jack Dongarra
%A Alex Druinsky
%A I Peled
%A Oded Schwartz
%A Sivan Toledo
%A Ichitaro Yamazaki
%X We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen’s triangular tridiagonalization. It factors a dense symmetric matrix A as the product A = P LT L T P T where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.
%B SIAM Journal on Matrix Analysis and Application
%V 35
%P 1364-1406
%8 2014-07
%G eng
%N 4
%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2014
%T Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C New Orleans, LA
%8 2014-11
%G eng
%0 Conference Paper
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%D 2014
%T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K dataflow
%K message-passing
%K multithreading
%K QR decomposition
%K runtime
%K systolic array
%X A systolic array provides an alternative computing paradigm to the von Neuman architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%D 2014
%T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Sivasankaran Rajamanickam
%A Eric G. Boman
%A Mark Hoemmen
%A Michael A. Heroux
%A Stanimire Tomov
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Improving the performance of CA-GMRES on multicores with multiple GPUs
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Hoemmen
%A Jack Dongarra
%X Abstract—The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B VECPAR 2014 (Best Paper)
%D 2014
%T Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Tingxing Dong
%A Jack Dongarra
%X We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32 or 64-bit floating-point precision, but uses higher-precision arithmetics to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetics, and requires about 20x more computation but about the same amount of communication as the standard orthogonalization scheme. Since the computation is becoming less expensive compared to the communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed-precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance.
%B VECPAR 2014 (Best Paper)
%C Eugene, OR
%8 2014-06
%G eng
%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Optimizing Krylov Subspace Solvers on Graphics Processing Units
%A Stanimire Tomov
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Hartwig Anzt
%A William Sawyer
%X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPUhost communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of highlevel sparse linear algebra libraries.
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%D 2014
%T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors
%A Azzam Haidar
%A Chongxiao Cao
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%R 10.1109/ScalA.2014.8
%0 Generic
%D 2014
%T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson.
%B University of Tennessee EECS Technical Report
%I University of Tennessee
%8 2014-11
%G eng
%0 Journal Article
%J IPDPS 2013 (submitted)
%D 2013
%T Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures
%A Ichitaro Yamazaki
%A Dulceneia Becker
%A Jack Dongarra
%A Alex Druinsky
%A I. Peled
%A Sivan Toledo
%A Grey Ballard
%A James Demmel
%A Oded Schwartz
%X Factorization of a dense symmetric indeﬁnite matrix is a key computational kernel in many scientiﬁc and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen’s algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
%B IPDPS 2013 (submitted)
%C Boston, MA
%8 2013-00
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Raffaele Solcà
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%X For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher-levels of software stacks, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques.
%B Concurrency and Computation: Practice and Experience
%8 2013-10
%G eng
%0 Conference Paper
%B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%D 2013
%T Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Stanimire Tomov
%A Jack Dongarra
%B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
%8 2013-05
%G eng
%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 2013-05
%G eng
%R 10.1109/IPDPS.2013.119
%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of the Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng
%0 Generic
%D 2012
%T MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
%A Stanimire Tomov
%A Jack Dongarra
%A Azzam Haidar
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Thomas Schulthess
%A Raffaele Solcà
%I GPU Technology Conference (GTC12), Presentation
%C San Jose, CA
%8 2012-05
%G eng
%0 Generic
%D 2012
%T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
%A Jack Dongarra
%A Tingxing Dong
%A Mark Gates
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation
%C Salt Lake City, UT
%8 2012-11
%G eng
%0 Conference Proceedings
%B The International Conference on Computational Science (ICCS)
%D 2012
%T One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B The International Conference on Computational Science (ICCS)
%8 2012-06
%G eng