%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2020
%T A Set of Batched Basic Linear Algebra Subprograms
%A Ahmad Abdelfattah
%A Timothy Costa
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Mawussi Zounon
%X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software
%8 2020-10
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R 10.1145/3264491

%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. We consider extensions beyond the original BLAS standard that specify a programming interface not only for routines with uniformly sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Pedro Valero-Lara
%A Mawussi Zounon
%K Batched BLAS
%K BLAS
%K High-performance computing
%K Memory management
%K Parallel processing
%K Scientific computing
%X A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
%B International Conference on Computational Science (ICCS 2017)
%I Elsevier
%C Zürich, Switzerland
%8 2017-06
%G eng
%R 10.1016/j.procs.2017.05.138

%0 Conference Paper
%B Euro-Par 2017
%D 2017
%T Optimized Batched Linear Algebra for Modern Architectures
%A Jack Dongarra
%A Sven Hammarling
%A Nicholas J. Higham
%A Samuel Relton
%A Mawussi Zounon
%X Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, 2 × 2 for example, this design exhibits two main issues: it fails to fully utilize the vector units and the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from 2 × 2 to 32 × 32. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
%B Euro-Par 2017
%I Springer
%C Santiago de Compostela, Spain
%8 2017-08
%G eng
%R 10.1007/978-3-319-64203-1_37

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2002
%T An Updated Set of Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B ACM Transactions on Mathematical Software
%V 28
%P 135-151
%8 2002-12
%G eng
%R 10.1145/567806.567807

%0 Journal Article
%J Submitted to ACM Transactions on Mathematical Software (an update)
%D 2001
%T Basic Linear Algebra Subprograms (BLAS)
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Iain Duff
%A Sven Hammarling
%A Greg Henry
%A Michael Heroux
%A Linda Kaufman
%A Andrew Lumsdaine
%A Antoine Petitet
%A Roldan Pozo
%A Karin Remington
%A Clint Whaley
%B Submitted to ACM Transactions on Mathematical Software (an update)
%8 2001-02
%G eng

%0 Book
%D 1999
%T LAPACK Users' Guide, 3rd ed.
%A Ed Anderson
%A Zhaojun Bai
%A Christian Bischof
%A Susan Blackford
%A James Demmel
%A Jack Dongarra
%A Jeremy Du Croz
%A Anne Greenbaum
%A Sven Hammarling
%A Alan McKenney
%A Danny Sorensen
%I Society for Industrial and Applied Mathematics
%C Philadelphia, PA
%8 1999-01
%G eng
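
The grouped, uniformly sized interface described in the Batched BLAS entries above (the 2018 specification and the 2020 TOMS paper) can be pictured with a small sketch. The routine name dgemm_batch_grouped and its argument layout are illustrative assumptions for exposition only, not the interface fixed by the standard; the sketch simply dispatches one standard CBLAS GEMM per matrix, with every matrix in a group sharing the same dimensions, leading dimensions, and scaling factors.

/* Illustrative sketch only: a grouped batched DGEMM in the spirit of the
   Batched BLAS entries above. The routine name and argument layout are
   assumptions for exposition, not the interface fixed by the standard. */
#include <cblas.h>

void dgemm_batch_grouped(int group_count, const int *group_sizes,
                         const int *m, const int *n, const int *k,
                         const double *alpha, const double *beta,
                         const double *const *A, const int *lda,
                         const double *const *B, const int *ldb,
                         double *const *C, const int *ldc)
{
    int offset = 0;  /* index of the first matrix in the current group */
    for (int g = 0; g < group_count; ++g) {
        /* Every matrix in group g shares m[g], n[g], k[g], alpha[g], beta[g],
           and the leading dimensions, so the per-problem work is uniform. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < group_sizes[g]; ++i) {
            int idx = offset + i;  /* position of this problem in the flat batch */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m[g], n[g], k[g], alpha[g],
                        A[idx], lda[g], B[idx], ldb[g],
                        beta[g], C[idx], ldc[g]);
        }
        offset += group_sizes[g];
    }
}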
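
The block interleaved memory format discussed in the ICCS 2017 and Euro-Par 2017 entries can likewise be sketched. The layout below, in which element (i, j) of matrix p sits at offset (i + j*m)*batch + p, is one plausible reading of "interleaving the matrices in memory" rather than the papers' actual code; it is shown only to make the cross-matrix vectorization idea concrete, since the innermost loop runs over the batch index, which is contiguous in memory.

/* Illustrative sketch only: C += A*B for a batch of small matrices stored in
   an interleaved layout, so that the innermost loop runs across matrices and
   touches contiguous memory (amenable to compiler vectorization). The layout
   and the routine name are assumptions for exposition, not the papers' code.
   C is assumed to be pre-scaled (e.g. zeroed for beta = 0). */
void dgemm_batch_interleaved(int batch, int m, int n, int k,
                             const double *A,  /* m*k*batch doubles */
                             const double *B,  /* k*n*batch doubles */
                             double       *C)  /* m*n*batch doubles */
{
    for (int j = 0; j < n; ++j)
        for (int l = 0; l < k; ++l)
            for (int i = 0; i < m; ++i) {
                const double *a = A + (i + l*m) * batch;  /* A_p(i,l) for all p */
                const double *b = B + (l + j*k) * batch;  /* B_p(l,j) for all p */
                double       *c = C + (i + j*m) * batch;  /* C_p(i,j) for all p */
                for (int p = 0; p < batch; ++p)  /* one multiply-add per matrix */
                    c[p] += a[p] * b[p];
            }
}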