%0 Journal Article
%J Parallel Computing
%D 2018
%T Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K 2-stage
%K accelerator
%K Divide and conquer
%K gpu
%K Singular value decomposition
%K SVD
%X The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms to take full advantage of today’s high performance computers. For dense matrices, the classic algorithm for the singular value decomposition (SVD) uses a one stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two stage reduction to bidiagonal has been gaining popularity. It first reduces the matrix to band form using high performance Level 3 BLAS, then reduces the band matrix to bidiagonal form. As accelerators such as GPUs and co-processors are becoming increasingly widespread in high-performance computing, a question of great interest to many SVD users is how much the employment of a two stage reduction, as well as other current best practices in GPU computing, can accelerate this important routine. To fulfill this interest, we have developed an accelerated SVD employing a two stage reduction to bidiagonal and a number of other algorithms that are highly optimized for GPUs. Notably, we also parallelize and accelerate the divide and conquer algorithm used to solve the subsequent bidiagonal SVD. By accelerating all phases of the SVD algorithm, we provide a significant speedup compared to existing multi-core and GPU-based SVD implementations. In particular, using a P100 GPU, we illustrate a performance of up to 804 Gflop/s in double precision arithmetic to compute the full SVD of a 20k × 20k matrix in 90 seconds, which is 8.9 × faster than MKL on two 10 core Intel Haswell E5-2650 v3 CPUs, 3.7 × over the multi-core PLASMA two stage version, and 2.6 × over the previously accelerated one stage MAGMA version.
%B Parallel Computing
%V 74
%P 3–18
%8 05-2018
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819117301758
%! Parallel Computing
%R 10.1016/j.parco.2017.10.004