Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs

TitleAccelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs
Publication TypeJournal Article
Year of Publication2017
AuthorsGates, M., S. Tomov, and J. Dongarra
JournalParallel Computing
Date Published11-2017
Keywords2-stage, accelerator, Divide and conquer, gpu, Singular value decomposition, SVD

The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms to take full advantage of today’s high performance computers. For dense matrices, the classic algorithm for the singular value decomposition (SVD) uses a one stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two stage reduction to bidiagonal has been gaining popularity. It first reduces the matrix to band form using high performance Level 3 BLAS, then reduces the band matrix to bidiagonal form. As accelerators such as GPUs and co-processors are becoming increasingly widespread in high-performance computing, a question of great interest to many SVD users is how much the employment of a two stage reduction, as well as other current best practices in GPU computing, can accelerate this important routine. To fulfill this interest, we have developed an accelerated SVD employing a two stage reduction to bidiagonal and a number of other algorithms that are highly optimized for GPUs. Notably, we also parallelize and accelerate the divide and conquer algorithm used to solve the subsequent bidiagonal SVD. By accelerating all phases of the SVD algorithm, we provide a significant speedup compared to existing multi-core and GPU-based SVD implementations. In particular, using a P100 GPU, we illustrate a performance of up to 804 Gflop/s in double precision arithmetic to compute the full SVD of a 20k × 20k matrix in 90 seconds, which is 8.9 ×  faster than MKL on two 10 core Intel Haswell E5-2650 v3 CPUs, 3.7 ×  over the multi-core PLASMA two stage version, and 2.6 ×  over the previously accelerated one stage MAGMA version.

Short TitleParallel Computing
Project Tags: 
External Publication Flag: