%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2023
%T Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors
%A José I. Aliaga
%A Hartwig Anzt
%A Enrique S. Quintana-Orti
%A Andres E. Thomas
%K graphics processing units
%K Singular value decomposition
%K sparse matrix-multivector product
%K sparse matrix-vector product
%X Many practical algorithms for numerical rank computations implement an iterative procedure that involves repeated multiplications of a vector, or a collection of vectors, with both a sparse matrix A and its transpose. Unfortunately, the realization of these sparse products in current high-performance libraries often delivers much lower arithmetic throughput when the matrix involved in the product is transposed. In this work, we propose a hybrid sparse matrix layout, named CSRC, that combines the flexibility of some well-known sparse formats to offer a number of appealing properties: (1) CSRC can be obtained at low cost from the popular CSR (compressed sparse row) format; (2) CSRC has storage requirements similar to those of CSR; and, especially, (3) the implementation of the sparse product kernels delivers high performance for both the direct product and its transposed variant on modern graphics accelerators thanks to a significant reduction of atomic operations compared to a conventional implementation based on CSR. This solution thus renders considerably higher performance when integrated into an iterative algorithm for the truncated singular value decomposition (SVD), such as the randomized SVD or, as demonstrated in the experimental results, the block Golub–Kahan–Lanczos algorithm.
%B Concurrency and Computation: Practice and Experience
%8 2023-08
%G eng
%U https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.7871
%! Concurrency and Computation
%R 10.1002/cpe.7871

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors
%A José I. Aliaga
%A Hartwig Anzt
%A Maribel Castillo
%A Juan C. Fernández
%A Germán León
%A Joaquín Pérez
%A Enrique S. Quintana-Orti
%K CG
%K CPUs
%K energy efficiency
%K GPUs
%K low-power architectures
%X In this paper, we analyze the interactions occurring in the triangle performance-power-energy for the execution of a pivotal numerical algorithm, the iterative conjugate gradient (CG) method, on a diverse collection of parallel multithreaded architectures. This analysis is especially timely in a decade where the power wall has arisen as a major obstacle to building faster processors. Moreover, the CG method has recently been proposed as a complement to the LINPACK benchmark, as this iterative method is argued to be more archetypical of the performance of today's scientific and engineering applications. To gain insights about the benefits of hands-on optimizations, we include runtime and energy efficiency results for both out-of-the-box usage relying exclusively on compiler optimizations and implementations manually optimized for target architectures that range from general-purpose and digital signal multicore processors to manycore graphics processing units, all representative of current multithreaded systems.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 885-904
%8 2014-09
%G eng
%U http://dx.doi.org/10.1002/cpe.3341
%N 4
%& 885
%R 10.1002/cpe.3341