%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems
%A Raffaele Solcà
%A Anton Kozhevnikov
%A Azzam Haidar
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%X We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multicore CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than traditional multicore CPU-only systems for such complex applications.
%I ACM
%C Austin, TX
%8 2015-11
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2014
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Azzam Haidar
%A Raffaele Solcà
%A Mark Gates
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%K Eigensolver
%K electronic structure calculations
%K generalized eigensolver
%K gpu
%K high performance
%K hybrid
%K Multicore
%K two-stage
%X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code.
%B International Journal of High Performance Computing Applications
%V 28
%P 196-209
%8 2014-05
%G eng
%N 2
%& 196
%R 10.1177/1094342013502097
%0 Conference Proceedings
%B International Supercomputing Conference (ISC)
%D 2013
%T Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Raffaele Solcà
%A Thomas C. Schulthess
%X Today’s high computational demands from engineering fields and complex hardware development make it necessary to develop and optimize new algorithms toward achieving high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe and analyze a successful methodology to address the challenges—starting from our algorithm design, kernel optimization and tuning, to our programming model—in the development of a scalable high-performance generalized eigenvalue solver in the context of electronic structure calculations in materials science applications. We developed a set of leading edge dense linear algebra algorithms, as part of a generalized eigensolver, featuring fine grained memory aware kernels, a task based approach and hybrid execution/scheduling. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. We report the performance impact on the generalized eigensolver when different fractions of eigenvectors are needed. The algorithm described provides an enormous performance boost compared to current GPU-based solutions, and performance comparable to state-of-the-art distributed solutions, using a single node with multiple GPUs.
%7 Lecture Notes in Computer Science
%I Springer Berlin Heidelberg
%C Leipzig, Germany
%V 7905
%P 67-80
%8 2013-06
%@ 978-3-642-38750-0
%G eng
%R 10.1007/978-3-642-38750-0_6
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Raffaele Solcà
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%X For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher-levels of software stacks, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques.
%B Concurrency and Computation: Practice and Experience
%8 2013-10
%G eng
%0 Journal Article
%J Supercomputing '12 (poster)
%D 2012
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Raffaele Solcà
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%B Supercomputing '12 (poster)
%C Salt Lake City, Utah
%8 2012-11
%G eng
%0 Journal Article
%J Oak Ridge National Laboratory Report
%D 2004
%T Cray X1 Evaluation Status Report
%A Pratul Agarwal
%A R. A. Alexander
%A E. Apra
%A Satish Balay
%A Arthur S. Bland
%A James Colgan
%A Eduardo D'Azevedo
%A Jack Dongarra
%A Tom Dunigan
%A Mark Fahey
%A Al Geist
%A M. Gordon
%A Robert Harrison
%A Dinesh Kaushik
%A M. Krishnakumar
%A Piotr Luszczek
%A Tony Mezzacappa
%A Jeff Nichols
%A Jarek Nieplocha
%A Leonid Oliker
%A T. Packwood
%A M. Pindzola
%A Thomas C. Schulthess
%A Jeffrey Vetter
%A James B White
%A T. Windus
%A Patrick H. Worley
%A Thomas Zacharia
%B Oak Ridge National Laboratory Report
%V /-2004/13
%8 2004-01
%G eng