%0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems %A Raffaele Solcà %A Anton Kozhevnikov %A Azzam Haidar %A Stanimire Tomov %A Thomas C. Schulthess %A Jack Dongarra %X We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multicore CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multicore CPU only systems for such complex applications. %I ACM %C Austin, TX %8 2015-11 %G eng