%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems
%A Fengguang Song
%A Jack Dongarra
%K dense linear algebra
%K distributed dataflow scheduling
%K heterogeneous HPC systems
%K runtime systems
%X Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 3702-3723
%8 2015-09
%G eng
%N 14
%R 10.1002/cpe.3403

%0 Conference Proceedings
%B International conference on Supercomputing
%D 2014
%T Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores
%A Fengguang Song
%A Jack Dongarra
%X While the growing number of cores per chip allows researchers to solve larger scientific and engineering problems, the parallel efficiency of the deployed parallel software starts to decrease. This unscalability problem happens to both vendor-provided and open-source software and wastes CPU cycles and energy. By expecting CPUs with hundreds of cores to be imminent, we have designed a new framework to perform matrix computations for massively many cores. Our performance analysis on manycore systems shows that the unscalability bottleneck is related to Non-Uniform Memory Access (NUMA): memory bus contention and remote memory access latency. To overcome the bottleneck, we have designed NUMA-aware tile algorithms with the help of a dynamic scheduling runtime system to minimize NUMA memory accesses. The main idea is to identify the data that is either read a number of times or written once by a thread resident on a remote NUMA node, then utilize the runtime system to conduct data caching and movement between different NUMA nodes. Based on the experiments with QR factorizations, we demonstrate that our framework is able to achieve great scalability on a 48-core AMD Opteron system (e.g., parallel efficiency drops only 3% from one core to 48 cores). We also deploy our framework to an extreme-scale shared-memory SGI machine which has 1024 CPU cores and runs a single Linux operating system image. Our framework continues to scale well, and can outperform the vendor-optimized Intel MKL library by up to 750%.
%B International conference on Supercomputing
%I ACM
%C Munich, Germany
%P 333-342
%8 2014-06
%@ 978-1-4503-2642-1
%G eng
%R 10.1145/2597652.2597670

%0 Conference Proceedings
%B 26th ACM International Conference on Supercomputing (ICS 2012)
%D 2012
%T Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems
%A Fengguang Song
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B 26th ACM International Conference on Supercomputing (ICS 2012)
%I ACM
%C San Servolo Island, Venice, Italy
%8 2012-06
%G eng

%0 Conference Proceedings
%B The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012)
%D 2012
%T A Scalable Framework for Heterogeneous GPU-Based Clusters
%A Fengguang Song
%A Jack Dongarra
%K magma
%B The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012)
%I ACM
%C Pittsburgh, PA, USA
%8 2012-06
%G eng

%0 Generic
%D 2011
%T Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures
%A Fengguang Song
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-668 (also LAWN 250)
%8 2011-06
%G eng

%0 Generic
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaief
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%V –10-653
%8 2010-04
%G eng

%0 Journal Article
%J SC'10
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaief
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B SC'10
%I ACM SIGARCH/IEEE Computer Society
%C New Orleans, LA
%8 2010-11
%G eng

%0 Journal Article
%J IEEE Cluster 2009
%D 2009
%T Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%K gridpac
%K mumi
%B IEEE Cluster 2009
%C New Orleans
%8 2009-08
%G eng

%0 Conference Proceedings
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%D 2009
%T Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems
%A Fengguang Song
%A Asim YarKhan
%A Jack Dongarra
%K mumi
%K plasma
%B International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09)
%C Portland, OR
%8 2009-11
%G eng

%0 Conference Proceedings
%B The International Conference on Computational Science 2009 (ICCS 2009)
%D 2009
%T A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%K plasma
%B The International Conference on Computational Science 2009 (ICCS 2009)
%C Baton Rouge, LA
%V 5544
%P 195-204
%8 2009-05
%G eng

%0 Generic
%D 2008
%T Analytical Modeling for Affinity-Based Thread Scheduling on Multicore Platforms
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-08-626
%8 2008-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming
%D 2008
%T Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications
%A Oscar Hernandez
%A Fengguang Song
%A Barbara Chapman
%A Jack Dongarra
%A Bernd Mohr
%A Shirley Moore
%A Felix Wolf
%B Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming
%I Springer Berlin / Heidelberg
%V 4315
%8 2008-00
%G eng

%0 Conference Proceedings
%B IEEE International Symposium on High Performance Distributed Computing
%D 2007
%T Feedback-Directed Thread Scheduling with Memory Considerations
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B IEEE International Symposium on High Performance Distributed Computing
%C Monterey Bay, CA
%8 2007-06
%G eng

%0 Conference Proceedings
%B Proceedings of the 2007 International Conference on Parallel Processing
%D 2007
%T L2 Cache Modeling for Scientific Applications on Chip Multi-Processors
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B Proceedings of the 2007 International Conference on Parallel Processing
%I IEEE Computer Society
%C Xi'an, China
%8 2007-01
%G eng

%0 Conference Proceedings
%B 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted)
%D 2006
%T Experiments with Strassen's Algorithm: From Sequential to Parallel
%A Fengguang Song
%A Jack Dongarra
%A Shirley Moore
%B 18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted)
%C Dallas, Texas
%8 2006-01
%G eng

%0 Generic
%D 2006
%T Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors
%A Fengguang Song
%A Shirley Moore
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2006-01
%G eng

%0 Conference Proceedings
%B Second International Workshop on OpenMP
%D 2006
%T Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications
%A Oscar Hernandez
%A Fengguang Song
%A Barbara Chapman
%A Jack Dongarra
%A Bernd Mohr
%A Shirley Moore
%A Felix Wolf
%K kojak
%B Second International Workshop on OpenMP
%C Reims, France
%8 2006-01
%G eng

%0 Conference Proceedings
%B In Proceedings of the International Conference on Parallel Processing
%D 2005
%T Automatic Experimental Analysis of Communication Patterns in Virtual Topologies
%A Nikhil Bhatia
%A Fengguang Song
%A Felix Wolf
%A Jack Dongarra
%A Bernd Mohr
%A Shirley Moore
%K kojak
%B In Proceedings of the International Conference on Parallel Processing
%I IEEE Computer Society
%C Oslo, Norway
%8 2005-06
%G eng

%0 Conference Proceedings
%B 2004 International Conference on Parallel Processing (ICPP-04)
%D 2004
%T An Algebra for Cross-Experiment Performance Analysis
%A Fengguang Song
%A Felix Wolf
%A Nikhil Bhatia
%A Jack Dongarra
%A Shirley Moore
%K kojak
%B 2004 International Conference on Parallel Processing (ICPP-04)
%C Montreal, Quebec, Canada
%8 2004-08
%G eng

%0 Conference Paper
%B 5th LCI International Conference on Linux Clusters: The HPC Revolution
%D 2004
%T Automating the Large-Scale Collection and Analysis of Performance
%A Phil Mucci
%A Jack Dongarra
%A Rick Kufrin
%A Shirley Moore
%A Fengguang Song
%A Felix Wolf
%K kojak
%K papi
%B 5th LCI International Conference on Linux Clusters: The HPC Revolution
%C Austin, Texas
%8 2004-05
%G eng

%0 Generic
%D 2004
%T CUBE User Manual
%A Fengguang Song
%A Felix Wolf
%K kojak
%B ICL Technical Report
%8 2004-02
%G eng