%0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K Cluster %K Collective communication %K Hierarchical %K HPC %K MPI %K Multicore %X Multicore Clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel assisted mechanisms. However, on distributed environments, a single level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted topology-aware collective framework, and the mechanisms deployed by this framework to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. We demonstrate experimentally, by considering three of the most used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to modifications of the underlying process-core binding; (2) it outperforms state-of-art MPI libraries (Open MPI, MPICH2 and MVAPICH2) demonstrating up to a 30x speedup for synthetic benchmarks, and up to a 3x acceleration for a parallel graph application (ASP); (3) it furthermore demonstrates a linear speedup with the increase of the number of cores per compute node, a paramount requirement for scalability on future many-core hardware. %B Journal of Parallel and Distributed Computing %V 73 %P 1000-1010 %8 2013-07 %G eng %U http://www.sciencedirect.com/science/article/pii/S0743731513000166 %N 7 %R 10.1016/j.jpdc.2013.01.015 %0 Journal Article %J IPDPS 2012 (Best Paper) %D 2012 %T HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %B IPDPS 2012 (Best Paper) %C Shanghai, China %8 2012-05 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. Nikolopoulos %E Jack Dongarra %K dague %B 18th EuroMPI %I Springer %C Santorini, Greece %P 247-254 %8 2011-09 %G eng %0 Conference Proceedings %B Int'l Conference on Parallel Processing (ICPP '11) %D 2011 %T Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. Squyres %A Jack Dongarra %B Int'l Conference on Parallel Processing (ICPP '11) %C Taipei, Taiwan %8 2011-09 %G eng %0 Conference Proceedings %B IEEE Int'l Conference on Cluster Computing (Cluster 2011) %D 2011 %T Process Distance-aware Adaptive MPI Collective Communications %A Teng Ma %A Thomas Herault %A George Bosilca %A Jack Dongarra %B IEEE Int'l Conference on Cluster Computing (Cluster 2011) %C Austin, Texas %8 2011-00 %G eng %0 Generic %D 2010 %T Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. Squyres %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-10-663 %8 2010-11 %G eng %0 Conference Proceedings %B Proceedings of the 17th EuroMPI conference %D 2010 %T Locality and Topology aware Intra-node Communication Among Multicore CPUs %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Proceedings of the 17th EuroMPI conference %I LNCS %C Stuttgart, Germany %8 2010-09 %G eng