%0 Journal Article
%J Parallel Computing
%D 2022
%T Using long vector extensions for MPI reductions
%A Zhong, Dong
%A Cao, Qinglei
%A Bosilca, George
%A Dongarra, Jack
%X The modern CPU’s design, including the deep memory hierarchies and SIMD/vectorization capability, has a more significant impact on algorithms’ efficiency than the modest frequency increase observed recently. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization a critical software component for increasing efficiency and closing the gap to peak performance. In this paper, we investigate the impact of vectorizing MPI reduction operations. We propose an implementation of the predefined MPI reduction operations using vector intrinsics (AVX and SVE) to improve their time-to-solution. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures. Experiments conducted on varied architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX) show that the proposed vector-extension-optimized reduction operations significantly reduce completion time for collective communication reductions. With these optimizations, we achieve higher memory bandwidth and increased efficiency for local computations, which directly benefit the overall cost of collective reductions and the applications based on them.
%B Parallel Computing
%V 109
%P 102871
%8 2022-03
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819121001137
%! Parallel Computing
%R 10.1016/j.parco.2021.102871