%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2022
%T Accelerating Geostatistical Modeling and Prediction With Mixed-Precision Computations: A High-Productivity Approach With PaRSEC
%A Abdulah, Sameh
%A Cao, Qinglei
%A Pei, Yu
%A Bosilca, George
%A Dongarra, Jack
%A Genton, Marc G.
%A Keyes, David E.
%A Ltaief, Hatem
%A Sun, Ying
%K Computational modeling
%K Covariance matrices
%K Data models
%K Maximum likelihood estimation
%K Predictive models
%K runtime
%K Task analysis
%X Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data are assumed to possess properties of stationarity or non-stationarity via a kernel fitted to a covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum log-likelihood estimation (MLE), whose central data structure is a dense, symmetric positive definite covariance matrix of the dimension of the number of correlated observations. Two essential operations in MLE are the application of the inverse and evaluation of the determinant of the covariance matrix. These can be rendered through the Cholesky decomposition and triangular solution. In this contribution, we reduce the precision of weakly correlated locations to single or half precision based on distance. We thus exploit mathematical structure to migrate MLE to a three-precision approximation that takes advantage of contemporary architectures offering BLAS3-like operations in a single instruction that are extremely fast for reduced precision. We illustrate application-expected accuracy worthy of double precision from a majority half-precision computation, in a context where uniform single precision is by itself insufficient. In tackling the complexity and imbalance caused by the mixing of three precisions, we deploy the PaRSEC runtime system. PaRSEC delivers on-demand casting of precisions while orchestrating tasks and data movement in a multi-GPU distributed-memory environment within a tile-based Cholesky factorization. Application-expected accuracy is maintained while achieving up to 1.59X by mixing FP64/FP32 operations on 1536 nodes of HAWK or 4096 nodes of Shaheen II, and up to 2.64X by mixing FP64/FP32/FP16 operations on 128 nodes of Summit, relative to FP64-only operations. This translates into up to 4.5, 4.7, ...
%B IEEE Transactions on Parallel and Distributed Systems
%V 33
%P 964-976
%8 2022-04
%G eng
%U https://ieeexplore.ieee.org/document/9442267/
%N 4
%! IEEE Trans. Parallel Distrib. Syst.
%R 10.1109/TPDS.2021.3084071
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2022
%T Evaluating Data Redistribution in PaRSEC
%A Cao, Qinglei
%A Bosilca, George
%A Losada, Nuria
%A Wu, Wei
%A Zhong, Dong
%A Dongarra, Jack
%B IEEE Transactions on Parallel and Distributed Systems
%V 33
%P 1856-1872
%G eng
%R 10.1109/TPDS.2021.3131657
%0 Generic
%D 2021
%T Accelerating FFT towards Exascale Computing
%A Ayala, Alan
%A Tomov, Stanimire
%A Haidar, Azzam
%A Stoyanov, Miroslav
%A Cayrols, Sebastien
%A Li, Jiali
%A Bosilca, George
%A Dongarra, Jack
%I NVIDIA GPU Technology Conference (GTC2021)
%G eng
%0 Journal Article
%J Parallel Computing
%D 2021
%T Callback-based completion notification using MPI Continuations
%A Schuchart, Joseph
%A Samfass, Philipp
%A Niethammer, Christoph
%A Gracia, José
%A Bosilca, George
%K MPI
%K MPI Continuations
%K OmpSs
%K OpenMP
%K PaRSEC
%K TAMPI
%K Task-based programming models
%X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and nonblocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space.
%B Parallel Computing
%V 106
%P 102793
%8 2021-05
%G eng
%U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000466?via%3Dihub
%! Parallel Computing
%R 10.1016/j.parco.2021.102793
%0 Generic
%D 2021
%T DTE: PaRSEC Enabled Libraries and Applications
%A Bosilca, George
%A Herault, Thomas
%A Dongarra, Jack
%I 2021 Exascale Computing Project Annual Meeting
%8 2021-04
%G eng
%0 Conference Paper
%B EuroMPI'21
%D 2021
%T Quo Vadis MPI RMA? Towards a More Efficient Use of MPI One-Sided Communication
%A Schuchart, Joseph
%A Niethammer, Christoph
%A Gracia, José
%A Bosilca, George
%K Memory Handles
%K MPI
%K MPI-RMA
%K RDMA
%X The MPI standard has long included one-sided communication abstractions through the MPI Remote Memory Access (RMA) interface. Unfortunately, the MPI RMA chapter in the 4.0 version of the MPI standard still contains both well-known and lesser-known shortcomings for both implementations and users, which lead to potentially non-optimal usage patterns. In this paper, we identify a set of issues and propose ways for applications to better express anticipated usage of RMA routines, allowing the MPI implementation to better adapt to the application's needs. In order to increase the flexibility of the RMA interface, we add the capability to duplicate windows, allowing access to the same resources encapsulated by a window using different configurations. In the same vein, we introduce the concept of MPI memory handles, meant to provide life-time guarantees on memory attached to dynamic windows, removing the overhead currently present in using dynamically exposed memory. We show that our extensions provide improved accumulate latencies, reduced overheads for multi-threaded flushes, and allow for zero-overhead dynamic memory window usage.
%B EuroMPI'21
%C Garching, Munich Germany
%G eng
%U https://arxiv.org/abs/2111.08142
%0 Conference Proceedings
%B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2021
%T Revisiting Credit Distribution Algorithms for Distributed Termination Detection
%A Bosilca, George
%A Bouteiller, Aurélien
%A Herault, Thomas
%A Le Fèvre, Valentin
%A Robert, Yves
%A Dongarra, Jack
%K control messages
%K credit distribution algorithms
%K task-based HPC application
%K Termination detection
%X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.
%B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%P 611–620
%G eng
%R 10.1109/IPDPSW52791.2021.00095