Distributed Termination Detection for HPC Task-Based Environments

Title: Distributed Termination Detection for HPC Task-Based Environments
Publication Type: Tech Report
Year of Publication: 2019
Authors: Bosilca, G., A. Bouteiller, T. Herault, V. Le Fèvre, Y. Robert, and J. Dongarra
Technical Report Series Title: Innovative Computing Laboratory Technical Report
Number: ICL-UT-18-14
Date Published: 06-2018
Institution: University of Tennessee
Abstract

This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
