%0 Generic
%D 2020
%T Roadmap for Refactoring Classic PAPI to PAPI++: Part II: Formulation of Roadmap Based on Survey Results
%A Heike Jagode
%A Anthony Danalis
%A Damien Genet
%B PAPI++ Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-07
%G eng

%0 Generic
%D 2018
%T Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System
%A George Bosilca
%A Damien Genet
%A Robert Harrison
%A Thomas Herault
%A Mohammad Mahdi Javanmard
%A Chong Peng
%A Edward Valeev
%X The need for predictive simulation of electronic structure in chemistry and materials science calls for fast, reduced-scaling formulations of quantum n-body methods that replace the traditional dense tensors with element-, block-, rank-, and block-rank-sparse (data-sparse) tensors. The resulting, highly irregular data structures are a poor match for the imperative, bulk-synchronous parallel programming style, owing to the dynamic nature of the problem and the lack of a clear domain decomposition that would guarantee a fair load balance. The TESSE runtime and its associated programming model aim to support performance-portable composition of applications involving irregular and dynamically changing data. In this paper we report an implementation of irregular dense tensor contraction in a paradigmatic electronic structure application based on the TESSE extension of PaRSEC, a distributed hybrid task runtime system, and analyze the resulting performance on a distributed-memory cluster of multi-GPU nodes. Unprecedented strong scaling and promising efficiency indicate a viable future for task-based programming of complete production-quality reduced-scaling models of electronic structure.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2018-12
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2017
%T Argobots: A Lightweight Low-Level Threading and Tasking Framework
%A Sangmin Seo
%A Abdelhalim Amer
%A Pavan Balaji
%A Cyril Bordage
%A George Bosilca
%A Alex Brooks
%A Philip Carns
%A Adrian Castello
%A Damien Genet
%A Thomas Herault
%A Shintaro Iwasaki
%A Prateek Jindal
%A Sanjay Kale
%A Sriram Krishnamoorthy
%A Jonathan Lifflander
%A Huiwei Lu
%A Esteban Meneses
%A Marc Snir
%A Yanhua Sun
%A Kenjiro Taura
%A Pete Beckman
%K Argobots
%K context switch
%K I/O
%K interoperability
%K lightweight
%K MPI
%K OpenMP
%K stackable scheduler
%K tasklet
%K user-level thread
%X In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures, or lack the required generality and flexibility. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls that allow specialization by the user or the high-level programming model. We describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and a co-located I/O service.
Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) an I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.
%B IEEE Transactions on Parallel and Distributed Systems
%8 2017-10
%G eng
%U http://ieeexplore.ieee.org/document/8082139/
%R 10.1109/TPDS.2017.2766062

%0 Conference Paper
%B Euro-Par 2014
%D 2014
%T Assembly Operations for Multicore Architectures using Task-Based Runtime Systems
%A Damien Genet
%A Abdou Guermouche
%A George Bosilca
%X Traditionally, numerical simulations based on finite element methods divide the algorithm into three major steps: the generation of a set of blocks and vectors, the assembly of these blocks into a matrix and a large vector, and the inversion of the matrix. In this paper we tackle the second step, the block assembly, for which no parallel algorithm is widely available. Several strategies are proposed to decompose the assembly problem while relying on a scheduling middleware to maximize the overlap between stages and increase the parallelism, and thus the performance. These strategies are evaluated using examples covering two extremes of the field: a large number of non-overlapping small blocks, as in CFD-like problems, and a smaller number of larger blocks with significant overlap, as encountered in sparse linear algebra solvers.
%I Springer International Publishing
%C Porto, Portugal
%8 2014-08
%G eng