%0 Conference Proceedings %B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2023 %T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements %A Daniel Barry %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K GPU power %K High Performance Computing %K network traffic %K papi %K performance analysis %K Performance Counters %X Some of the most important categories of performance events count the data traffic between the processing cores and the main memory. However, since these counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP); however, doing so adds an indirection layer that involves querying the PCP daemon. This paper performs a quantitative study of the accuracy of the measurements obtained through this component on the Summit supercomputer. We use two linear algebra kernels---a generalized matrix multiply, and a modified matrix-vector multiply---as benchmarks and a distributed, GPU-accelerated 3D-FFT mini-app (using cuFFT) to compare the measurements obtained through the PAPI PCP component against the expected values across different problem sizes. We also compare our measurements against an in-house machine with a very similar architecture to Summit, where elevated privileges allow PAPI to access the hardware counters directly (without using PCP) to show that measurements taken via PCP are as accurate as those taken directly. Finally, using both QMCPACK and the 3D-FFT, we demonstrate the diverse hardware activities that can be monitored simultaneously via PAPI hardware components. %B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C St. Petersburg, Florida %8 2023-08 %G eng %U https://ieeexplore.ieee.org/document/10196656 %R 10.1109/IPDPSW59300.2023.00070 %0 Generic %D 2023 %T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements %A Daniel Barry %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %I 28th HIPS Workshop %C St. Petersburg, FL %8 2023-05 %G eng %0 Book Section %B Accelerated Computing with HIP %D 2022 %T Performance Application Programming Interface %A Anthony Danalis %A Heike Jagode %B Accelerated Computing with HIP %I Sun, Baruah and Kaeli %8 2022-12 %@ B0BR8KSS7K %G eng %U https://a.co/d/0DoG5as %0 Book Section %B Tools for High Performance Computing 2018/2019 %D 2021 %T Effortless Monitoring of Arithmetic Intensity with PAPI’s Counter Analysis Toolkit %A Daniel Barry %A Anthony Danalis %A Heike Jagode %X With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that heavily utilize numerical kernels. Performance metrics in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it becomes more and more unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity seems particularly true for events related to hardware that resides beyond the compute core, such as events related to memory traffic. Still, traffic to memory is a necessary characteristic for determining arithmetic intensity. 
To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for reading and writing bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels. %B Tools for High Performance Computing 2018/2019 %I Springer %P 195–218 %@ 978-3-030-66057-4 %G eng %R 10.1007/978-3-030-66057-4_11 %0 Book Section %B Rare Earth Elements and Actinides: Progress in Computational Science Applications %D 2021 %T An Introduction to High Performance Computing and Its Intersection with Advances in Modeling Rare Earth Elements and Actinides %A Deborah A. Penchoff %A Edward Valeev %A Heike Jagode %A Piotr Luszczek %A Anthony Danalis %A George Bosilca %A Robert J. Harrison %A Jack Dongarra %A Theresa L. Windus %K actinide %K Computational modeling %K HPC %K REE %X Computationally driven solutions in nuclear and radiochemistry heavily depend on efficient modeling of Rare Earth Elements (REEs) and actinides. Accurate modeling of REEs and actinides faces challenges stemming from the limitations of an imbalanced hardware-software ecosystem and the resulting inefficient use of High Performance Computing (HPC). This chapter provides a historical perspective on the evolution of HPC hardware, its intersectionality with domain sciences, the importance of benchmarks for performance, and an overview of challenges and advances in modeling REEs and actinides. This chapter intends to provide an introduction for researchers at the intersection of scientific computing, software development for HPC, and applied computational modeling of REEs and actinides. The chapter is structured in five sections. First, the Introduction includes subsections focusing on the Importance of REEs and Actinides (1.1), Hardware, Software, and the HPC Ecosystem (1.2), and Electronic Structure Modeling of REEs and Actinides (1.3). Second, a section on High Performance Computing focuses on the TOP500 (2.1), HPC Performance (2.2), HPC Benchmarks: Processing, Bandwidth, and Latency (2.3), and HPC Benchmarks and their Relationship to Chemical Modeling (2.4). Third, a section on Software Challenges and Advances focuses on NWChem/NWChemEx (3.1), MADNESS (3.2), and MPQC (3.3). The fourth section provides a short overview of Artificial Intelligence in HPC applications relevant to nuclear and radiochemistry. The fifth section illustrates A Protocol to Evaluate Complexation Preferences in Separations of REEs and Actinides through Computational Modeling. 
%B Rare Earth Elements and Actinides: Progress in Computational Science Applications %I American Chemical Society %C Washington, DC %V 1388 %P 3-53 %8 2021-10 %@ ISBN13: 9780841298255 eISBN: 9780841298248 %G eng %U https://pubs.acs.org/doi/10.1021/bk-2021-1388.ch001 %& 1 %R 10.1021/bk-2021-1388.ch001 %0 Book %D 2021 %T Lecture Notes in Computer Science: High Performance Computing %A Heike Jagode %A Hartwig Anzt %A Hatem Ltaief %A Piotr Luszczek %X This book constitutes the refereed post-conference proceedings of 9 workshops held at the 35th International ISC High Performance 2021 Conference, in Frankfurt, Germany, in June-July 2021: Second International Workshop on the Application of Machine Learning Techniques to Computational Fluid Dynamics and Solid Mechanics Simulations and Analysis; HPC-IODC: HPC I/O in the Data Center Workshop; Compiler-assisted Correctness Checking and Performance Optimization for HPC; Machine Learning on HPC Systems; 4th International Workshop on Interoperability of Supercomputing and Cloud Technologies; 2nd International Workshop on Monitoring and Operational Data Analytics; 16th Workshop on Virtualization in High-Performance Cloud Computing; Deep Learning on Supercomputers; 5th International Workshop on In Situ Visualization. The 35 papers included in this volume were carefully reviewed and selected. They cover all aspects of research, development, and application of large-scale, high performance experimental and commercial systems. Topics include high-performance computing (HPC), computer architecture and hardware, programming models, system software, performance analysis and modeling, compiler analysis and optimization techniques, software sustainability, scientific applications, and deep learning. %I Springer International Publishing %V 12761 %@ 978-3-030-90538-5 %G eng %R 10.1007/978-3-030-90539-2 %0 Conference Paper %B 13th International Workshop on Parallel Tools for High Performance Computing %D 2020 %T Effortless Monitoring of Arithmetic Intensity with PAPI's Counter Analysis Toolkit %A Daniel Barry %A Anthony Danalis %A Heike Jagode %X With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that heavily utilize numerical kernels. Performance metrics in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it becomes more and more unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity seems particularly true for events related to hardware that resides beyond the compute core, such as events related to memory traffic. Still, traffic to memory is a necessary characteristic for determining arithmetic intensity. To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for reading and writing bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels. 
%B 13th International Workshop on Parallel Tools for High Performance Computing %I Springer International Publishing %C Dresden, Germany %8 2020-09 %G eng %0 Generic %D 2020 %T Exa-PAPI: The Exascale Performance API with Modern C++ %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Generic %D 2020 %T Formulation of Requirements for New PAPI++ Software Package: Part I: Survey Results %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %B PAPI++ Working Notes %I Innovative Computing Laboratory, University of Tennessee Knoxville %8 2020-01 %G eng %0 Generic %D 2020 %T Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX) (Poster) %A Jack Dongarra %A Heike Jagode %A Anthony Danalis %A Daniel Barry %A Vince Weaver %I 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting %C Seattle, WA %8 2020-02 %G eng %0 Generic %D 2020 %T PULSE: PAPI Unifying Layer for Software-Defined Events (Poster) %A Heike Jagode %A Anthony Danalis %I 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting %C Seattle, WA %8 2020-02 %G eng %0 Generic %D 2020 %T Roadmap for Refactoring Classic PAPI to PAPI++: Part II: Formulation of Roadmap Based on Survey Results %A Heike Jagode %A Anthony Danalis %A Damien Genet %B PAPI++ Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2020-07 %G eng %0 Conference Paper %B 2019 International Conference on Parallel Computing (ParCo2019) %D 2019 %T Characterization of Power Usage and Performance in Data-Intensive Applications using MapReduce over MPI %A Joshua Davis %A Tao Gao %A Sunita Chandrasekaran %A Heike Jagode %A Anthony Danalis %A Pavan Balaji %A Jack Dongarra %A Michela Taufer %B 2019 International Conference on Parallel Computing (ParCo2019) %C Prague, Czech Republic %8 2019-09 %G eng %0 Conference Paper %B 11th International Workshop on Parallel Tools for High Performance Computing %D 2019 %T Counter Inspection Toolkit: Making Sense out of Hardware Performance Events %A Anthony Danalis %A Heike Jagode %A H Hanumantharayappa %A Sangamesh Ragate %A Jack Dongarra %X Hardware counters play an essential role in understanding the behavior of performance-critical applications, and inform any effort to identify opportunities for performance optimization. However, because modern hardware is becoming increasingly complex, the number of counters that are offered by the vendors increases and, in some cases, so does their complexity. In this paper we present a toolkit that aims to assist application developers invested in performance analysis by automatically categorizing and disambiguating performance counters. We present and discuss the set of microbenchmarks and analyses that we developed as part of our toolkit. We explain why they work and discuss the non-obvious reasons why some of our early benchmarks and analyses did not work in an effort to share with the rest of the community the wisdom we acquired from negative results. %B 11th International Workshop on Parallel Tools for High Performance Computing %I Cham, Switzerland: Springer %C Dresden, Germany %8 2019-02 %G eng %R https://doi.org/10.1007/978-3-030-11987-4_2 %0 Generic %D 2019 %T Does your tool support PAPI SDEs yet? 
%A Anthony Danalis %A Heike Jagode %A Jack Dongarra %I 13th Scalable Tools Workshop %C Tahoe City, CA %8 2019-07 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2019 %T PAPI Software-Defined Events for in-Depth Performance Analysis %A Heike Jagode %A Anthony Danalis %A Hartwig Anzt %A Jack Dongarra %X The methodology and standardization layer provided by the Performance Application Programming Interface (PAPI) has played a vital role in application profiling for almost two decades. It has enabled sophisticated performance analysis tool designers and performance-conscious scientists to gain insights into their applications by simply instrumenting their code using a handful of PAPI functions that “just work” across different hardware components. In the past, PAPI development had focused primarily on hardware-specific performance metrics. However, the rapidly increasing complexity of software infrastructure poses new measurement and analysis challenges for the developers of large-scale applications. In particular, acquiring information regarding the behavior of libraries and runtimes—used by scientific applications—requires low-level binary instrumentation, or APIs specific to each library and runtime. No uniform API for monitoring events that originate from inside the software stack has emerged. In this article, we present our efforts to extend PAPI’s role so that it becomes the de facto standard for exposing performance-critical events, which we refer to as software-defined events (SDEs), from different software layers. Upgrading PAPI with SDEs enables monitoring of both types of performance events—hardware- and software-related events—in a uniform way, through the same consistent PAPI interface. The goal of this article is threefold. First, we motivate the need for SDEs and describe our design decisions regarding the functionality we offer through PAPI’s new SDE interface. Second, we illustrate how SDEs can be utilized by different software packages, specifically, by showcasing their use in the numerical linear algebra library MAGMA-Sparse, the tensor algebra library TAMM that is part of the NWChem suite, and the compiler-based performance analysis tool Byfl. Third, we provide a performance analysis of the overhead that results from monitoring SDEs and discuss the trade-offs between overhead and functionality. %B The International Journal of High Performance Computing Applications %V 33 %P 1113-1127 %8 2019-11 %G eng %U https://doi.org/10.1177/1094342019846287 %N 6 %0 Generic %D 2019 %T PAPI's new Software-Defined Events for in-depth Performance Analysis %A Anthony Danalis %A Heike Jagode %A Jack Dongarra %X One of the most recent developments of the Performance API (PAPI) is the addition of Software-Defined Events (SDE). PAPI has successfully served the role of the abstraction and unification layer for hardware performance counters for the past two decades. This talk presents our effort to extend this role to encompass performance-critical information that does not originate in hardware, but rather in critical software layers, such as libraries and runtime systems. Our overall objective is to enable monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through one consistent PAPI interface. Performance analysts will be able to form a complete picture of the entire application performance without learning new instrumentation primitives. 
In this talk, we outline PAPI's new SDE API and showcase the usefulness of SDE through its employment in software layers as diverse as the math library MAGMA, the dataflow runtime PaRSEC, and the state-of-the-art chemistry application NWChem. We outline the process of instrumenting these software packages and highlight the performance information that can be acquired with SDEs. %I 13th Parallel Tools Workshop %C Dresden, Germany %8 2019-09 %G eng %0 Conference Paper %B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2019 %T Software-Defined Events through PAPI %A Anthony Danalis %A Heike Jagode %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %X PAPI has been used for almost two decades as an abstraction and standardization layer for profiling hardware-specific performance metrics. However, application developers—and profiling software packages—are quite often interested in information beyond hardware counters, such as the behavior of libraries used by the software that is being profiled. So far, accessing this information has required interfacing directly with the libraries on a case-by-case basis, or low-level binary instrumentation. In this paper, we introduce the new Software-Defined Event (SDE) component of PAPI which aims to enable PAPI to serve as an abstraction and standardization layer for events that originate in software layers as well. Extending PAPI to include SDEs enables monitoring of both types of performance events—hardware- and software-related events—in a uniform way, through the same consistent PAPI interface. Furthermore, implementing SDE as a PAPI component means that the new API is aimed only at the library developers who wish to export events from within their libraries. The API for reading PAPI events—both hardware and software—remains the same, so all legacy codes and tools that use PAPI will not only continue to work, but they will automatically be able to read SDEs wherever those are available. The goal of this paper is threefold. First, we outline our design decisions regarding the functionality we offer through the new SDE interface, and offer simple examples of usage. Second, we illustrate how those events can be utilized by different software packages, specifically, by showcasing their use in the task-based runtime PaRSEC, and the HPCG supercomputing benchmark. Third, we provide a thorough performance analysis of the overhead that results from monitoring different types of SDEs, and showcase the negligible overhead of using PAPI SDE even in cases of extremely heavy use. 
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C Rio de Janeiro, Brazil %8 2019-05 %G eng %R https://doi.org/10.1109/IPDPSW.2019.00069 %0 Generic %D 2019 %T Understanding Native Event Semantics %A Anthony Danalis %A Heike Jagode %A Daniel Barry %A Jack Dongarra %I 9th JLESC Workshop %C Knoxville, TN %8 2019-04 %G eng %0 Conference Paper %B 1st Workshop on Sustainable Scientific Software (CW3S19) %D 2019 %T What it Takes to keep PAPI Instrumental for the HPC Community %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %B 1st Workshop on Sustainable Scientific Software (CW3S19) %C Collegeville, Minnesota %8 2019-07 %G eng %U https://collegeville.github.io/CW3S19/WorkshopResources/WhitePapers/JagodeHeike_CW3S19_papi.pdf %0 Generic %D 2019 %T What it Takes to keep PAPI Instrumental for the HPC Community %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %I The 2019 Collegeville Workshop on Sustainable Scientific Software (CW3S19) %C Collegeville, MN %8 2019-07 %G eng %0 Generic %D 2019 %T Is your scheduling good? How would you know? %A Anthony Danalis %A Heike Jagode %A Jack Dongarra %X Optimal scheduling is a goal that can rarely be achieved, even in purely theoretical contexts where the nuanced behavior of complex hardware and software systems can be abstracted away, and simplified assumptions can be made. In real runtime systems, task schedulers are usually designed based on intuitions about optimal design and heuristics such as minimizing idle time and load imbalance, as well as maximizing data locality and reuse. This harsh reality is due in part to the very crude tools designers of task scheduling systems have at their disposal for assessing the quality of their assumptions. Examining hardware behavior—such as cache reuse—through counters rarely leads to improvement in scheduler design, and quite often the runtime designers are left with total execution time as their only guiding mechanism. In this talk we will discuss new methods for illuminating the dark corners of task scheduling on real hardware. We will present our work on extending PAPI—which has long been the de facto standard for accessing hardware events—so that it can be used to access software events. We will focus specifically on the impact this work can have on runtime systems with dynamic schedulers, and discuss illustrative examples. %I 14th Scheduling for Large Scale Systems Workshop %C Bordeaux, France %8 2019-06 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T Accelerating NWChem Coupled Cluster through dataflow-based Execution %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort of converting NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. 
We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B The International Journal of High Performance Computing Applications %V 32 %P 540-551 %8 2018-07 %G eng %U http://journals.sagepub.com/doi/10.1177/1094342016672543 %N 4 %9 Journal Article %& 540 %R 10.1177/1094342016672543 %0 Journal Article %J Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms %D 2018 %T Evaluation of Dataflow Programming Models for Electronic Structure Theory %A Heike Jagode %A Anthony Danalis %A Reazul Hoque %A Mathieu Faverge %A Jack Dongarra %K CCSD %K coupled cluster methods %K dataflow %K NWChem %K OpenMP %K parsec %K StarPU %K task-based runtime %X Dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. In this paper, we evaluate different dataflow programming models for electronic structure methods and compare them in terms of programmability, resource utilization, and scalability. In particular, we evaluate two programming paradigms for expressing scientific applications in a dataflow form: (1) explicit dataflow, where the dataflow is specified explicitly by the developer, and (2) implicit dataflow, where a task scheduling runtime derives the dataflow using per-task data-access information embedded in a serial program. We discuss our findings and present a thorough experimental analysis using methods from the NWChem quantum chemistry application as our case study, and OpenMP, StarPU, and PaRSEC as the task-based runtimes that enable the different forms of dataflow execution. Furthermore, we derive an abstract model to explore the limits of the different dataflow programming paradigms. %B Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms %V 2018 %P 1–20 %8 2018-05 %G eng %N e4490 %R https://doi.org/10.1002/cpe.4490 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2018 %T Investigating Power Capping toward Energy-Efficient Scientific Applications %A Azzam Haidar %A Heike Jagode %A Phil Vaccaro %A Asim YarKhan %A Stanimire Tomov %A Jack Dongarra %K energy efficiency %K High Performance Computing %K Intel Xeon Phi %K Knights landing %K papi %K performance analysis %K Performance Counters %K power efficiency %X The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. 
In this paper, we study and investigate the practice of controlling a compute system's power usage, and we explore how different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes. Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms. %B Concurrency and Computation: Practice and Experience %V 2018 %P 1-14 %8 2018-04 %G eng %U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4485 %N e4485 %R https://doi.org/10.1002/cpe.4485 %0 Generic %D 2018 %T PAPI: Counting outside the Box %A Anthony Danalis %A Heike Jagode %A Jack Dongarra %I 8th JLESC Meeting %C Barcelona, Spain %8 2018-04 %G eng %0 Generic %D 2018 %T PAPI's New Software-Defined Events for In-Depth Performance Analysis %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %X One of the most recent developments of the Performance API (PAPI) is the addition of Software-Defined Events (SDE). PAPI has successfully served the role of the abstraction and unification layer for hardware performance counters for over a decade. This talk presents our effort to extend this role to encompass performance-critical information that does not originate in hardware, but rather in critical software layers, such as libraries and runtime systems. Our overall objective is to enable monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through one consistent PAPI interface. Performance analysts will be able to form a complete picture of the entire application performance without learning new instrumentation primitives. In this talk, we outline PAPI's new SDE API and showcase the usefulness of SDE through its employment in software layers as diverse as the math library MAGMA, the dataflow runtime PaRSEC, and the state-of-the-art chemistry application NWChem. We outline the process of instrumenting these software packages and highlight the performance information that can be acquired with SDEs. 
%I CCDSC 2018: Workshop on Clusters, Clouds, and Data for Scientific Computing %C Lyon, France %8 2018-09 %G eng %0 Generic %D 2018 %T Software-Defined Events (SDEs) in MAGMA-Sparse %A Heike Jagode %A Anthony Danalis %A Hartwig Anzt %A Ichitaro Yamazaki %A Mark Hoemmen %A Erik Boman %A Stanimire Tomov %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-12 %G eng %0 Generic %D 2018 %T Software-Defined Events through PAPI for In-Depth Analysis of Application Performance %A Anthony Danalis %A Heike Jagode %A Jack Dongarra %I 5th Platform for Advanced Scientific Computing Conference (PASC18) %C Basel, Switzerland %8 2018-07 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2017 %T Accelerating NWChem Coupled Cluster through Dataflow-Based Execution %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort of converting NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B The International Journal of High Performance Computing Applications %P 1–13 %8 2017-01 %G eng %U http://journals.sagepub.com/doi/10.1177/1094342016672543 %R 10.1177/1094342016672543 %0 Generic %D 2017 %T Dataflow Programming Paradigms for Computational Chemistry Methods %A Heike Jagode %X The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability. This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. 
In particular, many-body methods, such as Coupled Cluster methods (CC), which are the "gold standard" to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains. Two programming paradigms for expressing CC methods in a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, where the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, where a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %C Knoxville, TN %8 2017-05 %U http://trace.tennessee.edu/utk_graddiss/4469/ %9 PhD Dissertation (Computer Science) %0 Book Section %B Exascale Scientific Applications: Scalability and Performance Portability %D 2017 %T Performance Analysis and Debugging Tools at Scale %A Scott Parker %A John Mellor-Crummey %A Dong H. Ahn %A Heike Jagode %A Holger Brunst %A Sameer Shende %A Allen D. Malony %A David DelSignore %A Ronny Tschuter %A Ralph Castain %A Kevin Harms %A Philip Carns %A Ray Loy %A Kalyan Kumaran %X This chapter explores present-day challenges and those likely to arise as new hardware and software technologies are introduced on the path to exascale. It covers some of the underlying hardware, software, and techniques that enable tools and debuggers. Performance tools and debuggers are critical components that enable computational scientists to fully exploit the computing power of high-performance computing systems. Instrumentation is the insertion of code to perform measurement in a program. It is a vital step in performance analysis, especially for parallel programs. The essence of a debugging tool is enabling observation, exploration, and control of program state, such that a developer can, for example, verify that what is currently occurring correlates to what is intended. The increased complexity and volume of performance and debugging data likely to be seen on exascale systems risk overwhelming tool users. Tools and debuggers may need to develop advanced techniques such as automated filtering and analysis to reduce the complexity seen by the user. %B Exascale Scientific Applications: Scalability and Performance Portability %I Chapman & Hall / CRC Press %P 17-50 %8 2017-11 %@ 9781315277400 %G eng %& 2 %R https://doi.org/10.1201/b21930 %0 Conference Paper %B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist %D 2017 %T Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi %A Azzam Haidar %A Heike Jagode %A Asim YarKhan %A Phil Vaccaro %A Stanimire Tomov %A Jack Dongarra %X The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, in particular for peta- and exascale systems. Understanding and improving the energy efficiency of numerical simulation becomes crucial. 
We present a detailed study and investigation toward controlling power usage and exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and we determine the impact on and correlation with the performance of scientific applications. Our analysis is performed using a set of representative kernels, as well as many highly used scientific benchmarks. We quantify a number of power and performance measurements, and draw observations and conclusions that can be viewed as a roadmap toward achieving energy-efficient computing algorithms. %B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist %I IEEE %C Waltham, MA %8 2017-09 %G eng %R https://doi.org/10.1109/HPEC.2017.8091085 %0 Generic %D 2017 %T Power-Aware HPC on Intel Xeon Phi KNL Processors %A Azzam Haidar %A Heike Jagode %A Asim YarKhan %A Phil Vaccaro %A Stanimire Tomov %A Jack Dongarra %I ISC High Performance (ISC17), Intel Booth Presentation %C Frankfurt, Germany %8 2017-06 %G eng %0 Conference Proceedings %B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany %D 2016 %T Power Management and Event Verification in PAPI %A Heike Jagode %A Asim YarKhan %A Anthony Danalis %A Jack Dongarra %X For more than a decade, the PAPI performance monitoring library has helped to implement the familiar maxim attributed to Lord Kelvin: “If you cannot measure it, you cannot improve it.” Widely deployed and widely used, PAPI provides a generic, portable interface for the hardware performance counters available on all modern CPUs and some other components of interest that are scattered across the chip and system. Recent and radical changes in processor and system design—systems that combine multicore CPUs and accelerators, shared and distributed memory, PCI-express and other interconnects—as well as the emergence of power efficiency as a primary design constraint, and reduced data movement as a primary programming goal, pose new challenges and bring new opportunities to PAPI. We discuss new developments of PAPI that allow for multiple sources of performance data to be measured simultaneously via a common software interface. Specifically, a new PAPI component that controls power is discussed. We explore the challenges of shared hardware counters that include system-wide measurements in existing multicore architectures. We conclude with an exploration of future directions for the PAPI interface. %B Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany %I Springer International Publishing %C Dresden, Germany %P 41-51 %@ 978-3-319-39589-0 %G eng %R https://doi.org/10.1007/978-3-319-39589-0_4 %0 Conference Paper %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %D 2015 %T Accelerating NWChem Coupled Cluster through dataflow-based Execution %A Heike Jagode %A Anthony Danalis %A George Bosilca %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. 
In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelised in coarse chunks. In this paper, we present our effort of converting NWChem’s CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller) – a software package designed to enable high performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared to the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %I Springer International Publishing %C Krakow, Poland %8 2015-09 %G eng %0 Conference Paper %B 2015 IEEE International Conference on Cluster Computing %D 2015 %T PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution %A Anthony Danalis %A Heike Jagode %A George Bosilca %A Jack Dongarra %K dag %K parsec %K ptg %K tasks %X Task-based execution has been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. The Parallel Runtime Scheduling and Execution Control (PaRSEC) framework is a task-based runtime system that we designed to achieve high performance computing at scale. PaRSEC offers a programming paradigm that is different from what has been traditionally used to develop large scale parallel scientific applications. In this paper, we discuss the use of PaRSEC to convert a part of the Coupled Cluster (CC) component of the Quantum Chemistry package NWChem into a task-based form. We explain how we organized the computation of the CC methods in individual tasks with explicitly defined data dependencies between them and re-integrated the modified code into NWChem. We present a thorough performance evaluation and demonstrate that the modified code outperforms the original by more than a factor of two. We also compare the performance of different variants of the modified code and explain the different behaviors that lead to the differences in performance. %B 2015 IEEE International Conference on Cluster Computing %I IEEE %C Chicago, IL %8 2015-09 %G eng %0 Conference Paper %B International Conference on Parallel Processing (ICPP'11) %D 2011 %T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs %A Allen D. Malony %A Scott Biersdorff %A Sameer Shende %A Heike Jagode %A Stanimire Tomov %A Guido Juckeland %A Robert Dietrich %A Duncan Poole %A Christopher Lamb %K magma %K mumi %K papi %X The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. 
A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support. %B International Conference on Parallel Processing (ICPP'11) %I ACM %C Taipei, Taiwan %8 2011-09 %@ 978-0-7695-4510-3 %G eng %R 10.1109/ICPP.2011.71 %0 Generic %D 2011 %T Power-aware Computing on GPGPUs %A Kiran Kasichayanula %A Haihang You %A Shirley Moore %A Stanimire Tomov %A Heike Jagode %A Matt Johnson %I Fall Creek Falls Conference, Poster %C Gatlinburg, TN %8 2011-09 %G eng %0 Journal Article %J Tools for High Performance Computing 2009 %D 2010 %T Collecting Performance Data with PAPI-C %A Dan Terpstra %A Heike Jagode %A Haihang You %A Jack Dongarra %K mumi %K papi %X Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface. %B Tools for High Performance Computing 2009 %I Springer Berlin / Heidelberg %C 3rd Parallel Tools Workshop, Dresden, Germany %P 157-173 %8 2010-05 %G eng %R https://doi.org/10.1007/978-3-642-11261-4_11 %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2010 %T Trace-based Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Andreas Knuepfer %A Jack Dongarra %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %B International Journal of High Performance Computing Applications (to appear) %8 2010-00 %G eng %0 Conference Proceedings %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %D 2009 %T A Holistic Approach for Performance Measurement and Analysis for Petascale Applications %A Heike Jagode %A Jack Dongarra %A Sadaf Alam %A Jeffrey Vetter %A W. Spear %A Allen D. 
Malony %E Gabrielle Allen %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %I Springer-Verlag Berlin Heidelberg %C Baton Rouge, Louisiana %V 2009 %P 686-695 %8 2009-05 %G eng %0 Journal Article %J Euro-Par 2009, Lecture Notes in Computer Science %D 2009 %T Impact of Quad-core Cray XT4 System and Software Stack on Scientific Computation %A Sadaf Alam %A Richard F. Barrett %A Heike Jagode %A J. A. Kuehn %A Steve W. Poole %A R. Sankaran %B Euro-Par 2009, Lecture Notes in Computer Science %I Springer Berlin / Heidelberg %C Delft, The Netherlands %V 5704/2009 %P 334-344 %8 2009-08 %G eng %0 Journal Article %J ISC'09 %D 2009 %T I/O Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Shirley Moore %A Dan Terpstra %A Jack Dongarra %A Andreas Knuepfer %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %B ISC'09 %C Hamburg, Germany %8 2009-06 %G eng %0 Conference Proceedings %B SciDAC 2009, Journal of Physics: Conference Series %D 2009 %T Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team %A Bronis R. de Supinski %A Sadaf Alam %A David Bailey %A Laura Carrington %A Chris Daley %A Anshu Dubey %A Todd Gamblin %A Dan Gunter %A Paul D. Hovland %A Heike Jagode %A Karen Karavanic %A Gabriel Marin %A John Mellor-Crummey %A Shirley Moore %A Boyana Norris %A Leonid Oliker %A Catherine Olschanowsky %A Philip C. Roth %A Martin Schulz %A Sameer Shende %A Allan Snavely %B SciDAC 2009, Journal of Physics: Conference Series %I IOP Publishing %C San Diego, California %V 180(2009)012039 %8 2009-07 %G eng %0 Generic %D 2009 %T Trace-based Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Andreas Knuepfer %A Jack Dongarra %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %B Innovative Computing Laboratory Technical Report %8 2009-04 %G eng %0 Conference Proceedings %B Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-08) %D 2008 %T Custom assignment of MPI ranks for parallel multi-dimensional FFTs: Evaluation of BG/P versus BG/L %A Heike Jagode %A Joachim Hein %B Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-08) %I IEEE Computer Society %C Sydney, Australia %P 271-283 %8 2008-01 %G eng %0 Generic %D 2008 %T Task placement of parallel multi-dimensional FFTs on a mesh communication network %A Heike Jagode %A Joachim Hein %A Arthur Trew %B University of Tennessee Computer Science Technical Report %8 2008-01 %G eng