%0 Book Section %B Exascale Scientific Applications: Scalability and Performance Portability %D 2017 %T Performance Analysis and Debugging Tools at Scale %A Scott Parker %A John Mellor-Crummey %A Dong H. Ahn %A Heike Jagode %A Holger Brunst %A Sameer Shende %A Allen D. Malony %A David DelSignore %A Ronny Tschuter %A Ralph Castain %A Kevin Harms %A Philip Carns %A Ray Loy %A Kalyan Kumaran %X This chapter explores present-day challenges and those likely to arise as new hardware and software technologies are introduced on the path to exascale. It covers some of the underlying hardware, software, and techniques that enable tools and debuggers. Performance tools and debuggers are critical components that enable computational scientists to fully exploit the computing power of high-performance computing systems. Instrumentation is the insertion of code to perform measurement in a program. It is a vital step in performance analysis, especially for parallel programs. The essence of a debugging tool is enabling observation, exploration, and control of program state, such that a developer can, for example, verify that what is currently occurring corresponds to what is intended. The increased complexity and volume of performance and debugging data likely to be seen on exascale systems risk overwhelming tool users. Tools and debuggers may need to employ advanced techniques such as automated filtering and analysis to reduce the complexity seen by the user. %B Exascale Scientific Applications: Scalability and Performance Portability %I Chapman & Hall / CRC Press %P 17-50 %8 2017-11 %@ 9781315277400 %G eng %& 2 %R 10.1201/b21930 %0 Conference Paper %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %D 2013 %T Diagnosis and Optimization of Application Prefetching Performance %A Gabriel Marin %A Colin McCurdy %A Jeffrey Vetter %E Allen D. Malony %E Mario Nemirovsky %E Sam Midkiff %X Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, hardware prefetchers can track only a limited number of data streams due to finite hardware resources. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for understanding the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we try to understand the causes behind poor prefetching performance. We identify four prefetch-unfriendly conditions and show how to classify an application's memory references based on these conditions. We evaluate our analysis using the SPEC CPU2006 benchmark suite. We select two benchmarks with unfavorable access patterns and transform them to improve their prefetching effectiveness. Results show that making applications more prefetcher-friendly can yield meaningful performance gains.
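The streaming-concurrency idea in the abstract above can be illustrated with a small, self-contained sketch. This is not the paper's simulation algorithm: the stream-table size, the cache-line granularity, and the forward-only unit-stride rule are simplifying assumptions made here purely for illustration.

/*
 * Illustrative sketch only (not the algorithm from the paper): scan a recorded
 * address trace and count how many forward unit-stride streams it touches,
 * using a small table that mimics a prefetcher's finite stream-tracking
 * resources. MAX_STREAMS and CACHE_LINE are assumed values.
 */
#include <stdio.h>

#define MAX_STREAMS 16
#define CACHE_LINE  64

static int streaming_concurrency(const long *addr, int n)
{
    long stream_line[MAX_STREAMS];
    int nstreams = 0;

    for (int i = 0; i < n; i++) {
        long line = addr[i] / CACHE_LINE;
        int hit = 0;
        for (int j = 0; j < nstreams; j++) {
            /* the access continues an already-tracked stream */
            if (line == stream_line[j] || line == stream_line[j] + 1) {
                stream_line[j] = line;
                hit = 1;
                break;
            }
        }
        /* otherwise open a new logical stream, if a table slot is free */
        if (!hit && nstreams < MAX_STREAMS)
            stream_line[nstreams++] = line;
    }
    return nstreams;
}

int main(void)
{
    /* two interleaved unit-stride streams -> streaming concurrency of 2 */
    long trace[] = { 0, 4096, 64, 4160, 128, 4224, 192, 4288 };
    int n = (int)(sizeof trace / sizeof trace[0]);

    printf("streaming concurrency: %d\n", streaming_concurrency(trace, n));
    return 0;
}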
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %I ACM Press %C Eugene, Oregon, USA %8 2013-06 %@ 9781450321303 %G eng %U http://dl.acm.org/citation.cfm?doid=2464996.2465014 %R 10.1145/2464996.2465014 %0 Conference Paper %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %D 2013 %T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication %A Azzam Haidar %A Mark Gates %A Stanimire Tomov %A Jack Dongarra %E Allen D. Malony %E Mario Nemirovsky %E Sam Midkiff %K eigenvalue %K gpu communication %K gpu computation %K heterogeneous programming model %K performance %K reduction to tridiagonal %K singular value decomposition %K task parallelism %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges---starting from our algorithm design, kernel optimization and tuning, to our programming model---in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores. %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %I ACM Press %C Eugene, Oregon, USA %8 2013-06 %@ 9781450321303 %G eng %U http://dl.acm.org/citation.cfm?doid=2464996.2465438 %R 10.1145/2464996.2465438 %0 Conference Paper %B International Conference on Parallel Processing (ICPP'11) %D 2011 %T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs %A Allen D. Malony %A Scott Biersdorff %A Sameer Shende %A Heike Jagode %A Stanimire Tomov %A Guido Juckeland %A Robert Dietrich %A Duncan Poole %A Christopher Lamb %K magma %K mumi %K papi %X The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support.
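As a hedged illustration of the counter-based measurement that PAPI (named in the abstract above) provides, the sketch below times one hardware event around a region of interest using PAPI's standard event-set API. It is not the heterogeneous measurement infrastructure described in the paper; with a PAPI build that includes the CUDA/CUPTI component, a GPU event would be added by name in the same way, but the available event names are component- and device-specific (an assumption to verify with papi_native_avail on the target system). Compile with a flag such as -lpapi.

/*
 * Minimal PAPI sketch: count one hardware event around a measured region.
 * Generic PAPI usage only, not the tool integration from the paper; a
 * CUPTI-component event name could replace the PAPI_TOT_INS preset on a
 * system where that component is installed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long count = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return EXIT_FAILURE;
    }
    if (PAPI_create_eventset(&evset) != PAPI_OK ||
        PAPI_add_named_event(evset, "PAPI_TOT_INS") != PAPI_OK) {
        fprintf(stderr, "could not set up event set\n");
        return EXIT_FAILURE;
    }

    PAPI_start(evset);
    volatile double x = 0.0;              /* measured region */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    PAPI_stop(evset, &count);

    printf("instructions in measured region: %lld\n", count);
    return 0;
}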
%B International Conference on Parallel Processing (ICPP'11) %I ACM %C Taipei, Taiwan %8 2011-09 %@ 978-0-7695-4510-3 %G eng %R 10.1109/ICPP.2011.71 %0 Conference Proceedings %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %D 2009 %T A Holistic Approach for Performance Measurement and Analysis for Petascale Applications %A Heike Jagode %A Jack Dongarra %A Sadaf Alam %A Jeffrey Vetter %A W. Spear %A Allen D. Malony %E Gabrielle Allen %B ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing %I Springer-Verlag Berlin Heidelberg %C Baton Rouge, Louisiana %V 2009 %P 686-695 %8 2009-05 %G eng %0 Conference Proceedings %B Proc. DoD HPCMP Users Group Conference (HPCMP-UGC'07) %D 2007 %T Memory Leak Detection in Fortran Applications using TAU %A Sameer Shende %A Allen D. Malony %A Shirley Moore %A David Cronk %B Proc. DoD HPCMP Users Group Conference (HPCMP-UGC'07) %I IEEE Computer Society %C Pittsburgh, PA %8 2007-01 %G eng %0 Conference Proceedings %B Proceedings of the 2005 SciDAC Conference %D 2005 %T Performance Analysis of GYRO: A Tool Evaluation %A Patrick H. Worley %A Jeff Candy %A Laura Carrington %A Kevin Huck %A Timothy Kaiser %A Kumar Mahinthakumar %A Allen D. Malony %A Shirley Moore %A Dan Reed %A Philip C. Roth %A H. Shan %A Sameer Shende %A Allan Snavely %A S. Sreepathi %A Felix Wolf %A Y. Zhang %K kojak %B Proceedings of the 2005 SciDAC Conference %C San Francisco, CA %8 2005-06 %G eng %0 Conference Proceedings %B Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference %D 2005 %T Performance Profiling Overhead Compensation for MPI Programs %A Sameer Shende %A Allen D. Malony %A Alan Morris %A Felix Wolf %K kojak %B Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference %I Springer LNCS %8 2005-09 %G eng %0 Conference Proceedings %B Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference %D 2005 %T A Scalable Approach to MPI Application Performance Analysis %A Shirley Moore %A Felix Wolf %A Jack Dongarra %A Sameer Shende %A Allen D. Malony %A Bernd Mohr %K kojak %B Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference %I Springer LNCS %8 2005-09 %G eng %0 Conference Proceedings %B Proc. of the International Conference on High Performance Computing and Communications (HPCC) %D 2005 %T Trace-Based Parallel Performance Overhead Compensation %A Felix Wolf %A Allen D. Malony %A Sameer Shende %A Alan Morris %K kojak %B Proc. of the International Conference on High Performance Computing and Communications (HPCC) %C Sorrento (Naples), Italy %8 2005-09 %G eng %0 Conference Paper %B ICCS 2003 Terascale Workshop %D 2003 %T Performance Instrumentation and Measurement for Terascale Systems %A Jack Dongarra %A Allen D. Malony %A Shirley Moore %A Phil Mucci %A Sameer Shende %K papi %X As computer systems grow in size and complexity, tool support is needed to facilitate the efficient mapping of large-scale applications onto these systems. To help achieve this mapping, performance analysis tools must provide robust performance observation capabilities at all levels of the system, as well as map low-level behavior to high-level program constructs.
Instrumentation and measurement strategies, developed over the last several years, must evolve together with performance analysis infrastructure to address the challenges of new scalable parallel systems. %B ICCS 2003 Terascale Workshop %I Springer, Berlin, Heidelberg %C Melbourne, Australia %8 2003-06 %G eng %R 10.1007/3-540-44864-0_6
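One widely used instrumentation strategy behind tools such as TAU is interposition on the MPI profiling (PMPI) interface, sketched below in generic form as a hedged illustration of the instrumentation-and-measurement theme in the entry above. This is not code from any of the listed papers; the wrapper library, its global counters, and the report in MPI_Finalize are illustrative assumptions, and a real tool would attribute the measured time to call sites or calling contexts rather than a single per-rank total.

/*
 * Generic PMPI interposition sketch: compiled into a library linked ahead of
 * the MPI library, so application calls to MPI_Send are timed transparently
 * and the underlying implementation is reached through the PMPI_ entry
 * points. Illustrative only; real tools record far richer context.
 */
#include <stdio.h>
#include <mpi.h>

static double send_seconds = 0.0;
static long   send_calls   = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_seconds += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Send calls, %.6f s total\n",
           rank, send_calls, send_seconds);
    return PMPI_Finalize();
}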