%0 Journal Article
%J TeraGrid'11
%D 2011
%T Autotuned Parallel I/O for Highly Scalable Biosequence Analysis
%A Haihang You
%A Bhanu Rekapalli
%A Qing Liu
%A Shirley Moore
%B TeraGrid'11
%C Salt Lake City, Utah
%8 2011-07
%G eng

%0 Conference Proceedings
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%D 2011
%T The Design of an Auto-tuning I/O Framework on Cray XT5 System
%A Haihang You
%A Qing Liu
%A Zhiqiang Li
%A Shirley Moore
%K gco
%B Cray Users Group Conference (CUG'11) (Best Paper Finalist)
%C Fairbanks, Alaska
%8 2011-05
%G eng

%0 Generic
%D 2011
%T Power-aware Computing on GPGPUs
%A Kiran Kasichayanula
%A Haihang You
%A Shirley Moore
%A Stanimire Tomov
%A Heike Jagode
%A Matt Johnson
%I Fall Creek Falls Conference, Poster
%C Gatlinburg, TN
%8 2011-09
%G eng

%0 Journal Article
%J Tools for High Performance Computing 2009
%D 2010
%T Collecting Performance Data with PAPI-C
%A Dan Terpstra
%A Heike Jagode
%A Haihang You
%A Jack Dongarra
%K mumi
%K papi
%X Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed.
We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface.
%B Tools for High Performance Computing 2009
%I Springer Berlin / Heidelberg
%C 3rd Parallel Tools Workshop, Dresden, Germany
%P 157-173
%8 2010-05
%G eng
%R https://doi.org/10.1007/978-3-642-11261-4_11

%0 Journal Article
%J Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%D 2009
%T Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Dmitrii Zagorodnov
%A Jack Dongarra
%A Rich Wolski
%B Cluster Computing Journal: Special Issue on High Performance Distributed Computing
%I Springer Netherlands
%V 12
%P 101-122
%8 2009-00
%G eng

%0 Conference Proceedings
%B The 3rd international Workshop on Automatic Performance Tuning
%D 2008
%T A Comparison of Search Heuristics for Empirical Code Optimization
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%K gco
%B The 3rd international Workshop on Automatic Performance Tuning
%C Tsukuba, Japan
%8 2008-10
%G eng

%0 Conference Proceedings
%B ACM/IEEE International Symposium on High Performance Distributed Computing
%D 2008
%T The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software
%A Lamia Youseff
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%A Rich Wolski
%K gco
%K netsolve
%B ACM/IEEE International Symposium on High Performance Distributed Computing
%C Boston, MA.
%8 2008-06
%G eng

%0 Journal Article
%J Proc. SciDAC 2008
%D 2008
%T PERI Auto-tuning
%A David Bailey
%A Jacqueline Chame
%A Chun Chen
%A Jack Dongarra
%A Mary Hall
%A Jeffrey K. Hollingsworth
%A Paul D. Hovland
%A Shirley Moore
%A Keith Seymour
%A Jaewook Shin
%A Ananta Tiwari
%A Sam Williams
%A Haihang You
%K gco
%B Proc.
SciDAC 2008
%I Journal of Physics
%C Seattle, Washington
%V 125
%8 2008-01
%G eng

%0 Generic
%D 2007
%T Automated Empirical Tuning of a Multiresolution Analysis Kernel
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%A Shirley Moore
%K gco
%B ICL Technical Report
%P 10
%8 2007-01
%G eng

%0 Generic
%D 2007
%T Empirical Tuning of a Multiresolution Analysis Kernel using a Specialized Code Generator
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%A Shirley Moore
%K gco
%B ICL Technical Report
%8 2007-01
%G eng

%0 Generic
%D 2006
%T ATLAS on the BlueGene/L – Preliminary Results
%A Keith Seymour
%A Haihang You
%A Jack Dongarra
%K gco
%B ICL Technical Report
%8 2006-01
%G eng

%0 Journal Article
%J IBM Journal of Research and Development
%D 2006
%T Self Adapting Numerical Software SANS Effort
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Victor Eijkhout
%A Graham Fagg
%A Erika Fuentes
%A Julien Langou
%A Piotr Luszczek
%A Jelena Pjesivac–Grbovic
%A Keith Seymour
%A Haihang You
%A Sathish Vadhiyar
%K gco
%B IBM Journal of Research and Development
%V 50
%P 223-238
%8 2006-01
%G eng

%0 Generic
%D 2005
%T An Effective Empirical Search Method for Automatic Software Tuning
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%K gco
%B ICL Technical Report
%8 2005-01
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2004)
%D 2004
%T Accurate Cache and TLB Characterization Using Hardware Counters
%A Jack Dongarra
%A Shirley Moore
%A Phil Mucci
%A Keith Seymour
%A Haihang You
%K gco
%K lacsi
%K papi
%X We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity.
Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions.
%B International Conference on Computational Science (ICCS 2004)
%I Springer
%C Krakow, Poland
%8 2004-06
%G eng
%R https://doi.org/10.1007/978-3-540-24688-6_57

%0 Conference Paper
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004)
%D 2004
%T Automatic Blocking of QR and LU Factorizations for Locality
%A Qing Yi
%A Ken Kennedy
%A Haihang You
%A Keith Seymour
%A Jack Dongarra
%K gco
%K papi
%K sans
%X QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004)
%I ACM
%C Washington, DC
%8 2004-06
%G eng
%R 10.1145/1065895.1065898

%0 Conference Paper
%B PADTAD Workshop, IPDPS 2003
%D 2003
%T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters
%A Jack Dongarra
%A Kevin London
%A Shirley Moore
%A Phil Mucci
%A Dan Terpstra
%A Haihang You
%A Min Zhou
%K lacsi
%K papi
%X The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
%B PADTAD Workshop, IPDPS 2003
%I IEEE
%C Nice, France
%8 2003-04
%@ 0-7695-1926-1
%G eng