%0 Journal Article
%J Future Generation Computer Systems
%D 2010
%T Self-Healing Network for Scalable Fault-Tolerant Runtime Environments
%A Thara Angskun
%A Graham Fagg
%A George Bosilca
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%B Future Generation Computer Systems
%V 26
%P 479-485
%8 2010-03
%G eng

%0 Journal Article
%J Euro-Par 2007
%D 2007
%T Decision Trees and MPI Collective Algorithm Selection Problem
%A Jelena Pjesivac–Grbovic
%A George Bosilca
%A Graham Fagg
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B Euro-Par 2007
%I Springer
%C Rennes, France
%P 105-115
%8 2007-08
%G eng

%0 Journal Article
%J Parallel Computing (Special Edition: EuroPVM/MPI 2006)
%D 2007
%T MPI Collective Algorithm Selection and Quadtree Encoding
%A Jelena Pjesivac–Grbovic
%A George Bosilca
%A Graham Fagg
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B Parallel Computing (Special Edition: EuroPVM/MPI 2006)
%I Elsevier
%8 2007-00
%G eng

%0 Book Section
%B Distributed and Parallel Systems
%D 2007
%T A New Approach to MPI Collective Communication Implementations
%A Torsten Hoefler
%A Jeffrey M. Squyres
%A Graham Fagg
%A George Bosilca
%A Wolfgang Rehm
%A Andrew Lumsdaine
%K Automatic Selection
%K Collective Operation
%K Framework
%K Message Passing (MPI)
%K Open MPI
%X Recent research into the optimization of collective MPI operations has resulted in a wide variety of algorithms and corresponding implementations, each typically only applicable in a relatively narrow scope: on a specific architecture, on a specific network, with a specific number of processes, with a specific data size and/or data-type – or any combination of these (or other) factors. This situation presents an enormous challenge to portable MPI implementations which are expected to provide optimized collective operation performance on all platforms. Many portable implementations have attempted to provide a token number of algorithms that are intended to realize good performance on most systems. However, many platform configurations are still left without well-tuned collective operations. This paper presents a proposal for a framework that will allow a wide variety of collective algorithm implementations and a flexible, multi-tiered selection process for choosing which implementation to use when an application invokes an MPI collective function.
%B Distributed and Parallel Systems
%I Springer US
%P 45-54
%@ 978-0-387-69857-1
%G eng
%R 10.1007/978-0-387-69858-8_5

%0 Journal Article
%J Cluster Computing
%D 2007
%T Performance Analysis of MPI Collective Operations
%A Jelena Pjesivac–Grbovic
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Edgar Gabriel
%A Jack Dongarra
%K ftmpi
%B Cluster Computing
%I Springer Netherlands
%V 10
%P 127-143
%8 2007-06
%G eng

%0 Conference Proceedings
%B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
%D 2007
%T Reliability Analysis of Self-Healing Network using Discrete-Event Simulation
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%K ftmpi
%B Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
%I IEEE Computer Society
%P 437-444
%8 2007-05
%G eng

%0 Journal Article
%J 2006 Euro PVM/MPI (submitted)
%D 2006
%T Flexible collective communication tuning architecture applied to Open MPI
%A Graham Fagg
%A Jelena Pjesivac–Grbovic
%A George Bosilca
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%B 2006 Euro PVM/MPI (submitted)
%C Bonn, Germany
%8 2006-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2006
%T FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study
%A David Dewolfs
%A Jan Broeckhove
%A Vaidy Sunderam
%A Graham Fagg
%K ftmpi
%B Lecture Notes in Computer Science
%I Springer Berlin / Heidelberg
%V 4192
%P 133-140
%8 2006-00
%G eng

%0 Journal Article
%J Euro PVM/MPI 2006
%D 2006
%T Implementation and Usage of the PERUSE-Interface in Open MPI
%A Rainer Keller
%A George Bosilca
%A Graham Fagg
%A Michael Resch
%A Jack Dongarra
%B Euro PVM/MPI 2006
%C Bonn, Germany
%8 2006-09
%G eng

%0 Generic
%D 2006
%T MPI Collective Algorithm Selection and Quadtree Encoding
%A Jelena Pjesivac–Grbovic
%A Graham Fagg
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B ICL Technical Report
%8 2006-00
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science
%D 2006
%T MPI Collective Algorithm Selection and Quadtree Encoding
%A Jelena Pjesivac–Grbovic
%A Graham Fagg
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%B Lecture Notes in Computer Science
%I Springer Berlin / Heidelberg
%V 4192
%P 40-48
%8 2006-09
%G eng

%0 Conference Proceedings
%B Proceedings of IEEE CCGrid 2006
%D 2006
%T Proposal of MPI operation level Checkpoint/Rollback and one implementation
%A Yuan Tang
%A Graham Fagg
%A Jack Dongarra
%K HARNESS/FT-PI
%B Proceedings of IEEE CCGrid 2006
%I IEEE Computer Society
%8 2006-01
%G eng

%0 Journal Article
%J 2006 Euro PVM/MPI
%D 2006
%T Scalable Fault Tolerant Protocol for Parallel Runtime Environments
%A Thara Angskun
%A Graham Fagg
%A George Bosilca
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%K ftmpi
%B 2006 Euro PVM/MPI
%C Bonn, Germany
%8 2006-00
%G eng

%0 Journal Article
%J IBM Journal of Research and Development
%D 2006
%T Self Adapting Numerical Software SANS Effort
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Victor Eijkhout
%A Graham Fagg
%A Erika Fuentes
%A Julien Langou
%A Piotr Luszczek
%A Jelena Pjesivac–Grbovic
%A Keith Seymour
%A Haihang You
%A Sathish Vadhiyar
%K gco
%B IBM Journal of Research and Development
%V 50
%P 223-238
%8 2006-01
%G eng

%0 Conference Proceedings
%B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems
%D 2006
%T Self-Healing Network for Scalable Fault Tolerant Runtime Environments
%A Thara Angskun
%A Graham Fagg
%A George Bosilca
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%B DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems
%C Innsbruck, Austria
%8 2006-01
%G eng

%0 Conference Proceedings
%B Proceedings of DoD HPCMP UGC 2005 (to appear)
%D 2005
%T Dynamic Process Management for Pipelined Applications
%A David Cronk
%A Graham Fagg
%A Susan Emeny
%A Scott Tucker
%B Proceedings of DoD HPCMP UGC 2005 (to appear)
%I IEEE
%C Nashville, TN
%8 2005-01
%G eng

%0 Conference Proceedings
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%D 2005
%T Fault Tolerant High Performance Computing by a Coding Approach
%A Zizhong Chen
%A Graham Fagg
%A Edgar Gabriel
%A Julien Langou
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%C Chicago, Illinois
%8 2005-01
%G eng

%0 Conference Proceedings
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%D 2005
%T Hash Functions for Datatype Signatures in MPI
%A George Bosilca
%A Jack Dongarra
%A Graham Fagg
%A Julien Langou
%E Beniamino Di Martino
%K ftmpi
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%I Springer-Verlag Berlin
%C Sorrento (Naples), Italy
%V 3666
%P 76-83
%8 2005-09
%G eng

%0 Conference Proceedings
%B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05)
%D 2005
%T Performance Analysis of MPI Collective Operations
%A Jelena Pjesivac–Grbovic
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Edgar Gabriel
%A Jack Dongarra
%K ftmpi
%B 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05)
%C Denver, Colorado
%8 2005-04
%G eng

%0 Journal Article
%J Cluster Computing Journal (to appear)
%D 2005
%T Performance Analysis of MPI Collective Operations
%A Jelena Pjesivac–Grbovic
%A Thara Angskun
%A George Bosilca
%A Graham Fagg
%A Edgar Gabriel
%A Jack Dongarra
%K ftmpi
%B Cluster Computing Journal (to appear)
%8 2005-01
%G eng

%0 Conference Proceedings
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%D 2005
%T Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
%A Graham Fagg
%A Thara Angskun
%A George Bosilca
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%E Beniamino Di Martino
%K ftmpi
%B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI
%I Springer-Verlag Berlin
%C Sorrento (Naples), Italy
%V 3666
%P 67
%8 2005-09
%G eng

%0 Generic
%D 2005
%T Towards an Accurate Model for Collective Communications
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%B ICL Technical Report
%8 2005-01
%G eng

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing (to appear)
%D 2004
%T Building and using a Fault Tolerant MPI implementation
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%K lacsi
%K sans
%B International Journal of High Performance Applications and Supercomputing (to appear)
%8 2004-00
%G eng

%0 Conference Proceedings
%B Proceedings of ISC2004 (to appear)
%D 2004
%T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems
%A Graham Fagg
%A Edgar Gabriel
%A George Bosilca
%A Thara Angskun
%A Zizhong Chen
%A Jelena Pjesivac–Grbovic
%A Kevin London
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Proceedings of ISC2004 (to appear)
%C Heidelberg, Germany
%8 2004-06
%G eng

%0 Journal Article
%J International Journal for High Performance Applications and Supercomputing (to appear)
%D 2004
%T Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing
%A Graham Fagg
%A Edgar Gabriel
%A Zizhong Chen
%A Thara Angskun
%A George Bosilca
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%K ftmpi
%K lacsi
%B International Journal for High Performance Applications and Supercomputing (to appear)
%8 2004-04
%G eng

%0 Journal Article
%J International Journal of High Performance Applications, Special Issue: Automatic Performance Tuning
%D 2004
%T Towards an Accurate Model for Collective Communications
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%K lacsi
%B International Journal of High Performance Applications, Special Issue: Automatic Performance Tuning
%V 18
%P 159-167
%8 2004-01
%G eng

%0 Journal Article
%J Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting
%D 2003
%T Evaluating The Performance Of MPI-2 Dynamic Communicators And One-Sided Communication
%A Edgar Gabriel
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting
%I Springer-Verlag, Berlin
%C Venice, Italy
%V 2840
%P 88-97
%8 2003-09
%G eng

%0 Conference Proceedings
%B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented)
%D 2003
%T Fault Tolerant Communication Library and Applications for High Performance Computing
%A Graham Fagg
%A Edgar Gabriel
%A Zizhong Chen
%A Thara Angskun
%A George Bosilca
%A Antonin Bukovsky
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented)
%C Santa Fe, NM
%8 2003-10
%G eng

%0 Conference Proceedings
%B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science
%D 2003
%T A Fault-Tolerant Communication Library for Grid Environments
%A Edgar Gabriel
%A Graham Fagg
%A Antonin Bukovsky
%A Thara Angskun
%A Jack Dongarra
%K ftmpi
%K lacsi
%B 17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science
%C San Francisco
%8 2003-06
%G eng

%0 Journal Article
%J Future Generation Computer Systems
%D 2002
%T HARNESS Fault Tolerant MPI Design, Usage and Performance Issues
%A Graham Fagg
%A Jack Dongarra
%B Future Generation Computer Systems
%V 18
%P 1127-1142
%8 2002-01
%G eng

%0 Conference Proceedings
%B Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002)
%D 2002
%T The Internet BackPlane Protocol: A Study in Resource Sharing
%A Alessandro Bassi
%A Micah Beck
%A Graham Fagg
%A Terry Moore
%A James Plank
%A Martin Swany
%A Rich Wolski
%K ftmpi
%B Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002)
%C Berlin, Germany
%8 2002-10
%G eng

%0 Conference Proceedings
%B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science
%D 2001
%T Fault Tolerant MPI for the HARNESS Meta-Computing System
%A Graham Fagg
%A Antonin Bukovsky
%A Jack Dongarra
%E Benjoe A. Juliano
%E R. Renner
%E K. Tan
%K ftmpi
%K harness
%B Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science
%I Springer Verlag
%C Berlin
%V 2073
%P 355-366
%8 2001-00
%G eng
%R 10.1007/3-540-45545-0_44

%0 Journal Article
%J Parallel Computing
%D 2001
%T HARNESS and Fault Tolerant MPI
%A Graham Fagg
%A Antonin Bukovsky
%A Jack Dongarra
%B Parallel Computing
%V 27
%P 1479-1496
%8 2001-01
%G eng

%0 Journal Article
%J International Journal of High Performance Applications and Supercomputing
%D 2001
%T Numerical Libraries and The Grid
%A Antoine Petitet
%A Susan Blackford
%A Jack Dongarra
%A Brett Ellis
%A Graham Fagg
%A Kenneth Roche
%A Sathish Vadhiyar
%K grads
%B International Journal of High Performance Applications and Supercomputing
%V 15
%P 359-374
%8 2001-01
%G eng

%0 Generic
%D 2001
%T Numerical Libraries and The Grid: The Grads Experiments with ScaLAPACK
%A Antoine Petitet
%A Susan Blackford
%A Jack Dongarra
%A Brett Ellis
%A Graham Fagg
%A Kenneth Roche
%A Sathish Vadhiyar
%K grads
%K scalapack
%B University of Tennessee Computer Science Technical Report
%8 2001-01
%G eng

%0 Conference Proceedings
%B Department of Defense Users' Group Conference Proceedings (to appear)
%D 2001
%T Parallel I/O for EQM Applications
%A David Cronk
%A Graham Fagg
%A Shirley Moore
%K ftmpi
%B Department of Defense Users' Group Conference Proceedings (to appear)
%C Biloxi, Mississippi
%8 2001-06
%G eng

%0 Journal Article
%J 8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science
%D 2001
%T Parallel IO Support for Meta-Computing Applications: MPI_Connect IO Applied to PACX-MPI
%A Graham Fagg
%A Edgar Gabriel
%A Michael Resch
%K ftmpi
%B 8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science
%I Springer Verlag, Berlin
%C Greece
%V 2131
%8 2001-09
%G eng

%0 Conference Proceedings
%B LACSI Symposium 2001
%D 2001
%T Performance Modeling for Self Adapting Collective Communications for MPI
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B LACSI Symposium 2001
%C Santa Fe, NM
%8 2001-10
%G eng

%0 Conference Proceedings
%B Proceedings of SuperComputing 2000 (SC'2000)
%D 2000
%T Automatically Tuned Collective Communications
%A Sathish Vadhiyar
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B Proceedings of SuperComputing 2000 (SC'2000)
%C Dallas, TX
%8 2000-11
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000
%D 2000
%T FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
%A Graham Fagg
%A Jack Dongarra
%K ftmpi
%B Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000
%I Springer Verlag
%C Hungary
%V 1908
%P 346-353
%8 2000-01
%G eng

%0 Journal Article
%J Encyclopedia of Electrical and Engineering, Supplement 1
%D 2000
%T Message Passing Software Systems
%A Jack Dongarra
%A Graham Fagg
%A Rolf Hempel
%A David W. Walker
%E J. Webster
%K ftmpi
%B Encyclopedia of Electrical and Engineering, Supplement 1
%I John Wiley & Sons, Inc.
%8 2000-00
%G eng

%0 Generic
%D 2000
%T Metacomputing: An Evaluation of Emerging Systems
%A David Cronk
%A Brett Ellis
%A Graham Fagg
%B University of Tennessee Computer Science Department Technical Report
%8 2000-07
%G eng

%0 Generic
%D 2000
%T Secure Remote Access to Numerical Software and Computation Hardware
%A Dorian Arnold
%A Shirley Browne
%A Jack Dongarra
%A Graham Fagg
%A Keith Moore
%B University of Tennessee Computer Science Technical Report, UT-CS-00-446
%8 2000-07
%G eng

%0 Conference Proceedings
%B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000
%D 2000
%T Secure Remote Access to Numerical Software and Computational Hardware
%A Dorian Arnold
%A Shirley Browne
%A Jack Dongarra
%A Graham Fagg
%A Keith Moore
%K netsolve
%B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000
%C Albuquerque, NM
%8 2000-06
%G eng

%0 Journal Article
%J International Journal on Future Generation Computer Systems
%D 1999
%T HARNESS: A Next Generation Distributed Virtual Machine
%A Micah Beck
%A Jack Dongarra
%A Graham Fagg
%A Al Geist
%A Paul Gray
%A James Kohl
%A Mauro Migliardi
%A Keith Moore
%A Terry Moore
%A Philip Papadopoulos
%A Stephen L. Scott
%A Vaidy Sunderam
%K harness
%B International Journal on Future Generation Computer Systems
%V 15
%P 571-582
%8 1999-01
%G eng

%0 Journal Article
%J Journal on Future Generation Computer Systems
%D 1999
%T Scalable Networked Information Processing Environment (SNIPE)
%A Graham Fagg
%A Keith Moore
%A Jack Dongarra
%K harness
%B Journal on Future Generation Computer Systems
%V 15
%P 595-605
%8 1999-01
%G eng