%0 Journal Article
%J IEEE Transactions on Computers
%D 2009
%T Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
%A Zizhong Chen
%A Jack Dongarra
%X As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2\lceil \log p \rceil \cdot k((\beta + 2\gamma)m + \alpha)$ to $(1 + O(\sqrt{p}/\sqrt{m})) \, 2 \cdot k(\beta + 2\gamma)m$, where $\alpha$ is the communication latency, $1/\beta$ is the network bandwidth between processes, $1/\gamma$ is the rate to perform calculations, and $m$ is the size of local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to $(1 + O(1/\sqrt{m})) \cdot k(\beta + 2\gamma)m$, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example.
%B IEEE Transactions on Computers
%V 58
%P 1512-1524
%8 2009-11
%G eng
%N 11
%R https://doi.org/10.1109/TC.2009.42

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2008
%T Algorithm-Based Fault Tolerance for Fail-Stop Failures
%A Zizhong Chen
%A Jack Dongarra
%K FT-MPI
%K lapack
%K scalapack
%B IEEE Transactions on Parallel and Distributed Systems
%V 19
%8 2008-01
%G eng

%0 Journal Article
%J in Petascale Computing: Algorithms and Applications (to appear)
%D 2007
%T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach
%A Jack Dongarra
%A Zizhong Chen
%A George Bosilca
%A Julien Langou
%B in Petascale Computing: Algorithms and Applications (to appear)
%I Chapman & Hall - CRC Press
%8 2007-00
%G eng

%0 Journal Article
%J SIAM SISC (to appear)
%D 2007
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A Julien Langou
%A Zizhong Chen
%A George Bosilca
%A Jack Dongarra
%B SIAM SISC (to appear)
%8 2007-05
%G eng

%0 Conference Proceedings
%B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS
%D 2007
%T Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing
%A Zizhong Chen
%A Ming Yang
%A Guillermo Francia III
%A Jack Dongarra
%B Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS
%P 1-8
%8 2007-03
%G eng

%0 Conference Proceedings
%B IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium
%D 2006
%T Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
%A Zizhong Chen
%A Jack Dongarra
%B IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium
%C Rhodes Island, Greece
%8 2006-01
%G eng

%0 Journal Article
%J IBM Journal of Research and Development
%D 2006
%T Self Adapting Numerical Software SANS Effort
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Victor Eijkhout
%A Graham Fagg
%A Erika Fuentes
%A Julien Langou
%A Piotr Luszczek
%A Jelena Pjesivac–Grbovic
%A Keith Seymour
%A Haihang You
%A Sathish Vadhiyar
%K gco
%B IBM Journal of Research and Development
%V 50
%P 223-238
%8 2006-01
%G eng

%0 Generic
%D 2005
%T Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
%A Zizhong Chen
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report
%V –05-561
%8 2005-11
%G eng

%0 Generic
%D 2005
%T Condition Numbers of Gaussian Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ft-la
%B University of Tennessee Computer Science Department Technical Report
%V –04-539
%8 2005-00
%G eng

%0 Journal Article
%J SIAM Journal on Matrix Analysis and Applications (to appear)
%D 2005
%T Condition Numbers of Gaussian Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B SIAM Journal on Matrix Analysis and Applications (to appear)
%8 2005-01
%G eng

%0 Conference Proceedings
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%D 2005
%T Fault Tolerant High Performance Computing by a Coding Approach
%A Zizhong Chen
%A Graham Fagg
%A Edgar Gabriel
%A Julien Langou
%A Thara Angskun
%A George Bosilca
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear)
%C Chicago, Illinois
%8 2005-01
%G eng

%0 Conference Proceedings
%B The International Conference on Computational Science
%D 2005
%T Numerically Stable Real Number Codes Based on Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ftmpi
%K grads
%K lacsi
%K sans
%B The International Conference on Computational Science
%I LNCS 3514, Springer-Verlag
%C Atlanta, GA
%8 2005-01
%G eng

%0 Generic
%D 2005
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Julien Langou
%K ft-la
%B University of Tennessee Computer Science Department Technical Report, UT-CS-04-538
%8 2005-00
%G eng

%0 Conference Proceedings
%B Proceedings of ISC2004 (to appear)
%D 2004
%T Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems
%A Graham Fagg
%A Edgar Gabriel
%A George Bosilca
%A Thara Angskun
%A Zizhong Chen
%A Jelena Pjesivac–Grbovic
%A Kevin London
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Proceedings of ISC2004 (to appear)
%C Heidelberg, Germany
%8 2004-06
%G eng

%0 Conference Proceedings
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 04')
%D 2004
%T LAPACK for Clusters Project: An Example of Self Adapting Numerical Software
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 04')
%C Big Island, Hawaii
%V 9
%P 90282
%8 2004-01
%G eng

%0 Generic
%D 2004
%T Numerically Stable Real-Number Codes Based on Random Matrices
%A Zizhong Chen
%A Jack Dongarra
%K ftmpi
%B University of Tennessee Computer Science Department Technical Report
%V –04-526
%8 2004-10
%G eng

%0 Journal Article
%J International Journal for High Performance Applications and Supercomputing (to appear)
%D 2004
%T Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing
%A Graham Fagg
%A Edgar Gabriel
%A Zizhong Chen
%A Thara Angskun
%A George Bosilca
%A Jelena Pjesivac–Grbovic
%A Jack Dongarra
%K ftmpi
%K lacsi
%B International Journal for High Performance Applications and Supercomputing (to appear)
%8 2004-04
%G eng

%0 Generic
%D 2004
%T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
%A George Bosilca
%A Zizhong Chen
%A Jack Dongarra
%A Julien Langou
%B ICL Technical Report
%8 2004-01
%G eng

%0 Conference Proceedings
%B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented)
%D 2003
%T Fault Tolerant Communication Library and Applications for High Performance Computing
%A Graham Fagg
%A Edgar Gabriel
%A Zizhong Chen
%A Thara Angskun
%A George Bosilca
%A Antonin Bukovsky
%A Jack Dongarra
%K ftmpi
%K lacsi
%B Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented)
%C Santa Fe, NM
%8 2003-10
%G eng

%0 Journal Article
%J Parallel Computing
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%K lfc
%K sans
%B Parallel Computing
%V 29
%P 1723-1743
%8 2003-11
%G eng

%0 Generic
%D 2003
%T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160)
%A Zizhong Chen
%A Jack Dongarra
%A Piotr Luszczek
%A Kenneth Roche
%K lacsi
%B University of Tennessee Computer Science Technical Report, UT-CS-03-499
%8 2003-01
%G eng