%0 Generic %D 2015 %T Fault Tolerance Techniques for High-performance Computing %A Jack Dongarra %A Thomas Herault %A Yves Robert %X This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption). %B University of Tennessee Computer Science Technical Report (also LAWN 289) %I University of Tennessee %8 2015-05 %G eng %U http://www.netlib.org/lapack/lawnspdf/lawn289.pdf