Export 5 results:
Filters: Keyword is Fault tolerance  [Clear All Filters]
Benoit, A., A. Cavelan, F. Cappello, P. Raghavan, Y. Robert, and H. Sun, Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,” Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.  (837 KB)
Bosilca, G., A. Bouteiller, A. Guermouche, T. Herault, Y. Robert, P. Sens, and J. Dongarra, A Failure Detector for HPC Platforms,” The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.  (1.04 MB)
Benoit, A., S. K. Raina, and Y. Robert, Efficient Checkpoint/Verification Patterns,” International Journal on High Performance Computing Applications, July 2015.  (392.76 KB)
Bland, W., A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. Dongarra, An evaluation of User-Level Failure Mitigation support in MPI,” Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.  (311.23 KB)