Publications

Keyword: Fault tolerance
2015
Benoit, A., S. K. Raina, and Y. Robert, “Efficient Checkpoint/Verification Patterns,” The International Journal of High Performance Computing Applications, July 2015. DOI: 10.1177/1094342015594531
2018
Benoit, A., A. Cavelan, F. Cappello, P. Raghavan, Y. Robert, and H. Sun, “Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,” Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018. DOI: 10.1016/j.jpdc.2018.08.002
Bosilca, G., A. Bouteiller, A. Guermouche, T. Herault, Y. Robert, P. Sens, and J. Dongarra, “A Failure Detector for HPC Platforms,” The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018. DOI: 10.1177/1094342017711505
2019
Losada, N., A. Bouteiller, and G. Bosilca, “Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,” Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
2020
Hori, A., K. Yoshinaga, T. Herault, A. Bouteiller, G. Bosilca, and Y. Ishikawa, “Overhead of Using Spare Nodes,” The International Journal of High Performance Computing Applications, February 2020. DOI: 10.1177/1094342020901885