Publications

Export 8 results:
Filters: Keyword is fault-tolerance  [Clear All Filters]
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
A
Aupy, G., A. Benoit, H. Casanova, and Y. Robert, Scheduling Computational Workflows on Failure-prone Platforms,” International Journal of Networking and Computing, vol. 6, no. 1, pp. 2-26, 2016.  (503.81 KB)
B
Bosilca, G., A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra, Assessing the Impact of ABFT and Checkpoint Composite Strategies,” 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.  (1.02 MB)
Bosilca, G., A. Bouteiller, A. Guermouche, T. Herault, Y. Robert, P. Sens, and J. Dongarra, Failure Detection and Propagation in HPC Systems,” Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.
Bosilca, G., A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra, Assessing the impact of ABFT and Checkpoint composite strategies,” University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.  (968.47 KB)
Bosilca, G., A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra, Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing,” International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, January 2015.  (755.54 KB)
Bouteiller, A., T. Herault, G. Bosilca, P. Du, and J. Dongarra, Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy,” ACM Transactions on Parallel Computing, vol. 1, issue 2, no. 10, pp. 10:1-10:28, January 2015.  (1.14 MB)
D
Dongarra, J., T. Herault, and Y. Robert, Revisiting the Double Checkpointing Algorithm,” University of Tennessee Computer Science Technical Report (LAWN 274), no. ut-cs-13-705, January 2013.  (682.22 KB)