Failure Detection and Propagation in HPC Systems

Title: Failure Detection and Propagation in HPC Systems
Publication Type: Conference Proceedings
Year of Publication: 2016
Authors: Bosilca, G., A. Bouteiller, A. Guermouche, T. Herault, Y. Robert, P. Sens, and J. Dongarra
Conference Name: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16)
Pagination: 27:1-27:11
Date Published: 11-2016
Publisher: IEEE Press
Conference Location: Salt Lake City, Utah
ISBN Number: 978-1-4673-8815-3
Keywords: failure detection, fault-tolerance, MPI
URL: http://dl.acm.org/citation.cfm?id=3014904.3014941