Publications

Filters: Author is Hongyang Sun
2016
Benoit, A., A. Cavelan, Y. Robert, and H. Sun, “Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,” ACM Transactions on Parallel Computing, August 2016. DOI: 10.1145/2897189
Benoit, A., A. Cavelan, Y. Robert, and H. Sun, “Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,” 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016. DOI: 10.1109/IPDPS.2016.39
2017
Benoit, A., F. Cappello, A. Cavelan, Y. Robert, and H. Sun, “Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,” 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017. DOI: 10.1145/3086157.3086162
Benoit, A., A. Cavelan, V. Le Fèvre, Y. Robert, and H. Sun, “Towards Optimal Multi-Level Checkpointing,” IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017. DOI: 10.1109/TC.2016.2643660
2018
Benoit, A., A. Cavelan, F. Cappello, P. Raghavan, Y. Robert, and H. Sun, “Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,” Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018. DOI: 10.1016/j.jpdc.2018.08.002
Benoit, A., A. Cavelan, Y. Robert, and H. Sun, “Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,” Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
2019
Aupy, G., A. Gainaru, V. Honoré, P. Raghavan, Y. Robert, and H. Sun, “Reservation Strategies for Stochastic Jobs,” 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), Rio de Janeiro, Brazil, IEEE Computer Society Press, May 2019.