Publications
“On the Combination of Silent Error Detection and Checkpointing,”
UT-CS-13-710: University of Tennessee Computer Science Technical Report, June 2013.
“Optimal Checkpointing Period: Time vs. Energy,”
University of Tennessee Computer Science Technical Report (also LAWN 281), no. ut-eecs-13-718: University of Tennessee, October 2013.
“Efficient checkpoint/verification patterns for silent error detection,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, May 2014.
“Efficient Checkpoint/Verification Patterns,”
International Journal of High Performance Computing Applications, July 2015.
DOI: 10.1177/1094342015594531
“Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
DOI: 10.1145/2897189
“Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,”
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
DOI: 10.1109/IPDPS.2016.39
“Scheduling Computational Workflows on Failure-prone Platforms,”
International Journal of Networking and Computing, vol. 6, no. 1, pp. 2-26, 2016.
“Co-Scheduling Algorithms for Cache-Partitioned Systems,”
19th Workshop on Advances in Parallel and Distributed Computational Models, Orlando, FL, IEEE Computer Society Press, May 2017.
DOI: 10.1109/IPDPSW.2017.60
“Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
DOI: 10.1145/3086157.3086162
“Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, IEEE Computer Society Press, June 2017.
DOI: 10.1145/3086157.3086165
“Resilient Co-Scheduling of Malleable Applications,”
International Journal of High Performance Computing Applications (IJHPCA), May 2017.
DOI: 10.1177/1094342017704979
“Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
DOI: 10.1109/TC.2016.2643660
“Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,”
Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.
DOI: 10.1016/j.jpdc.2018.08.002
“Co-Scheduling Amdahl Applications on Cache-Partitioned Systems,”
International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 123–138, January 2018.
DOI: 10.1177/1094342017710806
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
Cluster 2018, Belfast, UK, IEEE Computer Society Press, September 2018.
“Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,”
Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
“A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
“Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 2-27.
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1221-1239, November 2019.
DOI: 10.1177/1094342019846956
“Replication is More Efficient Than You Think,”
The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19), Denver, CO, ACM Press, November 2019.
“