|Title||Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures|
|Publication Type||Tech Report|
|Year of Publication||2013|
|Authors||Jia, Y., P. Luszczek, and J. Dongarra|
|Technical Report Series Title||UT-CS-13-712|
|Institution||University of Tennessee Computer Science Technical Report|
Graphics Processing Units (GPUs) are gaining widespread use in scientific computing owing to the performance boost they bring to computation-intensive applications. The typical configuration integrates GPUs and CPUs in the same system: the CPUs handle the control flow and part of the computation workload, while the GPUs serve as accelerators that carry out the bulk of the data-parallel compute workload. In this paper we design and implement a soft-error-resilient Hessenberg reduction algorithm on GPU-based hybrid platforms. Our design employs algorithm-based fault tolerance, diskless checkpointing, and reverse computation. We detect and correct soft errors on-line, rather than delaying detection and correction to the end of the factorization. By utilizing the idle time of the CPUs and overlapping the host-side and GPU-side workloads, we minimize the observed overhead. Experimental results validate our design philosophy: our algorithm introduces less than 2% performance overhead compared to the non-fault-tolerant hybrid Hessenberg reduction algorithm.
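The abstract does not spell out the checksum scheme, so the following is only a minimal sketch of the general idea behind algorithm-based fault tolerance, not the report's implementation. It assumes a single corrupted data entry in a static matrix and uses plain row and column sums; the function names (`add_checksums`, `locate_error`, `correct_single_error`) are hypothetical. The actual algorithm must additionally keep its checksums valid through the Hessenberg transformations, which this sketch does not attempt.

```python
def add_checksums(a):
    """Return a copy of matrix `a` with an extra checksum column and checksum row."""
    rows = [list(row) + [sum(row)] for row in a]   # append row sums
    rows.append([sum(col) for col in zip(*rows)])  # append column sums
    return rows


def locate_error(full, tol=1e-9):
    """Return (i, j, delta) for a single corrupted data entry, or None if clean.

    Assumption for this sketch: at most one data entry is corrupted and the
    checksums themselves are intact.
    """
    n = len(full) - 1      # number of data rows
    m = len(full[0]) - 1   # number of data columns
    row_diffs = [(i, sum(full[i][:m]) - full[i][m]) for i in range(n)]
    col_diffs = [(j, sum(full[i][j] for i in range(n)) - full[n][j])
                 for j in range(m)]
    bad_rows = [(i, d) for i, d in row_diffs if abs(d) > tol]
    bad_cols = [(j, d) for j, d in col_diffs if abs(d) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # The mismatching row and column intersect at the corrupted entry.
        return bad_rows[0][0], bad_cols[0][0], bad_rows[0][1]
    raise ValueError("error pattern not handled by this sketch")


def correct_single_error(full, tol=1e-9):
    """Repair a single corrupted entry in place; return True if a fix was applied."""
    hit = locate_error(full, tol)
    if hit is None:
        return False
    i, j, delta = hit
    full[i][j] -= delta    # subtract the detected discrepancy
    return True
```

A soft error can be simulated by perturbing one entry of the checksummed matrix and then calling `correct_single_error`. The row mismatch and the column mismatch together pinpoint the entry, and the size of the mismatch gives the correction.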