Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures

Title: Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures
Publication Type: Tech Report
Year of Publication: 2013
Authors: Jia, Y., P. Luszczek, and J. Dongarra
Technical Report Series Title: UT-CS-13-712
Date Published: 06-2013
Institution: University of Tennessee Computer Science Technical Report
Other Numbers: lawn279

Graphics Processing Units (GPUs) are gaining widespread use in scientific computing owing to the performance boost they bring to computation-intensive applications. The typical configuration integrates GPUs and CPUs in the same system: the CPUs handle the control flow and part of the computational workload, while the GPUs serve as accelerators that carry out the bulk of the data-parallel workload. In this paper we design and implement a soft-error-resilient Hessenberg reduction algorithm for GPU-based hybrid platforms. Our design combines algorithm-based fault tolerance (ABFT), diskless checkpointing, and reverse computation. We detect and correct soft errors on-line, rather than deferring detection and correction to the end of the factorization. By utilizing the idle time of the CPUs and overlapping the host-side and GPU-side workloads, we minimize the observed overhead. Experimental results validate our design philosophy: our algorithm introduces less than 2% performance overhead compared to the non-fault-tolerant hybrid Hessenberg reduction algorithm.
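The abstract does not spell out the checksum encoding, so the following is only a minimal sketch of the general idea behind algorithm-based fault tolerance, not the authors' scheme. It assumes a small dense matrix held as a list of rows, at most one corrupted entry, and checksums that are themselves intact. The names `checksums` and `detect_and_correct` are hypothetical. Row and column sums are recorded up front. After a simulated transient error, the mismatching row and column locate the bad entry, and the size of the mismatch restores it.

```python
def checksums(a):
    """Return (row_sums, col_sums) for a dense matrix stored as a list of rows."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def detect_and_correct(a, row_sums, col_sums, tol=1e-9):
    """Locate and repair at most one corrupted entry of `a` in place.

    Returns the (row, col) index of the repaired entry, or None when the
    recomputed checksums match the stored ones. Assumption of this sketch:
    the stored checksums were not hit by the error.
    """
    new_rows, new_cols = checksums(a)
    bad_rows = [i for i, (s, t) in enumerate(zip(new_rows, row_sums))
                if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(new_cols, col_sums))
                if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one corrupted entry; this sketch cannot repair it")
    i, j = bad_rows[0], bad_cols[0]
    # The row-sum mismatch equals the error added to entry (i, j).
    a[i][j] -= new_rows[i] - row_sums[i]
    return i, j


# Hypothetical demo data: values chosen so the arithmetic is exact.
a = [[4.0, 1.0, 2.0],
     [3.0, 5.0, 1.0],
     [2.0, 2.0, 6.0]]
original = [row[:] for row in a]
row_sums, col_sums = checksums(a)

a[1][2] += 64.0  # simulated transient error in one entry
location = detect_and_correct(a, row_sums, col_sums)
clean_pass = detect_and_correct(a, row_sums, col_sums)  # second pass finds nothing
```

In a factorization such as the one described above, the checksums would have to be carried through each update step so that they stay valid as the matrix changes. This static example leaves that out, along with the checkpointing and reverse computation the abstract mentions.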