%0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %D 2015 %T Cholesky Across Accelerators %A Asim YarKhan %A Azzam Haidar %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %I IEEE %C Elizabeth, NJ %8 2015-08 %G eng

%0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A George Bosilca %A Thomas Herault %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng

%0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications %D 2015 %T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization %A Azzam Haidar %A Asim YarKhan %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with a degree of parallelism between that of GPUs and multicore-CPUs. Additionally, effectively using distributed-memory nodes introduces another level of complexity, where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets an appropriately sized workload grain. Our task-programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization. %B 17th IEEE International Conference on High Performance Computing and Communications %C Newark, NJ %8 2015-08 %G eng
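To make the task-superscalar idea in the two Cholesky entries above concrete, here is a minimal illustrative sketch (not code from the papers): a right-looking tile Cholesky written as a serial loop of task insertions, the style that runtimes such as PaRSEC, QUARK, or StarPU turn into a parallel DAG. The insert_task() call, the kernel names, and the immediate-execution stub below are hypothetical stand-ins, not the API of any of these runtimes.

/* Hypothetical task-insertion API: a real task-superscalar runtime would
 * record the arguments and their access modes (INPUT/INOUT) and defer
 * execution; running the task immediately preserves the serial semantics. */
#include <stddef.h>

typedef void (*tile_kernel)(double *a, double *b, double *c, int nb);

static void insert_task(tile_kernel k, double *a, double *b, double *c, int nb)
{
    k(a, b, c, nb);                /* serial stand-in for asynchronous submission */
}

/* Tile kernels; a real code would wrap dpotrf/dtrsm/dsyrk/dgemm here. */
static void potrf_tile(double *akk, double *u1, double *u2, int nb)
{ (void)akk; (void)u1; (void)u2; (void)nb; /* factor diagonal tile (INOUT) */ }
static void trsm_tile(double *akk, double *amk, double *u, int nb)
{ (void)akk; (void)amk; (void)u; (void)nb; /* akk INPUT, amk INOUT */ }
static void syrk_tile(double *amk, double *amm, double *u, int nb)
{ (void)amk; (void)amm; (void)u; (void)nb; /* amk INPUT, amm INOUT */ }
static void gemm_tile(double *amk, double *ank, double *amn, int nb)
{ (void)amk; (void)ank; (void)amn; (void)nb; /* amk, ank INPUT, amn INOUT */ }

/* A is an nt-by-nt grid of nb-by-nb tiles, with A[m*nt + k] = tile (m,k). */
void tile_cholesky(double **A, int nt, int nb)
{
    for (int k = 0; k < nt; ++k) {
        insert_task(potrf_tile, A[k*nt + k], NULL, NULL, nb);
        for (int m = k + 1; m < nt; ++m)
            insert_task(trsm_tile, A[k*nt + k], A[m*nt + k], NULL, nb);
        for (int m = k + 1; m < nt; ++m) {
            insert_task(syrk_tile, A[m*nt + k], A[m*nt + m], NULL, nb);
            for (int n = k + 1; n < m; ++n)
                insert_task(gemm_tile, A[m*nt + k], A[n*nt + k], A[m*nt + n], nb);
        }
    }
}

The developer-visible code stays serial; the resource specialization described in the HPCC entry would come from the runtime choosing a CPU, GPU, or coprocessor variant of each kernel and an appropriate tile size per resource.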
%0 Conference Paper %B International Workshop on OpenCL %D 2014 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B International Workshop on OpenCL %C Bristol University, England %8 2014-05 %G eng

%0 Generic %D 2014 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B ICL Technical Report %I University of Tennessee %8 2014-11 %G eng

%0 Conference Paper %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %D 2014 %T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors %A Azzam Haidar %A Chongxiao Cao %A Ichitaro Yamazaki %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both performance and ease of use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication, and that many new and surprising usage scenarios are possible that rival those available after decades of software development on CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process, and we show how a large portion of the available efficiency is lost if the tuning is not done correctly. %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %I IEEE %C New Orleans, LA %8 2014-11 %G eng %R 10.1109/ScalA.2014.8
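As a small companion to the two OpenCL entries above, the following sketch shows the kind of standard OpenCL 1.x host-side query that portable code uses when picking per-device work-group sizes, one of the porting-time tuning steps the ScalA '14 paper argues is needed to keep the performance-portability metric high. It is illustrative only and not taken from clMAGMA.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plats[8];
    cl_uint nplat = 0;
    if (clGetPlatformIDs(8, plats, &nplat) != CL_SUCCESS || nplat == 0)
        return 1;

    for (cl_uint p = 0; p < nplat; ++p) {
        cl_device_id devs[16];
        cl_uint ndev = 0;
        if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 16, devs, &ndev) != CL_SUCCESS)
            continue;                      /* platform with no usable devices */
        for (cl_uint d = 0; d < ndev; ++d) {
            char name[256];
            size_t max_wg = 0;
            cl_uint cus = 0;
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                            sizeof(max_wg), &max_wg, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(cus), &cus, NULL);
            /* A blocked GEMM-like kernel would size its work-groups to stay
             * under max_wg while producing enough groups to occupy all
             * compute units on this particular device. */
            printf("%s: max work-group %zu, %u compute units\n",
                   name, max_wg, (unsigned)cus);
        }
    }
    return 0;
}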
%0 Conference Paper %B IPDPS 2014 %D 2014 %T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment %A Azzam Haidar %A Chongxiao Cao %A Jack Dongarra %A Piotr Luszczek %A Stanimire Tomov %K algorithms %K Computer science %K CUDA %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %K nVidia %K Tesla K20 %K Tesla M2090 %X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng
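A minimal sketch of the "appropriately sized workload grain per resource" idea from the IPDPS 2014 entry above. The resource_kind enumeration, the grain values, and the kernel-variant dispatch are hypothetical illustrations under assumed tile sizes, not values or APIs from the paper.

#include <stdio.h>

enum resource_kind { RES_CPU_CORE, RES_XEON_PHI, RES_GPU };

/* One logical operation, three resource-specialized implementations. */
typedef void (*dgemm_variant)(int n, const double *a, const double *b, double *c);
static void dgemm_cpu(int n, const double *a, const double *b, double *c)
{ (void)n; (void)a; (void)b; (void)c; /* e.g. a vendor CBLAS call          */ }
static void dgemm_phi(int n, const double *a, const double *b, double *c)
{ (void)n; (void)a; (void)b; (void)c; /* e.g. an offloaded coprocessor BLAS */ }
static void dgemm_gpu(int n, const double *a, const double *b, double *c)
{ (void)n; (void)a; (void)b; (void)c; /* e.g. a GPU BLAS call               */ }

struct resource {
    enum resource_kind kind;
    int grain;               /* tile size this resource digests well     */
    dgemm_variant gemm;      /* kernel variant the runtime dispatches to */
};

/* Hypothetical grains: a GPU wants far more parallelism per task than a
 * single CPU core, with a Xeon Phi somewhere in between. */
static struct resource make_resource(enum resource_kind k)
{
    switch (k) {
    case RES_GPU:      return (struct resource){ k, 2048, dgemm_gpu };
    case RES_XEON_PHI: return (struct resource){ k,  960, dgemm_phi };
    default:           return (struct resource){ k,  192, dgemm_cpu };
    }
}

int main(void)
{
    struct resource r[3] = { make_resource(RES_CPU_CORE),
                             make_resource(RES_XEON_PHI),
                             make_resource(RES_GPU) };
    for (int i = 0; i < 3; ++i)
        printf("resource %d: grain %d\n", (int)r[i].kind, r[i].grain);
    return 0;
}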
%0 Generic %D 2013 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B University of Tennessee Technical Report (Lawn 275) %I University of Tennessee %8 2013-03 %G eng
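To illustrate the hybridization methodology described in the clMAGMA abstracts above, here is a schematic sketch of the classic split used by MAGMA-style one-sided factorizations: the small, latency-bound panel is factored on the CPU while the large, compute-bound trailing-matrix update runs on the accelerator. All function names below are hypothetical placeholders, not clMAGMA routines, and the loop is only a skeleton of the scheduling pattern.

/* Placeholder stubs standing in for host<->device transfers, a LAPACK
 * panel factorization on the CPU, and BLAS-3 updates on the device. */
static void copy_panel_to_host(int j, int jb)                    { (void)j; (void)jb; }
static void panel_factor_cpu(int j, int jb)                      { (void)j; (void)jb; }
static void copy_panel_to_device(int j, int jb)                  { (void)j; (void)jb; }
static void update_next_panel_device(int j, int jb, int n)       { (void)j; (void)jb; (void)n; }
static void update_rest_of_trailing_device(int j, int jb, int n) { (void)j; (void)jb; (void)n; }

/* A right-looking hybrid one-sided factorization over an n-by-n matrix
 * held on the device, with panel width nb. */
void hybrid_factorization(int n, int nb)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? (n - j) : nb;

        copy_panel_to_host(j, jb);          /* device -> host transfer      */
        panel_factor_cpu(j, jb);            /* small, latency-bound kernel  */
        copy_panel_to_device(j, jb);        /* host -> device transfer      */

        /* Update the next panel first so its transfer and CPU factorization
         * can overlap with the rest of the large BLAS-3 update (look-ahead). */
        update_next_panel_device(j, jb, n);
        update_rest_of_trailing_device(j, jb, n);
    }
}

Under this split, the compute-bound trailing update dominates the flops, which is why the tuning of the device-side BLAS kernels emphasized in the OpenCL entries above largely determines the achievable performance.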