%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K LU factorization
%K multicore
%K parallel
%K PLASMA
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared-memory architectures. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy are analyzed.
%V 27
%P 1292-1309
%8 2015-04
%G eng
%N 5
%R 10.1002/cpe.3306

%0 Conference Paper
%B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014
%D 2014
%T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%X Graphics processing units (GPUs) have brought huge performance improvements to the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids the data-transfer bottlenecks frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before execution and then dynamically balances workloads during execution. We also present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt to any architecture using the manufacturer's specifications of the underlying machine. We illustrate our method with LU factorization, showing that our approach combined with a communication-avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU by up to 2.4x compared to the corresponding routine in MKL using 24 cores. Comparisons with MAGMA also show significant improvements.
%8 2014-05
%G eng
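[Editor's note: the survey above applies iterative refinement to recover accuracy when a faster but less stable pivoting strategy is used. The following is a minimal reader's sketch of classical iterative refinement in Python/NumPy, not the paper's PLASMA implementation; the solver choice, tolerance, and problem size are illustrative assumptions.]

# Minimal sketch of iterative refinement, assuming a one-time LU
# factorization whose solution is then corrected with residuals computed
# in the working precision. Not the authors' code; sizes are illustrative.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_with_refinement(A, b, max_iters=10, tol=1e-12):
    lu, piv = lu_factor(A)          # LU with partial pivoting (LAPACK getrf)
    x = lu_solve((lu, piv), b)      # initial solve
    for _ in range(max_iters):
        r = b - A @ x               # residual in working precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        x += lu_solve((lu, piv), r) # correction reuses the factorization
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = solve_with_refinement(A, b)
print(np.linalg.norm(b - A @ x, np.inf))  # refined residual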
%0 Generic
%D 2013
%T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%X Graphics processing units (GPUs) have brought huge performance improvements to the scientific and numerical fields. We present an efficient hybrid CPU/GPU computing approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids the data-transfer bottlenecks frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before execution and then dynamically balances workloads during execution. We also present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt to any architecture using the manufacturer's specifications of the underlying machine. We illustrate our method with LU factorization, showing that our approach combined with a communication-avoiding LU algorithm is efficient. For example, our experiments on high-end hybrid CPU/GPU systems show that our dynamically balanced synchronization-avoiding LU is both multicore and GPU scalable. Comparisons with state-of-the-art libraries such as MKL (for multicore) and MAGMA (for hybrid systems) are provided, demonstrating significant performance improvements. The approach is applicable to other linear algebra algorithms. The scheduling mechanisms and tuning models can be incorporated, respectively, into dynamic runtime systems/schedulers and autotuning frameworks for hybrid CPU/MIC/GPU architectures.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Conference Proceedings
%B Proc. of the International Conference on Computational Science (ICCS)
%D 2012
%T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines
%A Marc Baboulin
%A Simplice Donfack
%A Jack Dongarra
%A Laura Grigori
%A Adrien Rémy
%A Stanimire Tomov
%K MAGMA
%V 9
%P 17-26
%8 2012-06
%G eng

%0 Generic
%D 2012
%T Performance evaluation of LU factorization through hardware counter measurements
%A Simplice Donfack
%A Stanimire Tomov
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2012-10
%G eng
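[Editor's note: the hybrid CPU/GPU entries above describe choosing the initial CPU share of work from a theoretical model before execution and then rebalancing dynamically at run time. The sketch below illustrates only the flavor of such a static split, using relative peak rates; the formula and the GFLOP/s figures are illustrative assumptions, not the authors' exact model.]

# Hedged sketch: pick an initial fraction of the trailing-matrix work for
# the CPUs from manufacturer peak rates, leaving run-time rebalancing to
# correct the estimate. The split rule below is an assumed simplification.
def initial_cpu_fraction(cpu_gflops: float, gpu_gflops: float) -> float:
    """Fraction of work to assign to the CPUs before execution starts."""
    return cpu_gflops / (cpu_gflops + gpu_gflops)

# Example with rough, illustrative peak double-precision rates for the
# hardware mentioned in the 2014 paper (24-core AMD Opteron 6172 system,
# one Fermi-class Tesla GPU); the numbers are ballpark, not measurements.
frac = initial_cpu_fraction(200.0, 515.0)
print(f"assign ~{frac:.0%} of the work to the CPUs initially")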