%0 Generic %D 2020 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC) %A James Demmel %A Jack Dongarra %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Michael Mahoney %X The convergence of several unprecedented changes, including formidable new system design constraints and revolutionary levels of heterogeneity, has made it clear that much of the essential software infrastructure of computational science and engineering is, or will soon be, obsolete. Math libraries have historically been in the vanguard of software that must be adapted first to such changes, both because these low-level workhorses are so critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Under the Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC) project, the principal designers of the Linear Algebra PACKage (LAPACK) and the Scalable Linear Algebra PACKage (ScaLAPACK), the combination of which is abbreviated Sca/LAPACK, aim to enhance and update these libraries for the ongoing revolution in processor architecture, system design, and application requirements by incorporating them into a layered package of software components—the BALLISTIC ecosystem—that provides users seamless access to state-of-the-art solver implementations through familiar and improved Sca/LAPACK interfaces. 
%B LAPACK Working Notes %I University of Tennessee %8 2020/07 %G eng %0 Conference Paper %B IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2017 %T Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation %A Mathieu Faverge %A Julien Langou %A Yves Robert %A Jack Dongarra %K Algorithm design and analysis %K Approximation algorithms %K Kernel %K Multicore processing %K Shape %K Software algorithms %K Transforms %X We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Orlando, FL %8 2017-05 %G eng %R 10.1109/IPDPS.2017.46 %0 Generic %D 2016 %T 2016 Dense Linear Algebra Software Packages Survey %A Jack Dongarra %A Jim Demmel %A Julien Langou %A Julie Langou %X The 2016 Dense Linear Algebra Software Packages Survey was administered from January 1st 2016 to April 12 2016. 234 respondents answered the survey. The survey was advertised directly to the Linear Algebra community via our LAPACK/ScaLAPACK forum, NA Digest and we also directly contacted vendors and linear algebra experts. The breakdown of respondents was: 74% researchers or scientists, 25% were Principal Investigators and 25% Software maintainers or System administrators. The goal of the survey was to get the Linear Algebra community opinion and provide input on dense linear algebra software packages, in particular LAPACK, ScaLAPACK, PLASMA and MAGMA. The ultimate purpose of the survey was to improve these libraries to benefit our user community. The survey would allow the team to prioritize the many possible improvements that could be done. We also asked input from users accessing these libraries via 3rd party interfaces, for example MATLAB, Intel’s MKL, Python’s NumPy, AMD's ACML, and many others. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-09 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2015 %T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %K lu factorization %K Numerical algorithms %K QR factorization %K Stability; Performance %X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. 
Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap in terms of floating-point operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performances. %B Journal of Parallel and Distributed Computing %V 85 %P 32-46 %8 2015-11 %G eng %R doi:10.1016/j.jpdc.2015.06.007 %0 Conference Paper %B IPDPS 2014 %D 2014 %T Designing LU-QR Hybrid Solvers for Performance and Stability %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %K plasma %X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. 
LU elimination steps can be very efficiently parallelized, and are twice as cheap in terms of operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the Parsec software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %@ 978-1-4799-3800-1 %G eng %R 10.1109/IPDPS.2014.108 %0 Generic %D 2013 %T Designing LU-QR hybrid solvers for performance and stability %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 282) %I University of Tennessee %8 2013-10 %G eng %0 Journal Article %J Parallel Computing %D 2013 %T Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems %A Jack Dongarra %A Mathieu Faverge %A Thomas Herault %A Mathias Jacquelin %A Julien Langou %A Yves Robert %K Cluster %K Distributed memory %K Hierarchical architecture %K multi-core %K numerical linear algebra %K QR factorization %X This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. 
These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, “communication-avoiding”), it is natural to consider hierarchical trees composed of an “inter-node” tree which acts on top of “intra-node” trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) “TS level” for cache-friendliness, (1) “low-level” for decoupled highly parallel inter-node reductions, (2) “domino level” to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) build insights on how these levels influence performance and interact within each other. Our implementation of the new algorithm with the DAGUE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
%B Parallel Computing %V 39 %P 212-232 %8 2013-05 %G eng %N 4-5 %0 Book Section %B Handbook of Linear Algebra %D 2013 %T LAPACK %A Zhaojun Bai %A James Demmel %A Jack Dongarra %A Julien Langou %A Jenny Wang %X With a substantial amount of new material, the Handbook of Linear Algebra, Second Edition provides comprehensive coverage of linear algebra concepts, applications, and computational software packages in an easy-to-use format. It guides you from the very elementary aspects of the subject to the frontiers of current research. Along with revisions and updates throughout, the second edition of this bestseller includes 20 new chapters. %B Handbook of Linear Algebra %7 Second %I CRC Press %C Boca Raton, FL %@ 9781466507289 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2013 %T Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A José Herrero %A Julien Langou %X Four routines called DPOTF3i, i = a,b,c,d, are presented. DPOTF3i are a novel type of level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and LAPACK routine DPOTRF. Performance of routines DPOTF3i are still increasing when the performance of Level-2 routine DPOTF2 of LAPACK starts decreasing. This is our main result and it implies, due to the use of larger block size nb, that DGEMM, DSYRK, and DTRSM performance also increases! The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers. Thus, our four routines have different register blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of LAPACK POTRF source codes. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is “identical” to Square Block Packed Format (SBPF). “LAPACK” implementations on multicore processors use SBPF. 
Lower BPF is less efficient than upper BPF. Vector inplace transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb as well as results for large n comparing DBPTRF versus DPOTRF. %B ACM Transactions on Mathematical Software (TOMS) %V 39 %8 2013-02 %G eng %N 2 %R 10.1145/2427023.2427026 %0 Journal Article %J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %D 2013 %T Multithreading in the PLASMA Library %A Jakub Kurzak %A Piotr Luszczek %A Asim YarKhan %A Mathieu Faverge %A Julien Langou %A Henricus Bouwmeester %A Jack Dongarra %E Mohamed Ahmed %E Reda Ammar %E Sanguthevar Rajasekaran %K plasma %B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %I Taylor & Francis %8 2013-00 %G eng %0 Conference Proceedings %B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium %D 2012 %T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems %A Jack Dongarra %A Mathieu Faverge %A Thomas Herault %A Julien Langou %A Yves Robert %B IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium %I IEEE Computer Society Press %C Shanghai, China %8 2012-05 %G eng %0 Generic %D 2012 %T How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE %A Julien Langou %A Bill Hoffman %A Brad King %B University of Tennessee Computer Science Technical Report (also LAWN 270) %8 2012-07 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A 
Pierre Lemarinier %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1432-1441 %8 2011-05 %G eng %0 Generic %D 2011 %T Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems %A Jack Dongarra %A Mathieu Faverge %A Thomas Herault %A Julien Langou %A Yves Robert %K magma %K plasma %B University of Tennessee Computer Science Technical Report (also LAWN 257) %8 2011-10 %G eng %0 Journal Article %J IEEE/ACS AICCSA 2011 %D 2011 %T LU Factorization for Accelerator-Based Systems %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Julien Langou %A Hatem Ltaief %A Stanimire Tomov %K magma %K morse %B IEEE/ACS AICCSA 2011 %C Sharm-El-Sheikh, Egypt %8 2011-12 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2011 %T QCG-OMPI: MPI Applications on Grids %A Emmanuel Agullo %A Camille Coti %A Thomas Herault %A Julien Langou %A Sylvain Peyronnet %A A.
Rezmerita %A Franck Cappello %A Jack Dongarra %B Future Generation Computer Systems %V 27 %P 357-369 %8 2011-01 %G eng %0 Journal Article %J Parallel Computing (to appear) %D 2010 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %B Parallel Computing (to appear) %8 2010-00 %G eng %0 Generic %D 2010 %T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemarinier %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-10-660 %8 2010-09 %G eng %0 Generic %D 2010 %T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemarinier %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K plasma %B Innovative Computing Laboratory Technical Report %8 2010-00 %G eng %0 Journal Article %J Future Generation Computer Systems %D 2010 %T QCG-OMPI: MPI Applications on Grids %A Emmanuel Agullo %A Camille Coti %A Thomas Herault %A Julien Langou %A Sylvain Peyronnet %A A.
Rezmerita %A Franck Cappello %A Jack Dongarra %B Future Generation Computer Systems %V 27 %P 357-369 %8 2010-03 %G eng %0 Conference Proceedings %B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224) %D 2010 %T QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment %A Emmanuel Agullo %A Camille Coti %A Jack Dongarra %A Thomas Herault %A Julien Langou %B 24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224) %C Atlanta, GA %8 2010-04 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2010 %T Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution, and Inversion %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A Julien Langou %B ACM Transactions on Mathematical Software (TOMS) %C Atlanta, GA %V 37 %8 2010-04 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2010 %T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A Julien Langou %B ACM Transactions on Mathematical Software (TOMS) %V 37 %8 2010-04 %G eng %0 Journal Article %J Computer Physics Communications %D 2009 %T Accelerating Scientific Computations with Mixed Precision Algorithms %A Marc Baboulin %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %X On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. 
The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. %B Computer Physics Communications %V 180 %P 2526-2533 %8 2009-12 %G eng %N 12 %R https://doi.org/10.1016/j.cpc.2008.11.005 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2009 %T Algorithmic Based Fault Tolerance Applied to High Performance Computing %A Jack Dongarra %A George Bosilca %A Remi Delmas %A Julien Langou %B Journal of Parallel and Distributed Computing %V 69 %P 410-416 %8 2009-00 %G eng %0 Journal Article %J Parallel Computing %D 2009 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B Parallel Computing %V 35 %P 38-53 %8 2009-00 %G eng %0 Journal Article %J Numerical Linear Algebra with Applications %D 2009 %T Computing the Conditioning of the Components of a Linear Least-squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B Numerical Linear Algebra with Applications %V 16 %P 517-533 %8 2009-00 %G eng %0 Generic %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaief %A Piotr Luszczek %A Rajib Nath %A Stanimire Tomov %A Asim YarKhan %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, OR %8 2009-11 %G eng %0 Conference Proceedings %B Journal of Physics: Conference Series %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A
Hatem Ltaief %A Piotr Luszczek %A Stanimire Tomov %K magma %K plasma %B Journal of Physics: Conference Series %V 180 %8 2009-00 %G eng %0 Journal Article %J in Cyberinfrastructure Technologies and Applications %D 2009 %T Parallel Dense Linear Algebra Software in the Multicore Era %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %E Junwei Cao %K plasma %B in Cyberinfrastructure Technologies and Applications %I Nova Science Publishers, Inc. %P 9-24 %8 2009-00 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2009 %T The Problem with the Linpack Benchmark Matrix Generator %A Julien Langou %A Jack Dongarra %K hpl %B International Journal of High Performance Computing Applications %V 23 %P 5-14 %8 2009-00 %G eng %0 Journal Article %J ACM TOMS (to appear) %D 2009 %T Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion %A Fred G. Gustavson %A Jerzy Wasniewski %A Jack Dongarra %A Julien Langou %B ACM TOMS (to appear) %8 2009-00 %G eng %0 Generic %D 2008 %T Algorithmic Based Fault Tolerance Applied to High Performance Computing %A George Bosilca %A Remi Delmas %A Jack Dongarra %A Julien Langou %B University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205) %8 2008-01 %G eng %0 Journal Article %J VECPAR '08, High Performance Computing for Computational Science %D 2008 %T Computing the Conditioning of the Components of a Linear Least Squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B VECPAR '08, High Performance Computing for Computational Science %C Toulouse, France %8 2008-01 %G eng %0 Journal Article %J in High Performance Computing and Grids in Action %D 2008 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B in
High Performance Computing and Grids in Action %I IOS Press %C Amsterdam %8 2008-01 %G eng %0 Conference Proceedings %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %D 2008 %T Interior State Computation of Nano Structures %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %C Trondheim, Norway %8 2008-05 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2008 %T Parallel Tiled QR Factorization for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %B Concurrency and Computation: Practice and Experience %V 20 %P 1573-1590 %8 2008-01 %G eng %0 Generic %D 2008 %T The Problem with the Linpack Benchmark Matrix Generator %A Jack Dongarra %A Julien Langou %B University of Tennessee Computer Science Technical Report, UT-CS-08-621 (also LAPACK Working Note 206) %8 2008-06 %G eng %0 Generic %D 2007 %T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report %8 2007-01 %G eng %0 Generic %D 2007 %T Computing the Conditioning of the Components of a Linear Least Squares Solution %A Marc Baboulin %A Jack Dongarra %A Serge Gratton %A Julien Langou %B University of Tennessee Computer Science Technical Report %8 2007-01 %G eng %0 Journal Article %J in Petascale Computing: Algorithms and Applications (to appear) %D 2007 %T Disaster Survival Guide in Petascale Computing: An Algorithmic Approach %A Jack Dongarra %A Zizhong Chen %A George Bosilca %A Julien Langou %B in Petascale Computing: Algorithms and Applications (to appear) %I Chapman & Hall - CRC Press %8 2007-00 %G eng %0 Journal Article %J In High Performance Computing and Grids 
in Action (to appear) %D 2007 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B In High Performance Computing and Grids in Action (to appear) %I IOS Press %C Amsterdam %8 2007-00 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2007 %T Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems %A Alfredo Buttari %A Jack Dongarra %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Jakub Kurzak %B International Journal of High Performance Computing Applications (to appear) %8 2007-08 %G eng %0 Generic %D 2007 %T Parallel Tiled QR Factorization for Multicore Architectures %A Alfredo Buttari %A Julien Langou %A Jakub Kurzak %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-598 (also LAPACK Working Note 190) %8 2007-00 %G eng %0 Journal Article %J SIAM SISC (to appear) %D 2007 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A Julien Langou %A Zizhong Chen %A George Bosilca %A Jack Dongarra %B SIAM SISC (to appear) %8 2007-05 %G eng %0 Journal Article %J International Journal of Computational Science and Engineering %D 2006 %T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Jack Dongarra %A Andrew Canning %A Lin-Wang Wang %B International Journal of Computational Science and Engineering %V 2 %P 205-212 %8 2006-00 %G eng %0 Journal Article %J University of Tennessee Computer Science Tech Report %D 2006 %T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %K iter-ref %B University of
Tennessee Computer Science Tech Report %8 2006-04 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T The Impact of Multicore on Math Software %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %K plasma %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Conference Proceedings %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %D 2006 %T Performance evaluation of eigensolvers in nano-structure computations %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %K doe-nano %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %8 2006-01 %G eng %0 Journal Article %J J. Phys.: Conf. Ser. 46 %D 2006 %T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures %A Alex Zunger %A Alberto Franceschetti %A Gabriel Bester %A Wesley B. Jones %A Kwiseon Kim %A Peter A. Graf %A Lin-Wang Wang %A Andrew Canning %A Osni Marques %A Christof Voemel %A Jack Dongarra %A Julien Langou %A Stanimire Tomov %K DOE_NANO %B J. Phys.: Conf. Ser. 46 %V 46 %P 292-298 %8 2006-01 %G eng %R 10.1088/1742-6596/46/1/040 %0 Journal Article %J PARA 2006 %D 2006 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries %A James Demmel %A Jack Dongarra %A B. Parlett %A William Kahan %A Ming Gu %A David Bindel %A Yozo Hida %A Xiaoye Li %A Osni Marques %A Jason E.
Riedy %A Christof Voemel %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Julie Langou %A Stanimire Tomov %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J IBM Journal of Research and Development %D 2006 %T Self Adapting Numerical Software SANS Effort %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Victor Eijkhout %A Graham Fagg %A Erika Fuentes %A Julien Langou %A Piotr Luszczek %A Jelena Pjesivac-Grbovic %A Keith Seymour %A Haihang You %A Sathish Vadhiyar %K gco %B IBM Journal of Research and Development %V 50 %P 223-238 %8 2006-01 %G eng %0 Conference Proceedings %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %D 2006 %T Towards bulk based preconditioning for quantum dot computations %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %K doe-nano %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %8 2006-01 %G eng %0 Conference Proceedings %B Proceedings of 5th International Conference on Computational Science (ICCS) %D 2005 %T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %E V. S. Sunderam %E Geert Dick van Albada %E Peter M.
Sloot %E Jack Dongarra %K doe-nano %B Proceedings of 5th International Conference on Computational Science (ICCS) %I Springer's Lecture Notes in Computer Science %C Atlanta, GA, USA %P 317-325 %8 2005-01 %G eng %0 Journal Article %J International Journal of Computational Science and Engineering (to appear) %D 2005 %T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %B International Journal of Computational Science and Engineering (to appear) %8 2005-01 %G eng %0 Conference Proceedings %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %D 2005 %T Fault Tolerant High Performance Computing by a Coding Approach %A Zizhong Chen %A Graham Fagg %A Edgar Gabriel %A Julien Langou %A Thara Angskun %A George Bosilca %A Jack Dongarra %K ftmpi %K grads %K lacsi %K sans %B Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear) %C Chicago, Illinois %8 2005-01 %G eng %0 Conference Proceedings %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %D 2005 %T Hash Functions for Datatype Signatures in MPI %A George Bosilca %A Jack Dongarra %A Graham Fagg %A Julien Langou %E Beniamino Di Martino %K ftmpi %B Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI %I Springer-Verlag Berlin %C Sorrento (Naples), Italy %V 3666 %P 76-83 %8 2005-09 %G eng %0 Journal Article %J Journal of Physics: Conference Series %D 2005 %T NanoPSE: A Nanoscience Problem Solving Environment for Atomistic Electronic Structure of Semiconductor Nanostructures %A Wesley B. Jones %A Gabriel Bester %A Andrew Canning %A Alberto Franceschetti %A Peter A. 
Graf %A Kwiseon Kim %A Julien Langou %A Lin-Wang Wang %A Jack Dongarra %A Alex Zunger %X Researchers at the National Renewable Energy Laboratory and their collaborators have developed over the past ~10 years a set of algorithms for an atomistic description of the electronic structure of nanostructures, based on plane-wave pseudopotentials and configuration interaction. The present contribution describes the first step in assembling these various codes into a single, portable, integrated set of software packages. This package is part of an ongoing research project in the development stage. Components of NanoPSE include codes for atomistic nanostructure generation and passivation, valence force field model for atomic relaxation, code for potential field generation, empirical pseudopotential method solver, strained linear combination of bulk bands method solver, configuration interaction solver for excited states, selection of linear algebra methods, and several inverse band structure solvers. Although not available for general distribution at this time as it is being developed and tested, the design goal of the NanoPSE software is to provide a software context for collaboration. The software package is enabled by fcdev, an integrated collection of best practice GNU software for open source development and distribution augmented to better support FORTRAN. %B Journal of Physics: Conference Series %P 277-282 %8 2005-06 %G eng %U https://iopscience.iop.org/article/10.1088/1742-6596/16/1/038/meta %N 16 %R https://doi.org/10.1088/1742-6596/16/1/038 %0 Journal Article %J Journal of Computational Acoustics (to appear) %D 2005 %T On the Parallel Solution of Large Industrial Wave Propagation Problems %A Luc Giraud %A Julien Langou %A G. 
Sylvand %B Journal of Computational Acoustics (to appear) %8 2005-01 %G eng %0 Generic %D 2005 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Julien Langou %K ft-la %B University of Tennessee Computer Science Department Technical Report, UT-CS-04-538 %8 2005-00 %G eng %0 Journal Article %J Numerische Mathematik %D 2005 %T Rounding Error Analysis of the Classical Gram-Schmidt Orthogonalization Process %A Luc Giraud %A Julien Langou %A Miroslav Rozložník %A Jasper van den Eshof %B Numerische Mathematik %V 101 %P 87-100 %8 2005-01 %G eng %0 Generic %D 2004 %T Performance Optimization and Modeling of Blocked Sparse Kernels %A Alfredo Buttari %A Victor Eijkhout %A Julien Langou %A Salvatore Filippone %K sans %B ICL Technical Report %8 2004-00 %G eng %0 Generic %D 2004 %T Recovery Patterns for Iterative Methods in a Parallel Unstable Environment %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Julien Langou %B ICL Technical Report %8 2004-01 %G eng