%0 Conference Paper %B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis %D 2023 %T Elastic deep learning through resilient collective operations %A Li, Jiali %A Bosilca, George %A Bouteiller, Aurélien %A Nicolae, Bogdan %X We present a robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI's resilience capabilities, namely User-Level Failure Mitigation (ULFM), this novel approach promotes efficient and lightweight failure management and encourages smooth scaling in volatile computational settings. The proposed ULFM MPI-centered mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the capability of the MPI extension to deal with resiliency, and promote ULFM as an effective technique for fault management, minimizing downtime, and thereby enhancing the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications. %B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis %I ACM %C Denver, CO %8 2023-11 %@ 9798400707858 %G eng %U https://dl.acm.org/doi/abs/10.1145/3624062.3626080 %R 10.1145/3624062.3626080 %0 Journal Article %J International Journal of Networking and Computing %D 2022 %T Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms %A George Bosilca %A Bouteiller, Aurélien %A Herault, Thomas %A Le Fèvre, Valentin %A Robert, Yves %A Dongarra, Jack %X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications.
We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementation of these algorithms over a task-based runtime system, PaRSEC, and show the advantages and limitations of each approach in a real implementation. %B International Journal of Networking and Computing %V 12 %P 26 - 46 %8 2022-01 %G eng %U https://www.jstage.jst.go.jp/article/ijnc/12/1/12_26/_article %N 1 %! IJNC %R 10.15803/ijnc.12.1_26 %0 Conference Paper %B 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) %D 2022 %T Implicit Actions and Non-blocking Failure Recovery with MPI %A Bouteiller, Aurélien %A George Bosilca %X Scientific applications have long embraced MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap.
This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints). %B 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) %I IEEE %C Dallas, TX, USA %8 2023-01 %G eng %U https://ieeexplore.ieee.org/document/10024038/ %R 10.1109/FTXS56515.2022.00009 %0 Conference Proceedings %B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) %D 2022 %T Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach %A Whitlock, Matthew %A Morales, Nicolas %A George Bosilca %A Bouteiller, Aurélien %A Nicolae, Bogdan %A Teranishi, Keita %A Giem, Elisabeth %A Sarkar, Vivek %K checkpointing %K Fault tolerance %K Fenix %K HPC %K Kokkos %K MPI-ULFM %K resilience %B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) %C Heidelberg, Germany %8 2022-09 %G eng %U https://hal.archives-ouvertes.fr/hal-03772536 %0 Conference Proceedings %B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2021 %T Revisiting Credit Distribution Algorithms for Distributed Termination Detection %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Le Fèvre, Valentin %A Robert, Yves %A Jack Dongarra %K control messages %K credit distribution algorithms %K task-based HPC application %K 
Termination detection %X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. %B 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %P 611–620 %G eng %R 10.1109/IPDPSW52791.2021.00095 %0 Journal Article %J Future Generation Computer Systems %D 2020 %T Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution %A Nuria Losada %A Patricia González %A María J. Martín %A George Bosilca %A Aurelien Bouteiller %A Keita Teranishi %K Application-level checkpointing %K MPI %K resilience %K ULFM %X The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. 
The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI Forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which reduce the overhead and achieve the level of efficiency required by future exascale platforms. %B Future Generation Computer Systems %V 106 %P 467-481 %8 2020-05 %G eng %U https://www.sciencedirect.com/science/article/pii/S0167739X1930860X %R https://doi.org/10.1016/j.future.2020.01.026 %0 Conference Paper %B IEEE International Conference on Cluster Computing (Cluster 2020) %D 2020 %T Flexible Data Redistribution in a Task-Based Runtime System %A Qinglei Cao %A George Bosilca %A Wei Wu %A Dong Zhong %A Aurelien Bouteiller %A Jack Dongarra %X Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore decreasing the time-to-solution for the algorithm. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions.
Recently, task-based runtime systems have gained popularity as potential candidates to address the programming complexity on the way to exascale. In addition to increased portability across complex hardware and software systems, task-based runtime systems have the potential to more easily cope with less-regular data distributions, providing a more balanced computational load during the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems, one that can support all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm, capable of dealing with redistribution problems without constraints on data distribution and data size, and implement it in a task-based runtime system, PaRSEC. Performance results show strong capability compared to ScaLAPACK, and applications highlight increased efficiency with little overhead in terms of data distribution and data size. %B IEEE International Conference on Cluster Computing (Cluster 2020) %I IEEE %C Kobe, Japan %8 2020-09 %G eng %R https://doi.org/10.1109/CLUSTER49012.2020.00032 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2020 %T Overhead of Using Spare Nodes %A Atsushi Hori %A Kazumi Yoshinaga %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Yutaka Ishikawa %K communication performance %K fault mitigation %K Fault tolerance %K sliding method %K spare node %X With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure.
Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed around 10 times depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an Infiniband network connected with a FatTree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics. 
%B The International Journal of High Performance Computing Applications %8 2020-02 %G eng %U https://journals.sagepub.com/doi/10.1177/1094342020901885 %! The International Journal of High Performance Computing Applications %R https://doi.org/10.1177/1094342020901885 %0 Conference Paper %B Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19) %D 2019 %T Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications %A Nuria Losada %A Aurelien Bouteiller %A George Bosilca %K checkpoint/restart %K Fault tolerance %K Message logging %K MPI %K ULFM %K User Level Fault Mitigation %X With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solved both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs.
This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions. %B Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19) %8 2019-11 %G eng %U https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_ftxs103s2-file1.pdf %0 Journal Article %J International Journal of Networking and Computing %D 2019 %T Checkpointing Strategies for Shared High-Performance Computing Platforms %A Thomas Herault %A Yves Robert %A Aurelien Bouteiller %A Dorian Arnold %A Kurt Ferreira %A George Bosilca %A Jack Dongarra %X Input/output (I/O) from various sources often contends for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale.
We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem. %B International Journal of Networking and Computing %V 9 %P 28–52 %G eng %U http://www.ijnc.org/index.php/ijnc/article/view/195 %0 Journal Article %J Parallel Computing %D 2019 %T Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms %A Valentin Le Fèvre %A Thomas Herault %A Yves Robert %A Aurelien Bouteiller %A Atsushi Hori %A George Bosilca %A Jack Dongarra %B Parallel Computing %V 85 %P 1–12 %8 2019-07 %G eng %R https://doi.org/10.1016/j.parco.2019.02.002 %0 Journal Article %J Future Generation Computer Systems %D 2019 %T Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging %A Nuria Losada %A George Bosilca %A Aurelien Bouteiller %A Patricia González %A María J. Martín %K Application-level checkpointing %K Local rollback %K Message logging %K MPI %K resilience %X The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done.
The User Level Failure Mitigation (ULFM) interface – the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard – enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level—thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements, and, overall, the resilience impact on the applications. %B Future Generation Computer Systems %V 91 %P 450-464 %8 2019-02 %G eng %R https://doi.org/10.1016/j.future.2018.09.041 %0 Journal Article %J Parallel Computing %D 2019 %T Performance of Asynchronous Optimized Schwarz with One-sided Communication %A Ichitaro Yamazaki %A Edmond Chow %A Aurelien Bouteiller %A Jack Dongarra %X In asynchronous iterative methods on distributed-memory computers, processes update their local solutions using data from other processes without an implicit or explicit global synchronization that corresponds to advancing the global iteration counter. In this work, we test the asynchronous optimized Schwarz domain-decomposition iterative method using various one-sided (remote direct memory access) communication schemes with passive target completion.
The results show that when one-sided communication is well-supported, the asynchronous version of optimized Schwarz can outperform the synchronous version even for perfectly balanced partitionings of the problem on a supercomputer with uniform nodes. %B Parallel Computing %V 86 %P 66-81 %8 2019-08 %G eng %U http://www.sciencedirect.com/science/article/pii/S0167819118301261 %R https://doi.org/10.1016/j.parco.2019.05.004 %0 Conference Paper %B European MPI Users' Group Meeting (EuroMPI '19) %D 2019 %T Runtime Level Failure Detection and Propagation in HPC Systems %A Dong Zhong %A Aurelien Bouteiller %A Xi Luo %A George Bosilca %X As the scale of high-performance computing (HPC) systems continues to grow, the mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present here the design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault management capabilities to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is both generic and efficient.
%B European MPI Users' Group Meeting (EuroMPI '19) %I ACM %C Zürich, Switzerland %8 2019-09 %@ 978-1-4503-7175-9 %G eng %R https://doi.org/10.1145/3343211.3343225 %0 Book Section %B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project %D 2019 %T System Software for Many-Core and Multi-Core Architectures %A Atsushi Hori %A Tsujita, Yuichi %A Shimada, Akio %A Yoshinaga, Kazumi %A Mitaro, Namiki %A Fukazawa, Go %A Sato, Mikiko %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %E Sato, Mitsuhisa %X In this project, software technologies for post-peta-scale computing were explored. More specifically, OS technologies for heterogeneous architectures, lightweight threads, scalable I/O, and fault mitigation were investigated. As for the OS technologies, a new parallel execution model for many-core CPUs, Partitioned Virtual Address Space (PVAS), was proposed. For heterogeneous architectures, where a multi-core CPU and a many-core CPU are connected with an I/O bus, an extension of PVAS, Multiple-PVAS, was proposed to provide a unified virtual address space across the multi-core and many-core CPUs. The proposed PVAS was also enhanced to support multiple processes whose context switches can take place at the user level (named User-Level Process: ULP). As for scalable I/O, EARTH, a set of optimization techniques for MPI collective I/O, was proposed. Lastly, for fault mitigation, User-Level Fault Mitigation (ULFM) was improved with a faster agreement process, and sliding methods to substitute failed nodes with spare nodes were proposed. The funding of this project ended in 2016; however, many of the proposed technologies are still being advanced.
%B Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project %I Springer Singapore %C Singapore %P 59–75 %@ 978-981-13-1924-2 %G eng %U https://doi.org/10.1007/978-981-13-1924-2_4 %R 10.1007/978-981-13-1924-2_4 %0 Generic %D 2018 %T Data Movement Interfaces to Support Dataflow Runtimes %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %X This document presents the design study and reports on the implementation of portable hosted accelerator device support in the PaRSEC Dataflow Tasking at Exascale runtime, undertaken as part of the ECP contract 17-SC-20-SC. The document discusses different technological approaches to transfer data to/from hosted accelerators, issues recommendations for technology providers, and presents the design of an OpenMP-based accelerator support in PaRSEC. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-05 %G eng %0 Generic %D 2018 %T Distributed Termination Detection for HPC Task-Based Environments %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Valentin Le Fèvre %A Yves Robert %A Jack Dongarra %X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data-dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.
On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-06 %G eng %0 Conference Paper %B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids %D 2018 %T Do moldable applications perform better on failure-prone HPC platforms? %A Valentin Le Fèvre %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Atsushi Hori %A Yves Robert %A Jack Dongarra %X This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic applicative scenario and make it publicly available for further usage. %B 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids %S LNCS %I Springer Verlag %C Turin, Italy %8 2018-08 %G eng %0 Conference Proceedings %B OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence %D 2018 %T Evaluating Contexts in OpenSHMEM-X Reference Implementation %A Aurelien Bouteiller %A Pophale, Swaroop %A Swen Boehm %A Baker, Matthew B. 
%A Manjunath Gorentla Venkata %E Manjunath Gorentla Venkata %E Imam, Neena %E Pophale, Swaroop %X Many-core processors are now ubiquitous in supercomputing. This evolution pushes toward the adoption of mixed models in which cores are exploited with threading models (and related programming abstractions, such as OpenMP), while communication between distributed memory domains employs a communication Application Programming Interface (API). OpenSHMEM is a partitioned global address space communication specification that exposes one-sided and synchronization operations. As the threaded semantics of OpenSHMEM are being fleshed out by its standardization committee, it is important to assess the soundness of the proposed concepts. This paper implements and evaluates the ``context'' extension in relation to threaded operations. We discuss the implementation challenges of the context and the associated API in OpenSHMEM-X. We then evaluate its performance in threaded situations on the InfiniBand network using micro-benchmarks and the Random Access benchmark, and find that adding communication contexts significantly improves the message rate achievable by the executing multi-threaded PEs. %B OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence %I Springer International Publishing %C Cham %P 50–62 %@ 978-3-319-73814-7 %G eng %R https://doi.org/10.1007/978-3-319-73814-7_4 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T A Failure Detector for HPC Platforms %A George Bosilca %A Aurelien Bouteiller %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Pierre Sens %A Jack Dongarra %K failure detection %K Fault tolerance %K MPI %X Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector.
This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage uses a nonuniform variant of a reliable broadcast over a circulant graph overlay network and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan supercomputer at Oak Ridge National Laboratory, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm. %B The International Journal of High Performance Computing Applications %V 32 %P 139–158 %8 2018-01 %G eng %N 1 %R https://doi.org/10.1177/1094342017711505 %0 Conference Paper %B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award %D 2018 %T Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms %A Thomas Herault %A Yves Robert %A Aurelien Bouteiller %A Dorian Arnold %A Kurt Ferreira %A George Bosilca %A Jack Dongarra %X In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources.
First, we provide a theoretical model and then derive a set of constraints necessary to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance. %B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award %I IEEE %C Vancouver, BC, Canada %8 2018-05 %G eng %R 10.1109/IPDPSW.2018.00127 %0 Journal Article %J Parallel Computing %D 2018 %T PMIx: Process Management for Exascale Environments %A Ralph Castain %A Joshua Hursey %A Aurelien Bouteiller %A David Solt %B Parallel Computing %V 79 %P 9–29 %8 2018-01 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S0167819118302424 %! Parallel Computing %R 10.1016/j.parco.2018.08.002 %0 Journal Article %J ISC High Performance 2017 %D 2017 %T A Framework for Out of Memory SVD Algorithms %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Aurelien Bouteiller %A Jack Dongarra %X Many important applications – from big data analytics to information retrieval, gene expression analysis, and numerical weather prediction – require the solution of large dense singular value decompositions (SVD).
In many cases the problems are too large to fit into the computer’s main memory, and thus require specialized out-of-core algorithms that use disk storage. In this paper, we analyze the SVD communications, as related to hierarchical memories, and design a class of algorithms that minimizes them. This class includes out-of-core SVDs but can also be applied between other consecutive levels of the memory hierarchy, e.g., GPU SVD using the CPU memory for large problems. We call these out-of-memory (OOM) algorithms. To design OOM SVDs, we first study the communications for both the classical one-stage blocked SVD and the two-stage tiled SVD. We present the theoretical analysis and strategies to design, as well as implement, these communication-avoiding OOM SVD algorithms. We show performance results for a multicore architecture that illustrate our theoretical findings and match our performance models. %B ISC High Performance 2017 %P 158–178 %8 2017-06 %G eng %R 10.1007/978-3-319-58667-0_9 %0 Conference Proceedings %B Proceedings of the 24th European MPI Users' Group Meeting %D 2017 %T PMIx: Process Management for Exascale Environments %A Castain, Ralph H. %A David Solt %A Joshua Hursey %A Aurelien Bouteiller %X High-Performance Computing (HPC) applications have historically executed in static resource allocations, using programming models that ran independently from the resident system management stack (SMS). Achieving exascale performance that is both cost-effective and fits within site-level environmental constraints will, however, require that the application and SMS collaboratively orchestrate the flow of work to optimize resource utilization and compensate for on-the-fly faults. 
The Process Management Interface - Exascale (PMIx) community is committed to establishing scalable workflow orchestration by defining an abstract set of interfaces by which not only applications and tools can interact with the resident SMS, but also the various SMS components can interact with each other. This paper presents a high-level overview of the goals and current state of the PMIx standard, and lays out a roadmap for future directions. %B Proceedings of the 24th European MPI Users' Group Meeting %S EuroMPI '17 %I ACM %C New York, NY, USA %P 14:1–14:10 %@ 978-1-4503-4849-2 %G eng %U http://doi.acm.org/10.1145/3127024.3127027 %R 10.1145/3127024.3127027 %0 Generic %D 2017 %T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale %A Ahmad Abdelfattah %A Hartwig Anzt %A Aurelien Bouteiller %A Anthony Danalis %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Stephen Wood %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %9 SLATE Working Notes %1 01 %0 Conference Proceedings %B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16) %D 2016 %T Failure Detection and Propagation in HPC Systems %A George Bosilca %A Aurelien Bouteiller %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Pierre Sens %A Jack Dongarra %K failure detection %K fault-tolerance %K MPI %B Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16) %I IEEE Press %C Salt Lake City, Utah %P 27:1-27:11 %8 2016-11 %@ 978-1-4673-8815-3 %G eng %U http://dl.acm.org/citation.cfm?id=3014904.3014941 %0 Conference Proceedings %B OpenSHMEM and Related Technologies. 
Enhancing OpenSHMEM for Hybrid Environments %D 2016 %T Surviving Errors with OpenSHMEM %A Aurelien Bouteiller %A George Bosilca %A Manjunath Gorentla Venkata %E Manjunath Gorentla Venkata %E Imam, Neena %E Pophale, Swaroop %E Mintz, Tiffany M. %X Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, or program errors. As the scale of HPC systems continues to grow, so does the probability of encountering a condition that causes a failure; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are beginning to feature advanced error management. As a result of these developments, it becomes increasingly desirable to gracefully handle error conditions in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that can (1) notify user code of unexpected erroneous conditions, (2) permit customized user response to errors without incurring overhead on an error-free execution path, (3) propagate the occurrence of an error condition to all Processing Elements, and (4) consistently close the erroneous epoch in order to resume the application. %B OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments %I Springer International Publishing %C Baltimore, MD, USA %P 66–81 %@ 978-3-319-50995-2 %G eng %0 Journal Article %J ACM Transactions on Parallel Computing %D 2015 %T Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Peng Du %A Jack Dongarra %E Phillip B. Gibbons %K ABFT %K algorithms %K fault-tolerance %K High Performance Computing %K linear algebra %X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. 
Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right-factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases inversely with the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. The ability to tolerate multiple failures, and the accuracy after multiple recoveries, are also considered. 
%B ACM Transactions on Parallel Computing %V 1 %P 10:1-10:28 %8 2015-01 %G eng %N 2 %R 10.1145/2686892 %0 Journal Article %J International Journal of Networking and Computing %D 2015 %T Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K model %K performance evaluation %K resilience %X Algorithm-Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming algorithm to compute checkpoint dates. We validate these models using a simulator. 
The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint. %B International Journal of Networking and Computing %V 5 %P 2-15 %8 2015-01 %G eng %0 Conference Proceedings %B OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies %D 2015 %T From MPI to OpenSHMEM: Porting LAMMPS %A Tang, Chunyan %A Aurelien Bouteiller %A Thomas Herault %A Manjunath Gorentla Venkata %A George Bosilca %E Manjunath Gorentla Venkata %E Shamis, Pavel %E Imam, Neena %E M. Graham Lopez %X This work details the opportunities and challenges of porting a Petascale, MPI-based application, LAMMPS, to OpenSHMEM. We investigate the major programming challenges stemming from the differences in communication semantics, address space organization, and synchronization operations between the two programming models. This work provides several approaches to solve those challenges for representative communication patterns in LAMMPS, e.g., by considering group synchronization, peer buffer status tracking, and unpacked direct transfer of scattered data. The performance of LAMMPS is evaluated on the Titan HPC system at ORNL. The OpenSHMEM implementations are compared with MPI versions in terms of both strong and weak scaling. The results outline that OpenSHMEM provides rich semantics with which to implement scalable scientific applications. In addition, the experiments demonstrate that OpenSHMEM can compete with, and often improve on, the optimized MPI implementation. %B OpenSHMEM and Related Technologies. 
Experiences, Implementations, and Technologies %I Springer International Publishing %C Annapolis, MD, USA %P 121–137 %@ 978-3-319-26428-8 %G eng %R 10.1007/978-3-319-26428-8_8 %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Hierarchical DAG scheduling for Hybrid Distributed Systems %A Wei Wu %A Aurelien Bouteiller %A George Bosilca %A Mathieu Faverge %A Jack Dongarra %K dense linear algebra %K gpu %K heterogeneous architecture %K PaRSEC runtime %X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e., Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments. 
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Conference Paper %B 22nd European MPI Users' Group Meeting %D 2015 %T Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %X Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation, of the Revoke MPI operation. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods. 
%B 22nd European MPI Users' Group Meeting %I ACM %C Bordeaux, France %8 2015-09 %G eng %R 10.1145/2802658.2802668 %0 Generic %D 2015 %T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Marc Gamell %A Keita Teranishi %A Manish Parashar %A Jack Dongarra %B Innovative Computing Laboratory Technical Report %8 2015-04 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems %A Thomas Herault %A Aurelien Bouteiller %A George Bosilca %A Marc Gamell %A Keita Teranishi %A Manish Parashar %A Jack Dongarra %X The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault tolerant scientific applications. %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Proceedings %B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects %D 2015 %T UCX: An Open Source Framework for HPC Network APIs and Beyond %A P. Shamis %A Manjunath Gorentla Venkata %A M. 
Graham Lopez %A M. B. Baker %A O. Hernandez %A Y. Itigin %A M. Dubman %A G. Shainer %A R. L. Graham %A L. Liss %A Y. Shahar %A S. Potluri %A D. Rossetti %A D. Becker %A D. Poole %A C. Lamb %A S. Kumar %A C. Stunkel %A George Bosilca %A Aurelien Bouteiller %K application program interfaces %K Bandwidth %K Electronics packaging %K Hardware %K high throughput computing %K highly-scalable network stack %K HPC %K HPC network APIs %K I/O bound applications %K Infiniband %K input-output programs %K Libraries %K Memory management %K message passing %K message passing interface %K Middleware %K MPI %K open source framework %K OpenSHMEM %K parallel programming %K parallel programming models %K partitioned global address space languages %K PGAS %K PGAS languages %K Programming %K protocols %K public domain software %K RDMA %K system libraries %K task-based paradigms %K UCX %K Unified Communication X %X This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype is very close to that of the underlying driver. 
With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any publicly known network stack on this hardware. %B 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects %I IEEE %C Santa Clara, CA, USA %P 40-43 %8 2015-08 %@ 978-1-4673-9160-3 %G eng %M 15573048 %R 10.1109/HOTI.2015.13 %0 Conference Paper %B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014 %D 2014 %T Assessing the Impact of ABFT and Checkpoint Composite Strategies %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K resilience %X Algorithm-specific fault-tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. 
In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing the means to take checkpoints less frequently while simultaneously decreasing the volume of data to be checkpointed. %B 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 8th International Conference on Partitioned Global Address Space Programming Models (PGAS) %D 2014 %T A Multithreaded Communication Substrate for OpenSHMEM %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %X OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an acceptable cost and can improve efficiency for some operations, compared to serializing communication. %B 8th International Conference on Partitioned Global Address Space Programming Models (PGAS) %C Eugene, OR %8 2014-10 %G eng %0 Conference Paper %B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %D 2014 %T PTG: An Abstraction for Unhindered Parallelism %A Anthony Danalis %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %K dte %K parsec %K plasma %X

Increased parallelism and the use of heterogeneous computing resources are now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice has invariably been derived from coarse-grain parallelism with explicit data movements. We argue that message passing has remained the de facto standard in HPC because, until now, the ever-increasing challenges that application developers had to address to create efficient portable applications remained manageable for expert programmers.

Data-flow based programming is an alternative approach with significant potential. In this paper, we discuss the Parameterized Task Graph (PTG) abstraction and present the specialized input language that we use to specify PTGs in our data-flow task-based runtime system, PaRSEC. This language and the corresponding execution model contrast with the execution model of explicit message passing, as well as with that of alternative task-based runtime systems. The Parameterized Task Graph language decouples the expression of the parallelism in the algorithm from the control-flow ordering, load balance, and data distribution. Thus, programs are more adaptable and map more efficiently onto challenging hardware, while maintaining portability across diverse architectures. To support these claims, we discuss the different challenges of HPC programming and how PaRSEC can address them, and we demonstrate that on today’s large-scale supercomputers, PaRSEC can significantly outperform state-of-the-art MPI applications and libraries, a trend that will increase with future architectural evolution.

%B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %I IEEE Press %C New Orleans, LA %8 2014-11 %G eng %0 Generic %D 2013 %T Assessing the impact of ABFT and Checkpoint composite strategies %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Yves Robert %A Jack Dongarra %K ABFT %K checkpoint %K fault-tolerance %K High-performance computing %K resilience %X Algorithm-specific fault-tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been proven capable of such properties. These algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart protection for effective protection of each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We highlight that this approach drastically increases the performance delivered by the system, especially at scale, by providing the means to take checkpoints less frequently while simultaneously decreasing the volume of data to be checkpointed. 
%B University of Tennessee Computer Science Technical Report %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X With our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint/restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes. 
%B Concurrency and Computation: Practice and Experience %V 25 %P 572-585 %8 2013-03 %G eng %N 4 %R 10.1002/cpe.2859 %0 Journal Article %J Scalable Computing and Communications: Theory and Practice %D 2013 %T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %E Samee Khan %E Lin-Wang Wang %E Albert Zomaya %B Scalable Computing and Communications: Theory and Practice %I John Wiley & Sons %P 699-735 %8 2013-03 %G eng %0 Conference Paper %B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems %D 2013 %T Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures %A Volodymyr Turchenko %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K many-core system %K parallel batch pattern training %K parallelization efficiency %K recirculation neural network %X This paper presents experimental research on the parallel batch pattern back-propagation training algorithm, using a recirculation neural network as an example, on many-core high-performance computing systems. The choice of a recirculation neural network over the multilayer perceptron, recurrent, and radial basis neural networks is justified. The model of a recirculation neural network and the usual sequential batch pattern algorithm for its training are described theoretically. An algorithmic description of the parallel version of the batch pattern training method is presented. The experiments are carried out using the Open MPI, MVAPICH, and Intel MPI message passing libraries. The results obtained on a many-core AMD system and on the Intel MIC are compared with those obtained on a cluster system. Our results show that the parallelization efficiency is about 95% on 12 cores located inside one physical AMD processor for the considered minimum and maximum scenarios. 
The parallelization efficiency is about 70-75% on 48 AMD cores for the minimum and maximum scenarios. These results are 15-36% higher (depending on the MPI library version) than those obtained on 48 cores of a cluster system. The parallelization efficiency obtained on the Intel MIC architecture is surprisingly low, calling for deeper analysis. %B 7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems %C Berlin, Germany %8 2013-09 %G eng %0 Journal Article %J Computing %D 2013 %T An evaluation of User-Level Failure Mitigation support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %K Fault tolerance %K MPI %K User-level fault mitigation %X As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures. %B Computing %V 95 %P 1171-1184 %8 2013-12 %G eng %N 12 %R 10.1007/s00607-013-0331-3 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. 
Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: (i) traditional checkpoint-based approaches incur a steep overhead on failure-free operations and (ii) the dominant programming paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI Standard, to enable advanced forward recovery techniques, without incurring the overhead of customary periodic checkpointing. With our approach, when a failure strikes, applications regain control to make a checkpoint before quitting execution. This checkpoint is taken in reaction to the failure occurrence rather than periodically. It is then reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example. Published 2013. This article is a US Government work and is in the public domain in the USA. %B Concurrency and Computation: Practice and Experience %8 2013-07 %G eng %U http://doi.wiley.com/10.1002/cpe.3100 %! Concurrency Computat.: Pract. Exper. 
%R 10.1002/cpe.3100 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %K Cluster %K Collective communication %K Hierarchical %K HPC %K MPI %K Multicore %X Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted topology-aware collective framework, and the mechanisms deployed by this framework to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. 
We demonstrate experimentally, by considering three of the most commonly used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to modifications of the underlying process-core binding; (2) it outperforms state-of-the-art MPI libraries (Open MPI, MPICH2 and MVAPICH2), demonstrating up to a 30x speedup for synthetic benchmarks, and up to a 3x acceleration for a parallel graph application (ASP); (3) it furthermore demonstrates a linear speedup with the increase of the number of cores per compute node, a paramount requirement for scalability on future many-core hardware. %B Journal of Parallel and Distributed Computing %V 73 %P 1000-1010 %8 2013-07 %G eng %U http://www.sciencedirect.com/science/article/pii/S0743731513000166 %N 7 %R 10.1016/j.jpdc.2013.01.015 %0 Generic %D 2013 %T Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization %A Aurelien Bouteiller %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. 
We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing leaves per-application completion time unchanged, while it delivers near-perfect platform efficiency. %B University of Tennessee Computer Science Technical Report %8 2013-02 %G eng %0 Conference Paper %B Euro-Par 2013 %D 2013 %T Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization %A Aurelien Bouteiller %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %X Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing leaves per-application completion time unchanged, while it delivers near-perfect platform efficiency. 
%B Euro-Par 2013 %I Springer %C Aachen, Germany %8 2013-08 %G eng %0 Journal Article %J IEEE Computing in Science and Engineering %D 2013 %T PaRSEC: Exploiting Heterogeneity to Enhance Scalability %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Thomas Herault %A Jack Dongarra %X New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution environment with sufficient intelligence and flexibility to rearrange the execution to improve resource utilization. %B IEEE Computing in Science and Engineering %V 15 %P 36-45 %8 2013-11 %G eng %N 6 %R 10.1109/MCSE.2013.98 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2013 %T Post-failure recovery of MPI communication capability: Design and rationale %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. 
The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery. %B International Journal of High Performance Computing Applications %V 27 %P 244-254 %8 2013-01 %G eng %U http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238 %N 3 %! International Journal of High Performance Computing Applications %R 10.1177/1094342013488238 %0 Book Section %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %D 2013 %T Scalable Dense Linear Algebra on Heterogeneous Hardware %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X The design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. It is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. 
Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes this approach to distributed-memory machines, using automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores. %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale %A George Bosilca %A Aurelien Bouteiller %A Elisabeth Brunet %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. 
%B Concurrency and Computation: Practice and Experience %8 2013-11 %G eng %R 10.1002/cpe.3173 %0 Conference Proceedings %B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012 %D 2012 %T Algorithm-Based Fault Tolerance for Dense Matrix Factorization %A Peng Du %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %E J. Ramanujam %E P. Sadayappan %K ft-la %K ftmpi %X Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases with the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, with and without errors. 
%B Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012 %I ACM %C New Orleans, LA, USA %P 225-234 %8 2012-02 %G eng %R 10.1145/2145816.2145845 %0 Conference Proceedings %B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award) %D 2012 %T A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Christos Kaklamanis %E Theodore Papatheodorou %E Paul Spirakis %B 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award) %I Springer-Verlag %C Rhodes, Greece %8 2012-08 %G eng %0 Journal Article %J Parallel Computing %D 2012 %T DAGuE: A generic distributed DAG Engine for High Performance Computing. %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %K parsec %B Parallel Computing %I Elsevier %V 38 %P 27-51 %8 2012-00 %G eng %0 Conference Proceedings %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %D 2012 %T An Evaluation of User-Level Failure Mitigation Support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %I Springer %C Vienna, Austria %8 2012-09 %G eng %0 Generic %D 2012 %T Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI %A Wesley Bland %A Peng Du %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %K ftmpi %B University of Tennessee Computer Science Technical Report %8 2012-00 %G eng %0 Conference Paper %B International European Conference on Parallel and Distributed Computing (Euro-Par 
'12) %D 2012 %T From Serial Loops to Parallel Execution on Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jack Dongarra %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %C Rhodes, Greece %8 2012-08 %G eng %0 Journal Article %J IPDPS 2012 (Best Paper) %D 2012 %T HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Jack Dongarra %B IPDPS 2012 (Best Paper) %C Shanghai, China %8 2012-05 %G eng %0 Generic %D 2012 %T A Proposal for User-Level Failure Mitigation in the MPI-3 Standard %A Wesley Bland %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %K ftmpi %B University of Tennessee Electrical Engineering and Computer Science Technical Report %I University of Tennessee %8 2012-02 %G eng %0 Generic %D 2012 %T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale %A George Bosilca %A Aurelien Bouteiller %A Elisabeth Brunet %A Franck Cappello %A Jack Dongarra %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Frederic Vivien %A Dounia Zaidouni %B University of Tennessee Computer Science Technical Report (also LAWN 269) %8 2012-06 %G eng %0 Generic %D 2011 %T Algorithm-based Fault Tolerance for Dense Matrix Factorizations %A Peng Du %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %D 2011 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Emmanuel Jeannot %E Raymond Namyst %E Jean Roman %K ftmpi %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %I Springer %C Bordeaux, 
France %V 6853 %P 51-64 %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T DAGuE: A Generic Distributed DAG Engine for High Performance Computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1151-1158 %8 2011-00 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1432-1441 %8 2011-05 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. Nikolopoulos %E Jack Dongarra %K dague %B 18th EuroMPI %I Springer %C Santorini, Greece %P 247-254 %8 2011-09 %G eng %0 Conference Proceedings %B Int'l Conference on Parallel Processing (ICPP '11) %D 2011 %T Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. 
Squyres %A Jack Dongarra %B Int'l Conference on Parallel Processing (ICPP '11) %C Taipei, Taiwan %8 2011-09 %G eng %0 Journal Article %J IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %D 2011 %T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %K magma %K parsec %B IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %8 2011-06 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Generic %D 2010 %T DAGuE: A generic distributed DAG engine for high performance computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %K dague %B Innovative Computing Laboratory Technical Report %8 2010-04 %G eng %0 Generic %D 2010 %T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-10-660 %8 2010-09 %G eng %0 Generic %D 2010 %T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A 
Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K plasma %B Innovative Computing Laboratory Technical Report %8 2010-00 %G eng %0 Conference Proceedings %B Proceedings of EuroMPI 2010 %D 2010 %T Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Jack Dongarra %E Jack Dongarra %E Michael Resch %E Rainer Keller %E Edgar Gabriel %K ftmpi %B Proceedings of EuroMPI 2010 %I Springer %C Stuttgart, Germany %8 2010-09 %G eng %0 Generic %D 2010 %T Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs %A Teng Ma %A George Bosilca %A Aurelien Bouteiller %A Brice Goglin %A J. Squyres %A Jack Dongarra %B University of Tennessee Computer Science Technical Report, UT-CS-10-663 %8 2010-11 %G eng %0 Conference Proceedings %B Proceedings of the 17th EuroMPI conference %D 2010 %T Locality and Topology aware Intra-node Communication Among Multicore CPUs %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Proceedings of the 17th EuroMPI conference %I LNCS %C Stuttgart, Germany %8 2010-09 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience (online version) %D 2010 %T Redesigning the Message Logging Model for High Performance %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B Concurrency and Computation: Practice and Experience (online version) %8 2010-06 %G eng %0 Conference Paper %B CLUSTER '09 %D 2009 %T Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery %A Aurelien Bouteiller %A Thomas Ropars %A George Bosilca %A Christine Morin %A Jack Dongarra %K fault tolerant computing %K libraries message passing %K parallel machines %K protocols %X With the growing scale of high performance computing platforms, fault tolerance has become a 
major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proven to tolerate higher failure rates. However, this advantage comes at the expense of a higher overhead on communications, due to latency-intrusive logging of events to stable storage. Previous work proposed and evaluated several protocols relaxing the synchronicity of event logging to moderate this overhead. Recently, the model of message logging has been refined to better match the reality of high-performance network cards, where message receptions are decomposed into multiple interdependent events. According to this new model, deterministic and non-deterministic events are clearly discriminated, reducing the overhead induced by message logging. In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library. Although pessimistic and optimistic message logging are, respectively, the most and least synchronous message logging paradigms, experiments show that most of the time their performance is comparable. 
%B CLUSTER '09 %I IEEE %C New Orleans %8 2009-08 %G eng %R 10.1109/CLUSTR.2009.5289157 %0 Conference Proceedings %B 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008) %D 2008 %T Fault Tolerance Management for a Hierarchical GridRPC Middleware %A Aurelien Bouteiller %A Frederic Desprez %B 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008) %C Lyon, France %8 2008-01 %G eng %0 Conference Proceedings %B International Supercomputer Conference (ISC 2008) %D 2008 %T Redesigning the Message Logging Model for High Performance %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %B International Supercomputer Conference (ISC 2008) %C Dresden, Germany %8 2008-01 %G eng %0 Journal Article %J Accepted for Euro PVM/MPI 2007 %D 2007 %T Retrospect: Deterministic Relay of MPI Applications for Interactive Distributed Debugging %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %K ftmpi %B Accepted for Euro PVM/MPI 2007 %I Springer %8 2007-09 %G eng