Thomas Herault

Position

I am a Research Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. I can be reached at +1 (865) 974-6321 or in person in my office (308 Claxton).

Former Positions

From Sept. 2010 to January 2018, I worked as a Research Scientist at the Innovative Computing Laboratory at the University of Tennessee, Knoxville.

From Sept. 2004 to August 2010, I was an Assistant Professor (Maître de Conférences) at the Université Paris-Sud XI, in the Laboratoire de Recherche en Informatique. I have been on secondment to the University of Tennessee since then.

Diploma and Titles

1998 - BSc (Licence & Maîtrise) in Computer Science from the Université Paris-Sud XI (France)

1999 - MSc (Diplôme d'Études Approfondies) in Computer Science from the Université Paris-Sud XI (France)

2003 - PhD in Computer Science (Thèse) from the Université Paris-Sud XI (France)

Research Projects

Fault Tolerant HPC Systems

ULFM is a set of MPI interface extensions enabling Message Passing programs to restore MPI communication capabilities affected by process failures. It supports rebuilding communicators, RMA windows and I/O files. No particular recovery model is imposed or favored; instead, a set of versatile APIs is included that provides support for different recovery styles. The application directs the recovery, so it pays only for the cost of repairing the MPI objects it needs. The ULFM specification is a crucial infrastructure to enable the deployment of advanced, production-quality fault tolerant techniques; it is a versatile solution to improve the efficiency of novel and established fault tolerant techniques. See the flyer.

SMURFS was an NSF SHF Collaborative Research project with Kurt Ferreira (Sandia National Laboratories) and Dorian Arnold (Emory University) in which we extended theoretical performance models for the large variety of fault tolerant protocols for High Performance Computing, evaluated these models for accuracy and predictability on forthcoming systems, and designed, developed and evaluated simulation tools to complete and extend these performance models with validation mechanisms.

MPICH-V was a research effort combining theoretical studies, experimental evaluations and pragmatic implementations, aiming to provide an MPI implementation based on MPICH and featuring multiple fault tolerant protocols. MPICH-V provides an automatic fault tolerant MPI library (i.e., a totally unchanged application linked with the mpich-v library is a fault tolerant application).

Dataflow Execution Model for HPC

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. The applications we consider can be expressed as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size independent format that can be queried on demand to discover data dependencies in a totally distributed fashion. PaRSEC assigns computation threads to the cores, overlaps communications and computations, and uses a dynamic, fully distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.

The framework includes libraries, a runtime system, and development tools to help application developers tackle the difficult task of porting their applications to highly heterogeneous and diverse environments.

PaRSEC is the underlying infrastructure for the DPLASMA distributed memory, tile algorithm based linear algebra package.
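The idea of a compact, problem-size independent DAG that is queried on demand can be illustrated with a small sketch. This is not PaRSEC code and does not use its API: the wavefront task pattern, the function names (`predecessors`, `successors`, `run_wavefront`) and the single-threaded ready queue are all assumptions chosen for illustration. The point is only that dependencies are computed from task indices when needed, so the full graph is never stored.

```python
from collections import deque


def predecessors(task):
    """Tasks whose output task (i, j) consumes, computed from its indices."""
    i, j = task
    preds = []
    if i > 0:
        preds.append((i - 1, j))
    if j > 0:
        preds.append((i, j - 1))
    return preds


def successors(task, n):
    """Tasks that consume the output of task (i, j) in an n x n wavefront."""
    i, j = task
    succs = []
    if i + 1 < n:
        succs.append((i + 1, j))
    if j + 1 < n:
        succs.append((i, j + 1))
    return succs


def run_wavefront(n):
    """Run all n * n tasks in dependency order and return that order."""
    remaining = {}            # unmet-dependency counters, created lazily
    ready = deque([(0, 0)])   # the only task without predecessors
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)    # stand-in for executing the task body
        for succ in successors(task, n):
            if succ not in remaining:
                remaining[succ] = len(predecessors(succ))
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)
    return order
```

Calling `run_wavefront(3)` runs nine tasks, each after the tasks it depends on. A distributed runtime would additionally decide where each task runs and move data between nodes, which this sketch leaves out.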

TESSE was a collaborative research project funded by the NSF. The goals of TESSE are to design and demonstrate, via substantial scientific simulations within chemistry and other disciplines, a prototype software framework that provides a groundbreaking response to the twin problems of portable performance and programmer productivity for advanced scientific applications on emerging massively-parallel, hybrid, many-core systems. TESSE will create a viable foundation for a new generation of science codes, one which enables even more rapid exploration of new physical models, provides greatly enhanced performance portability through directed acyclic graph (DAG) scheduling and auto-tuned kernels, and works towards full interoperability between major chemistry packages through compatible runtimes and data structures. TESSE will mature to become a standard, widely available, community-based and broadly-applicable parallel programming environment complementing and rivaling MPI/OpenMP. This is needed due to the widely appreciated shortfalls of existing mainstream programming models and the already huge successes of the existing DAG-based runtimes that are the foundation of the next generation of NSF and DOE supported (Sca)LAPACK high-performance linear algebra libraries.

EPEXA is an NSF-supported R&D project that will create a production-quality, general-purpose, community-supported, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. Of special interest are irregular and sparse applications that are poorly served by current programming and execution models.

Message Passing

Evolve aimed at enhancing the Open MPI software library, focusing on two aspects: (1) Extend Open MPI to support new features of the MPI specification. The two most significant areas within the context of this proposal are (a) extensions to better support hybrid programming models and (b) support for fault tolerance in MPI applications. (2) Enhance the Open MPI core to support new architectures and improve scalability. While Open MPI has demonstrated very good scalability in the past, there is significant work to be done to ensure similarly good performance on future architectures.

Formal Verification & Security

APMC implemented approximate model checking techniques, in a collaboration with Richard Lassaigne (Univ Paris 7) and Sylvain Peyronnet (X-labs). This tool is of interest to the model checking community since it was the only one to implement approximate model checking for probabilistic models. It uses a massively parallel approach to enable the verification of very large systems, as was done for the verification of the CSMA/CD protocol.
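As a rough illustration of approximate probabilistic model checking, the sketch below estimates the probability that a small Markov chain reaches a goal state within a bounded number of steps by sampling random paths. It is not APMC's implementation: the chain, the function names and the use of the Hoeffding bound to choose the number of samples are assumptions made for this example. With N >= ln(2/delta) / (2 * epsilon^2) independent samples, the estimate is within epsilon of the true probability with probability at least 1 - delta.

```python
import math
import random


def sample_size(epsilon, delta):
    """Number of samples given by the Hoeffding bound for error epsilon, confidence 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def simulate_reaches(chain, start, goal, max_steps, rng):
    """Draw one random path of at most max_steps steps; report whether it hits goal."""
    state = start
    for _ in range(max_steps):
        if state == goal:
            return True
        next_states, weights = zip(*chain[state])
        state = rng.choices(next_states, weights=weights)[0]
    return state == goal


def estimate_reachability(chain, start, goal, max_steps, epsilon, delta, seed=0):
    """Monte Carlo estimate of the bounded reachability probability."""
    rng = random.Random(seed)
    n = sample_size(epsilon, delta)
    hits = sum(simulate_reaches(chain, start, goal, max_steps, rng) for _ in range(n))
    return hits / n


# Hypothetical two-state chain: from "try", reach "goal" with probability 0.5 per step.
CHAIN = {"try": [("goal", 0.5), ("try", 0.5)]}
```

For `CHAIN`, the exact probability of reaching `"goal"` from `"try"` within two steps is 0.75, and `estimate_reachability(CHAIN, "try", "goal", 2, 0.05, 0.01)` returns an estimate that is expected to fall within 0.05 of it. Because the sampled paths are independent, the sampling loop can be split across many machines, which is what makes this style of verification amenable to massive parallelism.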

I was Principal Investigator of the SAFE-OS project, representing Université Paris-Sud in the French ANR challenge (défi) “Sécurité et Confidentialité des Systèmes d’Information” (SEC&SI: security and confidentiality of information systems). This was a new kind of project for the ANR (Agence Nationale de la Recherche, the French NSF), one that put multiple research teams in competition on the same project. The project proceeded in two alternating phases: the work produced by each team during the development period was evaluated by the other teams during the evaluation period. Teams reported security breaches found in other teams' operating systems, and the value of these security breaches was ranked by an independent jury. The goal of the project was to design an operating system with improved security features for an internet user. During this project, we used the strong expertise of the Parall team of LRI at Université Paris-Sud on virtualization to propose a solution based on virtual machines. Using virtual machines, we transformed the computer into a distributed system, thereby providing better isolation of resources and increasing the security and confidentiality of the data and of the processes.

Publications

[1]
Anne Benoit, Thomas Hérault, Lucas Perotin, Yves Robert, and Frédéric Vivien. Revisiting I/O bandwidth-sharing strategies for HPC applications. J. Parallel Distributed Comput., to appear. [ pdf ]
[1]
George Bosilca, Aurélien Bouteiller, Thomas Hérault, Valentin Le Fèvre, Yves Robert, and Jack J. Dongarra. Comparing distributed termination detection algorithms for modern HPC platforms. Int. J. Netw. Comput., 12(1):26--46, 2022. [ http ] [ pdf ]
[2]
Atsushi Hori, Kazumi Yoshinaga, Thomas Hérault, Aurélien Bouteiller, George Bosilca, and Yutaka Ishikawa. Overhead of using spare nodes. Int. J. High Perform. Comput. Appl., 34(2), 2020. [ DOI ] [ pdf ]
[3]
Valentin Le Fèvre, Thomas Hérault, Yves Robert, Aurélien Bouteiller, Atsushi Hori, George Bosilca, and Jack J. Dongarra. Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms. Parallel Comput., 85:1--12, 2019. [ DOI ] [ pdf ]
[4]
Thomas Hérault, Yves Robert, Aurélien Bouteiller, Dorian C. Arnold, Kurt B. Ferreira, George Bosilca, and Jack J. Dongarra. Checkpointing strategies for shared high-performance computing platforms. Int. J. Netw. Comput., 9(1):28--52, 2019. [ http ] [ pdf ]
[5]
Sangmin Seo, Abdelhalim Amer, Pavan Balaji, Cyril Bordage, George Bosilca, Alex Brooks, Philip H. Carns, Adrián Castelló, Damien Genet, Thomas Hérault, Shintaro Iwasaki, Prateek Jindal, Laxmikant V. Kalé, Sriram Krishnamoorthy, Jonathan Lifflander, Huiwei Lu, Esteban Meneses, Marc Snir, Yanhua Sun, Kenjiro Taura, and Peter H. Beckman. Argobots: A lightweight low-level threading and tasking framework. IEEE Trans. Parallel Distributed Syst., 29(3):512--526, 2018. [ DOI ] [ pdf ]
[6]
George Bosilca, Aurélien Bouteiller, Amina Guermouche, Thomas Hérault, Yves Robert, Pierre Sens, and Jack J. Dongarra. A failure detector for HPC platforms. Int. J. High Perform. Comput. Appl., 32(1):139--158, 2018. [ DOI ] [ pdf ]
[7]
Julien Herrmann, George Bosilca, Thomas Hérault, Loris Marchal, Yves Robert, and Jack J. Dongarra. Assessing the cost of redistribution followed by a computational kernel: Complexity and performance results. Parallel Comput., 52:22--41, 2016. [ DOI ] [ pdf ]
[8]
Aurélien Bouteiller, Thomas Hérault, George Bosilca, Peng Du, and Jack J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy. ACM Trans. Parallel Comput., 1(2):10:1--10:28, 2015. [ DOI ] [ pdf ]
[9]
George Bosilca, Aurélien Bouteiller, Thomas Hérault, Yves Robert, and Jack J. Dongarra. Composing resilience techniques: ABFT, periodic and incremental checkpointing. Int. J. Netw. Comput., 5(1):2--25, 2015. [ http ] [ pdf ]
[10]
Jack J. Dongarra, Thomas Hérault, and Yves Robert. Performance and reliability trade-offs for the double checkpointing algorithm. Int. J. Netw. Comput., 4(1):23--41, 2014. [ http ] [ pdf ]
[11]
George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack J. Dongarra, Amina Guermouche, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Unified model for assessing checkpointing protocols at extreme-scale. Concurr. Comput. Pract. Exp., 26(17):2772--2791, 2014. [ DOI ] [ pdf ]
[12]
Jack J. Dongarra, Mathieu Faverge, Thomas Hérault, Mathias Jacquelin, Julien Langou, and Yves Robert. Hierarchical QR factorization algorithms for multi-core clusters. Parallel Comput., 39(4-5):212--232, 2013. [ DOI ] [ pdf ]
[13]
Wesley Bland, Aurélien Bouteiller, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Post-failure recovery of MPI communication capability: Design and rationale. Int. J. High Perform. Comput. Appl., 27(3):244--254, 2013. [ DOI ] [ pdf ]
[14]
George Bosilca, Aurélien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Hérault, and Jack J. Dongarra. PaRSEC: Exploiting heterogeneity to enhance scalability. Comput. Sci. Eng., 15(6):36--45, 2013. [ DOI ] [ pdf ]
[15]
Wesley Bland, Peng Du, Aurélien Bouteiller, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Concurr. Comput. Pract. Exp., 25(17):2381--2393, 2013. [ DOI ] [ pdf ]
[16]
Aurélien Bouteiller, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Correlated set coordination in fault tolerant message logging protocols for many-core clusters. Concurr. Comput. Pract. Exp., 25(4):572--585, 2013. [ DOI ] [ pdf ]
[17]
Wesley Bland, Aurélien Bouteiller, Thomas Hérault, Joshua Hursey, George Bosilca, and Jack J. Dongarra. An evaluation of user-level failure mitigation support in MPI. Computing, 95(12):1171--1184, 2013. [ DOI ] [ pdf ]
[18]
George Bosilca, Aurélien Bouteiller, Anthony Danalis, Thomas Hérault, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, and Jack J. Dongarra. Scalable dense linear algebra on heterogeneous hardware. In Erik H. D'Hollander, Jack J. Dongarra, Ian T. Foster, Lucio Grandinetti, and Gerhard R. Joubert, editors, Transition of HPC Towards Exascale Computing - Selected Papers from the High Performance Computing Workshop, Cetraro, Italy, June 25-29, 2012, volume 24 of Advances in Parallel Computing, pages 65--103. IOS Press, 2012. [ DOI ] [ pdf ]
[19]
George Bosilca, Aurélien Bouteiller, Anthony Danalis, Thomas Hérault, Pierre Lemarinier, and Jack J. Dongarra. DAGuE: A generic distributed DAG engine for high performance computing. Parallel Comput., 38(1-2):37--51, 2012. [ DOI ] [ pdf ]
[20]
Emmanuel Agullo, Camille Coti, Thomas Hérault, Julien Langou, Sylvain Peyronnet, Ala Rezmerita, Franck Cappello, and Jack J. Dongarra. QCG-OMPI: MPI applications on grids. Future Gener. Comput. Syst., 27(4):357--369, 2011. [ DOI ] [ pdf ]
[21]
Franck Cappello, Thomas Hérault, and Jack J. Dongarra. Foreword. Parallel Comput., 35(12):571, 2009. [ DOI ]
[22]
Fatiha Bouabache, Thomas Hérault, Gilles Fedak, and Franck Cappello. Hierarchical replication techniques to ensure checkpoint storage reliability in grid environment. J. Interconnect. Networks, 10(4):345--364, 2009. [ DOI ] [ pdf ]
[23]
Darius Buntinas, Camille Coti, Thomas Hérault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols. Future Gener. Comput. Syst., 24(1):73--84, 2008. [ DOI ] [ pdf ]
[24]
Aurélien Bouteiller, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, and Franck Cappello. MPICH-V project: A multiprotocol automatic fault-tolerant MPI. Int. J. High Perform. Comput. Appl., 20(3):319--333, 2006. [ DOI ] [ pdf ]
[25]
Aurélien Bouteiller, Hinde-Lilia Bouziane, Thomas Hérault, Pierre Lemarinier, and Franck Cappello. Hybrid preemptive scheduling of message passing interface applications on grids. Int. J. High Perform. Comput. Appl., 20(1):77--90, 2006. [ DOI ] [ pdf ]
[26]
Franck Cappello, Samir Djilali, Gilles Fedak, Thomas Hérault, Frédéric Magniette, Vincent Néri, and Oleg Lodygensky. Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid. Future Gener. Comput. Syst., 21(3):417--437, 2005. [ DOI ] [ pdf ]

[1]
Joseph Schuchart, Poornima Nookala, Mohammad Mahdi Javanmard, Thomas Hérault, Edward F. Valeev, George Bosilca, and Robert J. Harrison. Generalized flow-graph programming using template task-graphs: Initial implementation and assessment. In 2022 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022, Lyon, France, May 30 - June 3, 2022, pages 839--849. IEEE, 2022. [ DOI ] [ pdf ]
[2]
Anne Benoit, Yishu Du, Thomas Hérault, Loris Marchal, Guillaume Pallez, Lucas Perotin, Yves Robert, Hongyang Sun, and Frédéric Vivien. Checkpointing à la Young/Daly: An overview. In Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, IC3-2022, Noida, India, August 4-6, 2022, pages 701--710. ACM, 2022. [ DOI ] [ pdf ]
[3]
Joseph Schuchart, Poornima Nookala, Thomas Hérault, Edward F. Valeev, and George Bosilca. Pushing the boundaries of small tasks: Scalable low-overhead data-flow programming in TTG. In IEEE International Conference on Cluster Computing, CLUSTER 2022, Heidelberg, Germany, September 5-8, 2022, pages 117--128. IEEE, 2022. [ DOI ] [ pdf ]
[4]
Thomas Hérault, Yves Robert, George Bosilca, Robert J. Harrison, Cannada A. Lewis, Edward F. Valeev, and Jack J. Dongarra. Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure. In 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, OR, USA, May 17-21, 2021, pages 537--546. IEEE, 2021. [ DOI ] [ pdf ]
[5]
Anne Benoit, Thomas Hérault, Valentin Le Fèvre, and Yves Robert. Replication is more efficient than you think. In Michela Taufer, Pavan Balaji, and Antonio J. Peña, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019, Denver, Colorado, USA, November 17-19, 2019, pages 89:1--89:14. ACM, 2019. [ DOI ] [ pdf ]
[6]
George Bosilca, Aurélien Bouteiller, Amina Guermouche, Thomas Hérault, Yves Robert, Pierre Sens, and Jack J. Dongarra. Failure detection and propagation in HPC systems. In John West and Cherri M. Pancake, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13-18, 2016, pages 312--322. IEEE Computer Society, 2016. best paper finalist. [ DOI ] [ pdf ]
[7]
Thomas Hérault, Aurélien Bouteiller, George Bosilca, Marc Gamell, Keita Teranishi, Manish Parashar, and Jack J. Dongarra. Practical scalable consensus for pseudo-synchronous distributed systems. In Jackie Kern and Jeffrey S. Vetter, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 15-20, 2015, pages 31:1--31:12. ACM, 2015. [ DOI ] [ pdf ]
[8]
Atsushi Hori, Kazumi Yoshinaga, Thomas Hérault, Aurélien Bouteiller, George Bosilca, and Yutaka Ishikawa. Sliding substitution of failed nodes. In Jack J. Dongarra, Alexandre Denis, Brice Goglin, Emmanuel Jeannot, and Guillaume Mercier, editors, Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015, Bordeaux, France, September 21-23, 2015, pages 14:1--14:10. ACM, 2015. [ DOI ] [ pdf ]
[9]
Chongxiao Cao, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Design for a soft error resilient dynamic task-based runtime. In 2015 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015, Hyderabad, India, May 25-29, 2015, pages 765--774. IEEE Computer Society, 2015. [ DOI ] [ pdf ]
[10]
Aurélien Bouteiller, Thomas Hérault, and George Bosilca. A multithreaded communication substrate for OpenSHMEM. In Allen D. Malony and Jeff R. Hammond, editors, Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014, Eugene, OR, USA, October 6-10, 2014, pages 16:1--16:2. ACM, 2014. [ DOI ] [ pdf ]
[11]
Thomas Hérault, Julien Herrmann, Loris Marchal, and Yves Robert. Determining the optimal redistribution for a given data partition. In Traian Muntean, Robert Rolland, and Léon Mugwaneza, editors, IEEE 13th International Symposium on Parallel and Distributed Computing, ISPDC 2014, Marseille, France, June 24-27, 2014, pages 95--102. IEEE, 2014. [ DOI ] [ pdf ]
[12]
Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. On the combination of silent error detection and checkpointing. In IEEE 19th Pacific Rim International Symposium on Dependable Computing, PRDC 2013, Vancouver, BC, Canada, December 2-4, 2013, pages 11--20. IEEE Computer Society, 2013. [ DOI ] [ pdf ]
[13]
Aurélien Bouteiller, Franck Cappello, Jack J. Dongarra, Amina Guermouche, Thomas Hérault, and Yves Robert. Multi-criteria checkpointing strategies: Response-time versus resource utilization. In Felix Wolf, Bernd Mohr, and Dieter an Mey, editors, Euro-Par 2013 Parallel Processing - 19th International Conference, Aachen, Germany, August 26-30, 2013. Proceedings, volume 8097 of Lecture Notes in Computer Science, pages 420--431. Springer, 2013. [ DOI ] [ pdf ]
[14]
Wesley Bland, Aurélien Bouteiller, Thomas Hérault, Joshua Hursey, George Bosilca, and Jack J. Dongarra. An evaluation of user-level failure mitigation support in MPI. In Jesper Larsson Träff, Siegfried Benkner, and Jack J. Dongarra, editors, Recent Advances in the Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, September 23-26, 2012. Proceedings, volume 7490 of Lecture Notes in Computer Science, pages 193--203. Springer, 2012. [ DOI ] [ pdf ]
[15]
Peng Du, Aurélien Bouteiller, George Bosilca, Thomas Hérault, and Jack J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. In J. Ramanujam and P. Sadayappan, editors, Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012, pages 225--234. ACM, 2012. [ DOI ] [ pdf ]
[16]
Jack J. Dongarra, Mathieu Faverge, Thomas Hérault, Julien Langou, and Yves Robert. Hierarchical QR factorization algorithms for multi-core cluster systems. In 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25, 2012, pages 607--618. IEEE Computer Society, 2012. [ DOI ] [ pdf ]
[17]
Wesley Bland, Peng Du, Aurélien Bouteiller, Thomas Hérault, George Bosilca, and Jack J. Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. In Christos Kaklamanis, Theodore S. Papatheodorou, and Paul G. Spirakis, editors, Euro-Par 2012 Parallel Processing - 18th International Conference, Euro-Par 2012, Rhodes Island, Greece, August 27-31, 2012. Proceedings, volume 7484 of Lecture Notes in Computer Science, pages 477--488. Springer, 2012. distinguished paper. [ DOI ] [ pdf ]
[18]
George Bosilca, Aurélien Bouteiller, Anthony Danalis, Thomas Hérault, and Jack J. Dongarra. From serial loops to parallel execution on distributed systems. In Christos Kaklamanis, Theodore S. Papatheodorou, and Paul G. Spirakis, editors, Euro-Par 2012 Parallel Processing - 18th International Conference, Euro-Par 2012, Rhodes Island, Greece, August 27-31, 2012. Proceedings, volume 7484 of Lecture Notes in Computer Science, pages 246--257. Springer, 2012. [ DOI ] [ pdf ]
[19]
George Bosilca, Thomas Hérault, Pierre Lemarinier, Ala Rezmerita, and Jack J. Dongarra. Scalable runtime for MPI: efficiently building the communication infrastructure. In Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, and Jack J. Dongarra, editors, Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. Proceedings, volume 6960 of Lecture Notes in Computer Science, pages 342--344. Springer, 2011. [ DOI ] [ pdf ]
[20]
Aurélien Bouteiller, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Correlated set coordination in fault tolerant message logging protocols. In Emmanuel Jeannot, Raymond Namyst, and Jean Roman, editors, Euro-Par 2011 Parallel Processing - 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part II, volume 6853 of Lecture Notes in Computer Science, pages 51--64. Springer, 2011. [ DOI ] [ pdf ]
[21]
George Bosilca, Aurélien Bouteiller, Thomas Hérault, Pierre Lemarinier, Narapat Ohm Saengpatsa, Stanimire Tomov, and Jack J. Dongarra. Performance portability of a GPU enabled factorization with the DAGuE framework. In 2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, TX, USA, September 26-30, 2011, pages 395--402. IEEE Computer Society, 2011. [ DOI ] [ pdf ]
[22]
Teng Ma, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Process distance-aware adaptive MPI collective communications. In 2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, TX, USA, September 26-30, 2011, pages 196--204. IEEE Computer Society, 2011. [ DOI ] [ pdf ]
[23]
George Bosilca, Thomas Hérault, Ala Rezmerita, and Jack J. Dongarra. On scalability for MPI runtime systems. In 2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, TX, USA, September 26-30, 2011, pages 187--195. IEEE Computer Society, 2011. [ DOI ] [ pdf ]
[24]
George Bosilca, Aurélien Bouteiller, Thomas Hérault, Pierre Lemarinier, and Jack J. Dongarra. Dodging the cost of unavoidable memory copies in message logging protocols. In Rainer Keller, Edgar Gabriel, Michael M. Resch, and Jack J. Dongarra, editors, Recent Advances in the Message Passing Interface - 17th European MPI Users' Group Meeting, EuroMPI 2010, Stuttgart, Germany, September 12-15, 2010. Proceedings, volume 6305 of Lecture Notes in Computer Science, pages 189--197. Springer, 2010. [ DOI ] [ pdf ]
[25]
Aline Carneiro Viana, Thomas Hérault, Thomas Largillier, Sylvain Peyronnet, and Fatiha Zaïdi. Supple: a flexible probabilistic data dissemination protocol for wireless sensor networks. In Violet R. Syrotiuk, Fatih Alagöz, Brahim Bensaou, and Özgür B. Akan, editors, Proceedings of the 13th International Symposium on Modeling Analysis and Simulation of Wireless and Mobile Systems, MSWiM 2010, Bodrum, Turkey, October 17-21, 2010, pages 385--392. ACM, 2010. [ DOI ] [ pdf ]
[26]
Gilles Fedak, Jean-Patrick Gelas, Thomas Hérault, Victor Iniesta, Derrick Kondo, Laurent Lefèvre, Paul Malecot, Lucas Nussbaum, Ala Rezmerita, and Olivier Richard. DSL-Lab: A low-power lightweight platform to experiment on domestic broadband internet. In Ninth International Symposium on Parallel and Distributed Computing, ISPDC 2010, Istanbul, Turkey, July 7-9, 2010, pages 141--148. IEEE Computer Society, 2010. [ DOI ] [ pdf ]
[27]
Emmanuel Agullo, Camille Coti, Jack J. Dongarra, Thomas Hérault, and Julien Langou. QR factorization of tall and skinny matrices in a grid computing environment. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Conference Proceedings, pages 1--11. IEEE, 2010. [ DOI ] [ pdf ]
[28]
François Lesueur, Ala Rezmerita, Thomas Hérault, Sylvain Peyronnet, and Sébastien Tixeuil. SAFE-OS: A secure and usable desktop operating system. In CRiSIS 2010, Proceedings of the Fifth International Conference on Risks and Security of Internet and Systems, Montreal, QC, Canada, October 10-13, 2010, pages 1--7. IEEE Computer Society, 2010. [ DOI ] [ pdf ]
[29]
Amine Bourki, Guillaume Chaslot, Matthieu Coulm, Vincent Danjean, Hassen Doghmen, Jean-Baptiste Hoock, Thomas Hérault, Arpad Rimmel, Fabien Teytaud, Olivier Teytaud, Paul Vayssière, and Ziqin Yut. Scalability and parallelization of Monte-Carlo tree search. In H. Jaap van den Herik, Hiroyuki Iida, and Aske Plaat, editors, Computers and Games - 7th International Conference, CG 2010, Kanazawa, Japan, September 24-26, 2010, Revised Selected Papers, volume 6515 of Lecture Notes in Computer Science, pages 48--58. Springer, 2010. [ DOI ] [ pdf ]
[30]
Fatiha Bouabache, Thomas Hérault, Sylvain Peyronnet, and Franck Cappello. Planning large data transfers in institutional grids. In 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGrid 2010, 17-20 May 2010, Melbourne, Victoria, Australia, pages 547--552. IEEE Computer Society, 2010. [ DOI ] [ pdf ]
[31]
George Bosilca, Camille Coti, Thomas Hérault, Pierre Lemarinier, and Jack J. Dongarra. Constructing resiliant communication infrastructure for runtime environments. In Barbara M. Chapman, Frédéric Desprez, Gerhard R. Joubert, Alain Lichnewsky, Frans J. Peters, and Thierry Priol, editors, Parallel Computing: From Multicores and GPU's to Petascale, Proceedings of the conference ParCo 2009, 1-4 September 2009, Lyon, France, volume 19 of Advances in Parallel Computing, pages 441--451. IOS Press, 2009. [ DOI ] [ pdf ]
[32]
Camille Coti, Thomas Hérault, and Franck Cappello. MPI applications on grids: A topology aware approach. In Henk J. Sips, Dick H. J. Epema, and Hai-Xiang Lin, editors, Euro-Par 2009 Parallel Processing, 15th International Euro-Par Conference, Delft, The Netherlands, August 25-28, 2009. Proceedings, volume 5704 of Lecture Notes in Computer Science, pages 466--477. Springer, 2009. [ DOI ] [ pdf ]
[33]
Pavel Bar, Camille Coti, Derek Groen, Thomas Hérault, Valentin Kravtsov, Assaf Schuster, and Martin T. Swain. Running parallel applications with topology-aware grid middleware. In Fifth International Conference on e-Science, e-Science 2009, 9-11 December 2009, Oxford, UK, pages 292--299. IEEE Computer Society, 2009. [ DOI ] [ pdf ]
[34]
Julien Clément, Thomas Hérault, Stéphane Messika, and Olivier Peres. On the complexity of a self-stabilizing spanning tree algorithm for large scale systems. In 14th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2008, 15-17 December 2008, Taipei, Taiwan, pages 48--55. IEEE Computer Society, 2008. [ DOI ] [ pdf ]
[35]
Thomas Hérault, Mathieu Jan, Thomas Largillier, Sylvain Peyronnet, Benjamin Quétier, and Franck Cappello. Emulation platform for high accuracy failure injection in grids. In Wolfgang Gentzsch, Lucio Grandinetti, and Gerhard R. Joubert, editors, High Speed and Large Scale Scientific Computing - Selected Papers from the High Performance Computing Workshop, Cetraro, Italy, June 30 - July 4, 2008, volume 18 of Advances in Parallel Computing, pages 127--140. IOS Press, 2008. [ DOI ] [ pdf ]
[36]
Fatiha Bouabache, Thomas Hérault, Gilles Fedak, and Franck Cappello. Hierarchical replication techniques to ensure checkpoint storage reliability in grid environment. In 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 19-22 May 2008, Lyon, France, pages 475--483. IEEE Computer Society, 2008. [ DOI ] [ pdf ]
[37]
Camille Coti, Thomas Hérault, Sylvain Peyronnet, Ala Rezmerita, and Franck Cappello. Grid services for MPI. In 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 19-22 May 2008, Lyon, France, pages 417--424. IEEE Computer Society, 2008. [ DOI ] [ pdf ]
[38]
Thomas Hérault, Pierre Lemarinier, Olivier Peres, Laurence Pilard, and Joffroy Beauquier. A model for large scale self-stabilization. In 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA, pages 1--10. IEEE, 2007. [ DOI ] [ pdf ]
[39]
Michaël Cadilhac, Thomas Hérault, Richard Lassaigne, Sylvain Peyronnet, and Sébastien Tixeuil. Evaluating complex MAC protocols for sensor networks with APMC. In Stephan Merz and Tobias Nipkow, editors, Proceedings of the 6th International Workshop on Automated Verification of Critical Systems, AVoCS 2006, Nancy, France, September 18-19, 2006, volume 185 of Electronic Notes in Theoretical Computer Science, pages 33--46. Elsevier, 2006. [ DOI ] [ pdf ]
[40]
Camille Coti, Thomas Hérault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. MPI tools and performance studies - blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, November 11-17, 2006, Tampa, FL, USA, page 127. ACM Press, 2006. [ DOI ] [ pdf ]
[41]
Akim Demaille, Thomas Hérault, and Sylvain Peyronnet. Probabilistic verification of sensor networks. In 4th International Conference on Computer Sciences: Research, Innovation and Vision for the Future, February 12-16, 2006, Ho Chi Minh City, Vietnam, pages 45--54. IEEE, 2006. [ DOI ] [ pdf ]
[42]
William Hoarau, Pierre Lemarinier, Thomas Hérault, Eric Rodriguez, Sébastien Tixeuil, and Franck Cappello. FAIL-MPI: how fault-tolerant is fault-tolerant MPI? In Proceedings of the 2006 IEEE International Conference on Cluster Computing, September 25-28, 2006, Barcelona, Spain. IEEE Computer Society, 2006. [ DOI ] [ pdf ]
[43]
Aurélien Bouteiller, Boris Collin, Thomas Hérault, Pierre Lemarinier, and Franck Cappello. Impact of event logger on causal message logging protocols for fault tolerant MPI. In 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), CD-ROM / Abstracts Proceedings, 4-8 April 2005, Denver, CO, USA. IEEE Computer Society, 2005. [ DOI ] [ pdf ]
[44]
Thomas Hérault, Richard Lassaigne, Frédéric Magniette, and Sylvain Peyronnet. Approximate probabilistic model checking. In Bernhard Steffen and Giorgio Levi, editors, Verification, Model Checking, and Abstract Interpretation, 5th International Conference, VMCAI 2004, Venice, Italy, January 11-13, 2004, Proceedings, volume 2937 of Lecture Notes in Computer Science, pages 73--84. Springer, 2004. [ DOI ] [ pdf ]
[45]
Samir Djilali, Thomas Hérault, Oleg Lodygensky, Tangui Morlier, Gilles Fedak, and Franck Cappello. RPC-V: toward fault-tolerant RPC for internet connected desktop grids with volatile nodes. In Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 6-12 November 2004, Pittsburgh, PA, USA, CD-Rom, page 39. IEEE Computer Society, 2004. [ DOI ] [ pdf ]
[46]
Pierre Lemarinier, Aurélien Bouteiller, Thomas Hérault, Géraud Krawezik, and Franck Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), September 20-23 2004, San Diego, California, USA, pages 115--124. IEEE Computer Society, 2004. [ DOI ] [ pdf ]
[47]
Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, and Frédéric Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 15-21 November 2003, Phoenix, AZ, USA, CD-Rom, page 25. ACM, 2003. [ DOI ] [ pdf ]
[48]
Joffroy Beauquier and Thomas Hérault. Fault-local stabilization: The shortest path tree. In 21st Symposium on Reliable Distributed Systems (SRDS 2002), 13-16 October 2002, Osaka, Japan, pages 62--69. IEEE Computer Society, 2002. [ DOI ] [ pdf ]
[49]
George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, and Anton Selikhov. MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In Roscoe C. Giles, Daniel A. Reed, and Kathryn Kelley, editors, Proceedings of the 2002 ACM/IEEE conference on Supercomputing, Baltimore, Maryland, USA, November 16-22, 2002, CD-ROM, pages 31:1--31:18. IEEE Computer Society, 2002. [ DOI ] [ pdf ]

[1]
Qinglei Cao, Thomas Herault, Aurelien Bouteiller, Joseph Schuchart, and George Bosilca. Evaluating Cholesky over PaRSEC in scientific applications. In Workshop on Asynchronous Many-Task Systems and Applications 2024, 2024. to be published.
[3]
Quentin Barbut, Anne Benoit, Thomas Hérault, Yves Robert, and Frédéric Vivien. When to checkpoint at the end of a fixed-length reservation? In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W 2023, Denver, CO, USA, November 12-17, 2023, pages 466--476. ACM, 2023. [ DOI ] [ pdf ]
[4]
Thomas Hérault, Joseph Schuchart, Edward F. Valeev, and George Bosilca. Composition of algorithmic building blocks in template task graphs. In IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X, PAW-ATM@SC 2022, Dallas, TX, USA, November 13-18, 2022, pages 26--38. IEEE, 2022. [ DOI ] [ pdf ]
[5]
George Bosilca, Aurélien Bouteiller, Thomas Hérault, Valentin Le Fèvre, Yves Robert, and Jack J. Dongarra. Revisiting credit distribution algorithms for distributed termination detection. In IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2021, Portland, OR, USA, June 17-21, 2021, pages 611--620. IEEE, 2021. [ DOI ] [ pdf ]
[6]
George Bosilca, Robert J. Harrison, Thomas Hérault, Mohammad Mahdi Javanmard, Poornima Nookala, and Edward F. Valeev. The template task graph (TTG) - an emerging practical dataflow programming paradigm for scientific simulation at extreme scale. In 5th IEEE/ACM International Workshop on Extreme Scale Programming Models and Middleware, ESPM2@SC 2020, Atlanta, GA, USA, November 11, 2020, pages 1--7. IEEE, 2020. [ DOI ] [ pdf ]
[7]
Valentin Le Fèvre, Thomas Hérault, Julien Langou, and Yves Robert. A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. In Bartosz Balis, Dora B. Heras, Laura Antonelli, Andrea Bracciali, Thomas Gruber, Jin Hyun-Wook, Michael Kuhn, Stephen L. Scott, Didem Unat, and Roman Wyrzykowski, editors, Euro-Par 2020: Parallel Processing Workshops - Euro-Par 2020 International Workshops, Warsaw, Poland, August 24-25, 2020, Revised Selected Papers, volume 12480 of Lecture Notes in Computer Science, pages 303--315. Springer, 2020. [ DOI ] [ pdf ]
[8]
Thomas Hérault, Yves Robert, George Bosilca, and Jack J. Dongarra. Generic matrix multiplication for multi-GPU accelerated distributed-memory platforms over PaRSEC. In 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA@SC 2019, Denver, CO, USA, November 18, 2019, pages 33--41. IEEE, 2019. [ DOI ] [ pdf ]
[9]
Qinglei Cao, Yu Pei, Thomas Hérault, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David E. Keyes, and Jack J. Dongarra. Performance analysis of tile low-rank Cholesky factorization using PaRSEC instrumentation tools. In IEEE/ACM International Workshop on Programming and Performance Visualization Tools, ProTools@SC 2019, Denver, CO, USA, November 17, 2019, pages 25--32. IEEE, 2019. [ DOI ] [ pdf ]
[10]
Anthony Danalis, Heike Jagode, Thomas Hérault, Piotr Luszczek, and Jack J. Dongarra. Software-defined events through PAPI. In IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019, Rio de Janeiro, Brazil, May 20-24, 2019, pages 363--372. IEEE, 2019. [ DOI ] [ pdf ]
[11]
Thomas Hérault, Yves Robert, Aurélien Bouteiller, Dorian C. Arnold, Kurt B. Ferreira, George Bosilca, and Jack J. Dongarra. Optimal cooperative checkpointing for shared high-performance computing platforms. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2018, Vancouver, BC, Canada, May 21-25, 2018, pages 803--812. IEEE Computer Society, 2018. [ DOI ] [ pdf ]
[12]
Valentin Le Fèvre, George Bosilca, Aurélien Bouteiller, Thomas Hérault, Atsushi Hori, Yves Robert, and Jack J. Dongarra. Do moldable applications perform better on failure-prone HPC platforms? In Gabriele Mencagli, Dora B. Heras, Valeria Cardellini, Emiliano Casalicchio, Emmanuel Jeannot, Felix Wolf, Antonio Salis, Claudio Schifanella, Ravi Reddy Manumachu, Laura Ricci, Marco Beccuti, Laura Antonelli, José Daniel García Sánchez, and Stephen L. Scott, editors, Euro-Par 2018: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers, volume 11339 of Lecture Notes in Computer Science, pages 787--799. Springer, 2018. [ DOI ] [ pdf ]
[13]
Reazul Hoque, Thomas Hérault, George Bosilca, and Jack J. Dongarra. Dynamic task discovery in PaRSEC: a data-flow task-based runtime. In Vassil Alexandrov, Al Geist, and Jack J. Dongarra, editors, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA@SC 2017, Denver, CO, USA, November 13, 2017, pages 6:1--6:8. ACM, 2017. [ DOI ] [ pdf ]
[14]
Chunyan Tang, Aurélien Bouteiller, Thomas Hérault, Manjunath Gorentla Venkata, and George Bosilca. From MPI to OpenSHMEM: Porting LAMMPS. In Manjunath Gorentla Venkata, Pavel Shamis, Neena Imam, and M. Graham Lopez, editors, OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies - Second Workshop, OpenSHMEM 2015, Annapolis, MD, USA, August 4-6, 2015. Revised Selected Papers, volume 9397 of Lecture Notes in Computer Science, pages 121--137. Springer, 2015. [ DOI ] [ pdf ]
[15]
Anthony Danalis, George Bosilca, Aurélien Bouteiller, Thomas Hérault, and Jack J. Dongarra. PTG: an abstraction for unhindered parallelism. In Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC '14, New Orleans, Louisiana, USA, November 16-21, 2014, pages 21--30. IEEE Computer Society, 2014. [ DOI ] [ pdf ]
[16]
George Bosilca, Aurélien Bouteiller, Thomas Hérault, Yves Robert, and Jack J. Dongarra. Assessing the impact of ABFT and checkpoint composite strategies. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Phoenix, AZ, USA, May 19-23, 2014, pages 679--688. IEEE Computer Society, 2014. [ DOI ] [ pdf ]
[17]
Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, and Jack J. Dongarra. Optimal checkpointing period: Time vs. energy. In Stephen A. Jarvis, Steven A. Wright, and Simon D. Hammond, editors, High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation - 4th International Workshop, PMBS 2013, Denver, CO, USA, November 18, 2013. Revised Selected Papers, volume 8551 of Lecture Notes in Computer Science, pages 203--214. Springer, 2013. [ DOI ] [ pdf ]
[18]
Jack J. Dongarra, Thomas Hérault, and Yves Robert. Revisiting the double checkpointing algorithm. In 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, Cambridge, MA, USA, May 20-24, 2013, pages 706--715. IEEE, 2013. [ DOI ] [ pdf ]
[19]
George Bosilca, Aurélien Bouteiller, Anthony Danalis, Mathieu Faverge, Azzam Haidar, Thomas Hérault, Jakub Kurzak, Julien Langou, Pierre Lemarinier, Hatem Ltaief, Piotr Luszczek, Asim YarKhan, and Jack J. Dongarra. Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA. In 25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, Alaska, USA, 16-20 May 2011 - Workshop Proceedings, pages 1432--1441. IEEE, 2011. [ DOI ] [ pdf ]
[20]
George Bosilca, Aurélien Bouteiller, Anthony Danalis, Thomas Hérault, Pierre Lemarinier, and Jack J. Dongarra. DAGuE: A generic distributed DAG engine for high performance computing. In 25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, Alaska, USA, 16-20 May 2011 - Workshop Proceedings, pages 1151--1158. IEEE, 2011. [ DOI ] [ pdf ]
[21]
Fatiha Bouabache, Thomas Hérault, Gilles Fedak, and Franck Cappello. A distributed and replicated service for checkpoint storage. In Marco Danelutto, Paraskevi Fragopoulou, and Vladimir Getov, editors, Making Grids Work: Proceedings of the CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, Tools and Environments, 12-13 June 2007, Heraklion, Crete, Greece, pages 295--306. Springer, 2007. [ DOI ] [ pdf ]
[22]
Guillaume Guirado, Thomas Hérault, Richard Lassaigne, and Sylvain Peyronnet. Distribution, approximation and probabilistic model checking. In Martin Leucker and Jaco van de Pol, editors, Proceedings of the 4th International Workshop on Parallel and Distributed Methods in Verification, PDMC@ICALP 2005, Lisbon, Portugal, July 10, 2005, volume 135 of Electronic Notes in Theoretical Computer Science, pages 19--30. Elsevier, 2005. [ DOI ] [ pdf ]
[23]
Marie Duflot, Laurent Fribourg, Thomas Hérault, Richard Lassaigne, Frédéric Magniette, Stéphane Messika, Sylvain Peyronnet, and Claudine Picaronny. Probabilistic model checking of the CSMA/CD protocol using PRISM and APMC. In Michael Huth, editor, Proceedings of the Fourth International Workshop on Automated Verification of Critical Systems, AVoCS 2004, London, UK, September 4, 2004, volume 128 of Electronic Notes in Theoretical Computer Science, pages 195--214. Elsevier, 2004. [ DOI ] [ pdf ]
[24]
Aurélien Bouteiller, Hinde-Lilia Bouziane, Thomas Hérault, Pierre Lemarinier, and Franck Cappello. Hybrid preemptive scheduling of MPI applications on the grids. In Rajkumar Buyya, editor, 5th International Workshop on Grid Computing (GRID 2004), 8 November 2004, Pittsburgh, PA, USA, Proceedings, pages 130--137. IEEE Computer Society, 2004. [ DOI ] [ pdf ]
[25]
Joffroy Beauquier, Thomas Hérault, and Elad Schiller. Easy stabilization with an agent. In Ajoy Kumar Datta and Ted Herman, editors, Self-Stabilizing Systems, 5th International Workshop, WSS 2001, Lisbon, Portugal, October 1-2, 2001, Proceedings, volume 2194 of Lecture Notes in Computer Science, pages 35--50. Springer, 2001. [ DOI ] [ pdf ]

[1]
Atsushi Hori, Yuichi Tsujita, Akio Shimada, Kazumi Yoshinaga, Namiki Mitaro, Go Fukazawa, Mikiko Sato, George Bosilca, Aurélien Bouteiller, and Thomas Herault. Advanced Software Technologies for Post-Peta Scale Computing, chapter System Software for Many-Core and Multi-core Architecture: The Japanese Post-Peta CREST Research Project, pages 59--75. Springer, 2019.
[2]
Jack Dongarra, Thomas Herault, and Yves Robert. Fault Tolerance Techniques for High-Performance Computing, chapter Fault Tolerance Techniques for High-Performance Computing, pages 3--85. Computer Communications and Networks. Springer, 2015. [ pdf ]
[3]
Thomas Herault and Yves Robert, editors. Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, 2015. [ DOI ]
[4]
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Piotr Luszczek, and Jack J. Dongarra. Scalable Computing and Communications: Theory and Practice, chapter Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach, pages 699--735. John Wiley & Sons, 2013. [ pdf ]
[5]
Krzysztof Kurowski, Bartosz Bosak, Tomasz Piontek, Piotr Grabowski, Mariusz Mamonski, George Kampis, Laszlo Gulyas, Camille Coti, Thomas Herault, and Franck Cappello. QosCosGrid E-Science Infrastructure, chapter Large-Scale Computing Techniques for Complex System Simulations. John Wiley & Sons, edited by Werner Dubitzky, Krzysztof Kurowski, and Bernard Schott, 2011. [ pdf ]
[6]
Christophe Cérin, Jean-Christophe Dubacq, Thomas Hérault, Ronan Keryell, Jean-Louis Pazat, Jean-Louis Roch, and Sébastien Varrette. Systèmes répartis en action: De l'embarqué aux systèmes à large échelle, chapter Sécurité dans les grilles de calcul. Hermes Science Publications, edited by Laurent Pautet, Fabrice Kordon, and Laure Petrucci, 2008. [ pdf ]

[1]
Thomas Herault, Aurelien Bouteiller, Joseph Schuchart, Qinglei Cao, and George Bosilca. PaRSEC: Scalability, flexibility, and hybrid architecture support for task-based applications in ECP. Int. J. High Perform. Comput. Appl., 2024. under review.
[2]
Leonardo Bautista-Gomez, Anne Benoit, Sheng Di, Thomas Herault, Yves Robert, and Hongyang Sun. A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? Future Gener. Comput. Syst., 2024. under review.
[3]
Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. On the combination of silent error detection and checkpointing. CoRR, abs/1310.8486, 2013. [ arXiv | http ]
[4]
Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, and Jack J. Dongarra. Optimal checkpointing period: Time vs. energy. CoRR, abs/1310.8456, 2013. [ arXiv | http ]
[5]
Jack J. Dongarra, Mathieu Faverge, Thomas Hérault, Julien Langou, and Yves Robert. Hierarchical QR factorization algorithms for multi-core cluster systems. CoRR, abs/1110.1553, 2011. [ arXiv | http ]
[6]
Emmanuel Agullo, Camille Coti, Jack J. Dongarra, Thomas Hérault, and Julien Langou. QR factorization of tall and skinny matrices in a grid computing environment. CoRR, abs/0912.2572, 2009. [ arXiv | http ]
[7]
Thomas Hérault and Pierre Lemarinier. A rollback-recovery protocol on peer to peer systems. In Modeling and Verification of Parallel Processes, 5th Summer School, Nantes, June 17, 2002.

Editing Roles

  • Associate Editor of the Journal of Parallel and Distributed Computing, since 2022

Program Committee Chairing

  • High Performance Computing (HiPC) 2022, vice-chair, track 'algorithms'
  • International Parallel and Distributed Processing Symposium (IPDPS) 2022, Track chair, track 'System'
  • International Conference on Parallel Processing (ICPP) 2017, PC Vice-Chair, track 'Systems'
  • International Conference on Parallel and Distributed Systems (ICPADS) 2015, PC Vice-Chair, track 'multicore computing'
  • High Performance Computing (HiPC) 2014, PC chair
  • High Performance Computing (HiPC) 2013, PC vice-chair, track 'Software'
  • EuroPVM/MPI 2007, PC co-chair

Conference Organizing

  • Workshop on Asynchronous Many-Task Systems and Applications (WAMTA) 2024, local organization co-chair
  • EuroPVM/MPI 2007, local organizer co-chair

Program Committee Participation

  • Supercomputing (SC) 2024 (area Algorithms)
  • International Parallel and Distributed Processing Symposium (IPDPS) 2024, (track Algorithms for Computational Science)
  • Supercomputing (SC) 2023 (Student Cluster Competition, ACM Undergraduate Poster)
  • High Performance Computing (HiPC) 2021 (track algorithms)
  • Supercomputing (SC) 2021 (area Algorithms)
  • International Conference on Parallel Processing (ICPP) 2021 (track software)
  • International Parallel and Distributed Processing Symposium (IPDPS) 2021 (track algorithms)
  • Workshop on Advances in Parallel and Distributed Computational Models (APDCM) 2021
  • International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2019
  • IEEE International Conference on Cluster Computing (Cluster) 2019
  • Advances in Parallel and Distributed Computational Models (APDCM) 2019
  • Supercomputing (SC) 2019 (area Algorithms)
  • International Conference on Parallel Processing (ICPP) 2019 (track Applications)
  • Supercomputing (SC) 2018 (Best poster Selection, Posters, Workshop selection)
  • Supercomputing (SC) 2017 (area Algorithms)
  • High Performance Computing (HiPC) 2016 – 2017
  • International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2016
  • Supercomputing (SC) 2016 (area Algorithms)
  • International Parallel & Distributed Processing Symposium (IPDPS) 2014
  • Cloud Computing 2013
  • High Performance Computing (HiPC) 2012
  • Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Heteropar) 2011
  • IEEE International Conference on Cluster Computing (Cluster) 2010
  • IEEE/ACM International Symposium on Cluster, Cloud and Grid (CCGRID) 2009 – 2010
  • Facing the Multicore-Challenge 2010 – 2012
  • IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) 2009
  • International Symposium on Parallel and Distributed Computing (ISPDC) 2008 – 2010
  • High Performance Computing for Computational Science (VECPAR) 2008, 2010, 2012
  • European PVM/MPI Users Group Meeting (EuroPVM/MPI), now EuroMPI 2008 – 2018

Journal Reviewing

  • Communications of the ACM (CACM);
  • Future Generation Computer Systems (FGCS);
  • Technique et Science Informatiques (TSI);
  • Journal of Aerospace Computing, Information and Communication;
  • Parallel Processing Letters (PPL);
  • International Journal of High Performance Computing Applications (IJHPCA);
  • International Journal of High Performance Computing and Networking (IJHPCN);
  • Journal of Computational Science;
  • Journal of Parallel and Distributed Computing (JPDC);
  • Simulation Modelling Practice and Theory;
  • Journal of Supercomputing;
  • Theoretical Computer Science (TCS);
  • Journal of High Performance Computing;
  • IEEE’s Transactions on Dependable and Secure Computing (TPDS)