%0 Journal Article %J Parallel Computing %D 2018 %T PMIx: Process Management for Exascale Environments %A Ralph Castain %A Joshua Hursey %A Aurelien Bouteiller %A David Solt %B Parallel Computing %V 79 %P 9–29 %8 2018-01 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S0167819118302424https://api.elsevier.com/content/article/PII:S0167819118302424?httpAccept=text/xmlhttps://api.elsevier.com/content/article/PII:S0167819118302424?httpAccept=text/plain %! Parallel Computing %R 10.1016/j.parco.2018.08.002 %0 Conference Proceedings %B Proceedings of the 24th European MPI Users' Group Meeting %D 2017 %T PMIx: Process Management for Exascale Environments %A Castain, Ralph H. %A David Solt %A Joshua Hursey %A Aurelien Bouteiller %X High-Performance Computing (HPC) applications have historically executed in static resource allocations, using programming models that ran independently from the resident system management stack (SMS). Achieving exascale performance that is both cost-effective and fits within site-level environmental constraints will, however, require that the application and SMS collaboratively orchestrate the flow of work to optimize resource utilization and compensate for on-the-fly faults. The Process Management Interface - Exascale (PMIx) community is committed to establishing scalable workflow orchestration by defining an abstract set of interfaces by which not only applications and tools can interact with the resident SMS, but also the various SMS components can interact with each other. This paper presents a high-level overview of the goals and current state of the PMIx standard, and lays out a roadmap for future directions. %B Proceedings of the 24th European MPI Users' Group Meeting %S EuroMPI '17 %I ACM %C New York, NY, USA %P 14:1–14:10 %@ 978-1-4503-4849-2 %G eng %U http://doi.acm.org/10.1145/3127024.3127027 %R 10.1145/3127024.3127027 %0 Journal Article %J Computing %D 2013 %T An evaluation of User-Level Failure Mitigation support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %K Fault tolerance %K MPI %K User-level fault mitigation %X As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures. %B Computing %V 95 %P 1171-1184 %8 2013-12 %G eng %N 12 %R 10.1007/s00607-013-0331-3 %0 Conference Proceedings %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %D 2012 %T An Evaluation of User-Level Failure Mitigation Support in MPI %A Wesley Bland %A Aurelien Bouteiller %A Thomas Herault %A Joshua Hursey %A George Bosilca %A Jack Dongarra %B Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012 %I Springer %C Vienna, Austria %8 2012-09 %G eng