Evolve, a collaborative effort between ICL and the University of Houston, expands the capabilities of Open MPI to support the NSF’s critical software infrastructure missions. Core challenges include: extending the software to scale to 10,000–100,000 processes; ensuring support for accelerators; enabling highly asynchronous execution of communication and I/O operations; and ensuring resilience. Part of the effort involves careful consideration of modifications to the MPI specification to account for the emerging needs of application developers on future extreme-scale systems.
In 2017, Evolve focused on exploratory research into improving the performance of multithreaded programs that use MPI. Event-based collective operations were investigated and demonstrated a clear advantage in aggregate bandwidth on heterogeneous (shared-memory + network) systems. A release of User-Level Failure Mitigation (ULFM) fault tolerance, based on the latest Open MPI, was made available. Counters and performance profiling of internal Open MPI events are now exposed, which has enabled the team to discover and eliminate important performance limitations in the MPI implementation.
In Collaboration With
- University of Houston