%0 Journal Article
%J International Journal of Networking and Computing
%D 2019
%T Checkpointing Strategies for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X Input/output (I/O) from various sources often contend for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
%B International Journal of Networking and Computing
%V 9
%P 28–52
%G eng
%U http://www.ijnc.org/index.php/ijnc/article/view/195

%0 Conference Paper
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award
%D 2018
%T Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms
%A Thomas Herault
%A Yves Robert
%A Aurelien Bouteiller
%A Dorian Arnold
%A Kurt Ferreira
%A George Bosilca
%A Jack Dongarra
%X In high-performance computing environments, input/output (I/O) from various sources often contend for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
%B 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng
%R 10.1109/IPDPSW.2018.00127

%0 Journal Article
%J Concurrency: Practice and Experience
%D 2002
%T Innovations of the NetSolve Grid Computing System
%A Dorian Arnold
%A Henri Casanova
%A Jack Dongarra
%K netsolve
%B Concurrency: Practice and Experience
%V 14
%P 1457-1479
%8 2002-01
%G eng

%0 Journal Article
%J Parallel Computing
%D 2002
%T Middleware for the Use of Storage in Communication
%A Micah Beck
%A Dorian Arnold
%A Alessandro Bassi
%A Francine Berman
%A Henri Casanova
%A Jack Dongarra
%A Terry Moore
%A Graziano Obertelli
%A James Plank
%A Martin Swany
%A Sathish Vadhiyar
%A Rich Wolski
%K netsolve
%B Parallel Computing
%V 28
%P 1773-1788
%8 2002-08
%G eng

%0 Generic
%D 2002
%T Users' Guide to NetSolve v1.4.1
%A Sudesh Agrawal
%A Dorian Arnold
%A Susan Blackford
%A Jack Dongarra
%A Michelle Miller
%A Kiran Sagi
%A Zhiao Shi
%A Keith Seymour
%A Sathish Vadhiyar
%K netsolve
%B ICL Technical Report
%8 2002-06
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 2001
%T On the Convergence of Computational and Data Grids
%A Dorian Arnold
%A Sathish Vadhiyar
%A Jack Dongarra
%K netsolve
%B Parallel Processing Letters
%V 11
%P 187-202
%8 2001-01
%G eng

%0 Journal Article
%J submitted to SC2001
%D 2001
%T Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication
%A Micah Beck
%A Dorian Arnold
%A Alessandro Bassi
%A Francine Berman
%A Henri Casanova
%A Jack Dongarra
%A Terry Moore
%A Graziano Obertelli
%A James Plank
%A Martin Swany
%A Sathish Vadhiyar
%A Rich Wolski
%K netsolve
%B submitted to SC2001
%C Denver, Colorado
%8 2001-11
%G eng

%0 Conference Proceedings
%B Department of Defense Users' Group Conference (to appear)
%D 2001
%T Metacomputing Support for the SARA3D Structural Acoustics Application
%A Shirley Moore
%A Dorian Arnold
%A David Cronk
%K netsolve
%B Department of Defense Users' Group Conference (to appear)
%C Biloxi, Mississippi
%8 2001-06
%G eng

%0 Conference Proceedings
%B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications
%D 2000
%T Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications
%A Dorian Arnold
%A Jack Dongarra
%K netsolve
%B to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications
%C Ottawa, Canada
%8 2000-10
%G eng

%0 Conference Proceedings
%B 2000 International Conference on Parallel Processing (ICPP-2000)
%D 2000
%T The NetSolve Environment: Progressing Towards the Seamless Grid
%A Dorian Arnold
%A Jack Dongarra
%K netsolve
%B 2000 International Conference on Parallel Processing (ICPP-2000)
%C Toronto, Canada
%8 2000-08
%G eng

%0 Journal Article
%J ASTC-HPC 2000
%D 2000
%T Providing Infrastructure and Interface to High Performance Applications in a Distributed Setting
%A Dorian Arnold
%A Wonsuck Lee
%A Jack Dongarra
%A Mary Wheeler
%B ASTC-HPC 2000
%C Washington, DC
%8 2000-04
%G eng

%0 Conference Proceedings
%B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing
%D 2000
%T Request Sequencing: Optimizing Communication for the Grid
%A Dorian Arnold
%A Dieter Bachmann
%A Jack Dongarra
%K netsolve
%B Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing
%C (Germany: Springer Verlag 2000)
%P V1900,1213-1222
%8 2000-01
%G eng

%0 Conference Proceedings
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation
%D 2000
%T Seamless Access to Adaptive Solver Algorithms
%A Dorian Arnold
%A Susan Blackford
%A Jack Dongarra
%A Victor Eijkhout
%A Tinghua Xu
%K netsolve
%B Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation
%C Lausanne, Switzerland
%8 2000-08
%G eng

%0 Generic
%D 2000
%T Secure Remote Access to Numerical Software and Computation Hardware
%A Dorian Arnold
%A Shirley Browne
%A Jack Dongarra
%A Graham Fagg
%A Keith Moore
%B University of Tennessee Computer Science Technical Report, UT-CS-00-446
%8 2000-07
%G eng

%0 Conference Proceedings
%B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000
%D 2000
%T Secure Remote Access to Numerical Software and Computational Hardware
%A Dorian Arnold
%A Shirley Browne
%A Jack Dongarra
%A Graham Fagg
%A Keith Moore
%K netsolve
%B Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000
%C Albuquerque, NM
%8 2000-06
%G eng