The Cross-layer Application-Aware Resilience at Extreme Scale (CAARES) project, a collaborative effort among ICL, Rutgers University, and Stony Brook University, aims to provide a theoretical foundation for multi-level fault management and a clear understanding of the obstacles that stand in the way of generic, efficient approaches to fault management at scale. This effort is vital for large-scale science: as extreme-scale computational power enables new and important discoveries across all science domains, the current understanding of fault rates points to a future where failures are not the exception but the norm.
By studying fault tolerance techniques in combination rather than in isolation, CAARES seizes the opportunity to identify moldable techniques at the frontier of known approaches, a composition of methodologies that inherits their individual benefits without exhibiting their drawbacks, and techniques that bridge the gap between fault tolerance ergonomics and efficiency.
In Collaboration With
- Rutgers University
- Stony Brook University