Coping with silent and fail-stop errors at scale by combining replication and checkpointing