Fragment
A DNN fault tolerance framework that eliminates training stalls during checkpointing by allowing controlled relaxation of model consistency, targeting zero-overhead fault tolerance for large-scale training.
ai-systems in-progress
c python pytorch cuda
Fragment addresses a fundamental tension in DNN training: checkpointing is necessary for fault tolerance, but it stalls training while model state is persisted. Fragment relaxes checkpoint consistency requirements to eliminate these stalls entirely.
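The core idea can be illustrated with a small sketch. This is a hypothetical stand-in for Fragment's actual mechanism (the `RelaxedCheckpointer` class and the plain-dict "model" below are invented for illustration): a background thread snapshots model state while the training loop keeps mutating it, with no lock and no barrier, so the saved checkpoint may mix parameter values from adjacent steps — the controlled inconsistency Fragment permits in exchange for zero stall time.

```python
import json
import threading
import time

class RelaxedCheckpointer:
    """Illustrative stall-free checkpointer (not Fragment's real API).

    A daemon thread copies and persists model state concurrently with
    training. Because there is no synchronization with the trainer, the
    snapshot may combine values from different training steps.
    """

    def __init__(self, params, path, interval=0.01):
        self.params = params      # shared dict, mutated by the trainer
        self.path = path
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            # No lock, no training barrier: the trainer keeps updating
            # `params` while we copy, so the snapshot may be torn across
            # steps -- the relaxed consistency the scheme tolerates.
            snapshot = dict(self.params)
            with open(self.path, "w") as f:
                json.dump(snapshot, f)
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

def train(params, steps):
    for _ in range(steps):
        for k in params:
            params[k] += 1.0  # stand-in for a gradient update

params = {"w0": 0.0, "w1": 0.0}
ckpt = RelaxedCheckpointer(params, "ckpt.json")
ckpt.start()
train(params, 10_000)   # never pauses for checkpointing
ckpt.stop()

with open("ckpt.json") as f:
    restored = json.load(f)
```

The trainer never blocks on persistence; the cost is that `restored` may not correspond to any single training step, which is exactly the consistency relaxation a recovery procedure must be designed to absorb.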
Key Results
- Eliminates training stalls during checkpointing by allowing controlled model inconsistency
- Novel fault-tolerance scheme that preserves recovery guarantees while avoiding synchronization overhead
- Under review for publication in JMLR
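One way to picture how recovery guarantees can survive without synchronization is barrier-free sharded checkpointing. The sketch below is an assumption-laden illustration, not Fragment's published scheme: each worker persists its shard independently, tagged with its local step, and recovery measures the resulting step skew and resumes from the oldest shard's step so progress is never overstated. The function names (`save_shard`, `recover`) and the per-rank file layout are invented for this example.

```python
import json
import os
import tempfile

def save_shard(dirpath, rank, step, shard):
    # Each worker persists its shard on its own schedule -- no global
    # barrier, so saved steps may differ across ranks.
    with open(os.path.join(dirpath, f"shard_{rank}.json"), "w") as f:
        json.dump({"step": step, "data": shard}, f)

def recover(dirpath, world_size):
    """Load all shards and report the bounded inconsistency.

    Returns (resume_step, skew, shards): resume from the oldest saved
    step, so every shard is at least that fresh; `skew` is the step
    spread the recovery procedure must tolerate.
    """
    records = []
    for rank in range(world_size):
        with open(os.path.join(dirpath, f"shard_{rank}.json")) as f:
            records.append(json.load(f))
    steps = [r["step"] for r in records]
    return min(steps), max(steps) - min(steps), [r["data"] for r in records]

# Three workers checkpoint at slightly different local steps.
ckpt_dir = tempfile.mkdtemp()
save_shard(ckpt_dir, 0, 102, [0.1, 0.2])
save_shard(ckpt_dir, 1, 100, [0.3, 0.4])
save_shard(ckpt_dir, 2, 101, [0.5, 0.6])

resume_step, skew, shards = recover(ckpt_dir, world_size=3)
```

Here recovery succeeds despite the ranks having checkpointed steps 100 through 102; bounding and accounting for that skew is what lets a relaxed-consistency scheme keep its recovery guarantee without a synchronization barrier.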