Fragment

A DNN fault tolerance framework that eliminates training stalls during checkpointing by allowing controlled relaxation of model consistency, targeting zero-overhead fault tolerance for large-scale training.

ai-systems in-progress
c python pytorch cuda

Fragment

Fragment addresses a fundamental tension in DNN training: checkpointing is necessary for fault tolerance, but it stalls training while model state is persisted. Fragment relaxes checkpoint consistency requirements to eliminate these stalls entirely.
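The core idea can be sketched in a few lines. This is a hypothetical illustration, not Fragment's actual implementation: a background thread persists a snapshot of model state while the training loop keeps updating it, so training never stalls on checkpoint I/O, at the cost of a snapshot that may mix values from adjacent steps.

```python
# Hypothetical sketch of relaxed-consistency checkpointing (not Fragment's
# real code): the writer thread copies live state while training mutates it.
import copy
import threading

class RelaxedCheckpointer:
    """Persists checkpoints in a background thread.

    The training loop pays nothing for the write; because training keeps
    mutating the state concurrently, the persisted copy may be slightly
    inconsistent across parameters -- the controlled relaxation.
    """

    def __init__(self):
        self.saved = None      # stands in for durable storage
        self._worker = None

    def checkpoint(self, state):
        # Non-blocking: hand the live state to a writer thread and return.
        self._worker = threading.Thread(target=self._persist, args=(state,))
        self._worker.start()

    def _persist(self, state):
        # Snapshot while training is still running (relaxed consistency).
        self.saved = copy.deepcopy(state)

    def wait(self):
        if self._worker is not None:
            self._worker.join()

# Toy training loop: parameters are plain floats for illustration.
ckpt = RelaxedCheckpointer()
params = {"w": 0.0, "b": 0.0}
for step in range(100):
    params["w"] += 0.01        # stand-in for a gradient update
    params["b"] -= 0.005
    if step == 50:
        ckpt.checkpoint(params)  # training does not stall here
ckpt.wait()
```

A real system would bound how stale the snapshot can get and record enough metadata to restore a recoverable state; the sketch only shows why the training loop itself never blocks.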

Key Results

  • Eliminates training stalls during checkpointing by allowing controlled model inconsistency
  • Novel fault-tolerance scheme that preserves recovery guarantees while avoiding synchronization overhead
  • Under review for publication in JMLR