Fragment

A DNN fault tolerance framework that eliminates training stalls during checkpointing by allowing controlled relaxation of model consistency, targeting zero-overhead fault tolerance for large-scale training.

ai-systems in-progress
c python pytorch cuda

Fragment

Fragment addresses a fundamental tension in DNN training: checkpointing is necessary for fault tolerance, but it stalls training while model state is persisted. Fragment relaxes checkpoint consistency requirements to eliminate these stalls entirely.
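The core idea can be sketched in a few lines. This is a hypothetical illustration, not Fragment's actual implementation: a background thread persists a snapshot of model state while the training loop keeps updating it, so training never stalls on checkpoint I/O, at the cost of a snapshot that may mix values from adjacent steps.

```python
# Hypothetical sketch of relaxed-consistency checkpointing (not Fragment's
# real code): the writer thread copies live state while training mutates it.
import copy
import threading

class RelaxedCheckpointer:
    """Persists checkpoints in a background thread.

    The training loop pays nothing for the write; because training keeps
    mutating the state concurrently, the persisted copy may be slightly
    inconsistent across parameters -- the controlled relaxation.
    """

    def __init__(self):
        self.saved = None      # stands in for durable storage
        self._worker = None

    def checkpoint(self, state):
        # Non-blocking: hand the live state to a writer thread and return.
        self._worker = threading.Thread(target=self._persist, args=(state,))
        self._worker.start()

    def _persist(self, state):
        # Snapshot while training is still running (relaxed consistency).
        self.saved = copy.deepcopy(state)

    def wait(self):
        if self._worker is not None:
            self._worker.join()

# Toy training loop: parameters are plain floats for illustration.
ckpt = RelaxedCheckpointer()
params = {"w": 0.0, "b": 0.0}
for step in range(100):
    params["w"] += 0.01        # stand-in for a gradient update
    params["b"] -= 0.005
    if step == 50:
        ckpt.checkpoint(params)  # training does not stall here
ckpt.wait()
```

A real system would bound how stale the snapshot can get and record enough metadata to restore a recoverable state; the sketch only shows why the training loop itself never blocks.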

Key Results

  • Eliminates training stalls during checkpointing by allowing controlled model inconsistency
  • Novel fault-tolerance scheme that preserves recovery guarantees while avoiding synchronization overhead
  • Under review for publication in JMLR