LATTICE: Efficient In-Memory DNN Model Versioning
May 31, 2025
Deep learning training generates hundreds of model checkpoints over the course of a single run. These checkpoints serve critical purposes — fault recovery, hyperparameter tuning, model selection, and debugging. Yet writing them to disk remains one of the most expensive operations in the training pipeline.
The Storage Bottleneck
Traditional checkpointing writes model state to a filesystem, which means every checkpoint passes through the kernel I/O stack and page cache before eventually reaching disk. For large models, this can stall training for seconds or even minutes per checkpoint. The problem compounds when you need to maintain multiple versions for rollback, fine-tuning, or explainability: each version multiplies the storage overhead.
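For concreteness, here is a minimal sketch of that conventional path in C. The function name and arguments are illustrative, not Darknet's or LATTICE's actual checkpoint routine; the point is that every byte is copied into the page cache and then forced down the I/O stack before the checkpoint is durable.

#include <stdio.h>
#include <unistd.h>

/* Write a weight buffer through the filesystem: user buffer -> libc
 * buffer -> page cache -> disk. Training stalls until fsync returns. */
int checkpoint_to_file(const char *path, const float *weights, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    if (fwrite(weights, sizeof(float), n, f) != n) { fclose(f); return -1; }
    fflush(f);            /* drain the libc buffer into the page cache */
    fsync(fileno(f));     /* force dirty pages down the I/O stack to disk */
    fclose(f);
    return 0;
}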
Direct Persistence with NVMM
LATTICE takes a different approach by leveraging non-volatile main memory (NVMM) expansion devices for direct persistence. Instead of serializing model state to files, LATTICE writes directly to persistent memory regions that survive power failures. This eliminates the filesystem overhead entirely and enables efficient versioning through copy-on-write semantics — only the parameters that changed between versions need to be stored.
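The sketch below shows what chunk-level copy-on-write over a mapped NVMM region could look like, written against PMDK's libpmem. The pool path, chunk size, data layout, and function names are assumptions made for illustration only; they are not LATTICE's actual interface.

#include <libpmem.h>
#include <string.h>

#define CHUNK      4096            /* parameters per copy-on-write chunk (assumed) */
#define MAX_CHUNKS 1024

/* One version = a table of offsets into a persistent chunk arena.
 * Unchanged chunks simply reuse the previous version's offset.
 * (In a real design this table would itself live in NVMM.) */
typedef struct {
    size_t chunk_off[MAX_CHUNKS];
} version_t;

static float  *arena;              /* persistent chunk arena (mapped NVMM) */
static size_t  arena_used;         /* floats consumed so far */

/* Persist a new version, copying only chunks that differ from prev.
 * Assumes the parameter count n is the same across versions. */
void snapshot(version_t *ver, const version_t *prev,
              const float *weights, size_t n)
{
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    for (size_t c = 0; c < nchunks; c++) {
        size_t off   = c * CHUNK;
        size_t count = (off + CHUNK <= n) ? CHUNK : n - off;
        size_t bytes = count * sizeof(float);

        if (prev && memcmp(&arena[prev->chunk_off[c]], &weights[off], bytes) == 0) {
            ver->chunk_off[c] = prev->chunk_off[c];   /* share unchanged chunk */
        } else {
            /* append the changed chunk and flush it straight to persistent memory */
            pmem_memcpy_persist(&arena[arena_used], &weights[off], bytes);
            ver->chunk_off[c] = arena_used;
            arena_used += count;
        }
    }
}

int main(void)
{
    size_t mapped_len; int is_pmem;
    /* /mnt/pmem0/lattice.pool is an assumed path on a DAX-mounted device */
    arena = pmem_map_file("/mnt/pmem0/lattice.pool", 1ULL << 30,
                          PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (!arena) return 1;
    /* ... training loop calls snapshot() at each checkpoint ... */
    pmem_unmap(arena, mapped_len);
    return 0;
}

Because unchanged chunks are shared by reference rather than rewritten, the cost of a snapshot scales with the amount of parameter change between versions rather than with model size, which is the essence of the copy-on-write versioning described above.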
Key Results
We integrated LATTICE with the Darknet deep learning framework and evaluated it against conventional checkpointing approaches. The library maintains snapshots of multiple model versions without redundant storage, and the direct persistence path avoids the serialization and I/O-stack overhead that dominates traditional approaches.
Why It Matters
As models grow larger and training runs grow longer, the cost of checkpointing becomes a meaningful fraction of total training time. LATTICE demonstrates that emerging memory technologies can fundamentally change this equation — making checkpointing cheap enough to do frequently without impacting training throughput.
Published at SYSTOR 2025 (18th ACM International Systems and Storage Conference). DOI: 10.1145/3757347.3759139