Performance and Consistency Analysis for Distributed Deep Learning
November 14, 2020
Distributed deep learning training promises linear speedup by splitting work across multiple machines. In practice, the distributed setting introduces tradeoffs between training speed and model accuracy that depend heavily on system configuration choices.
The Accuracy-Throughput Tradeoff
Parameter server architectures — where worker nodes compute gradients and a central server aggregates them — offer different consistency models. Synchronous training (BSP) waits for all workers before updating, ensuring consistency but limiting throughput to the slowest worker. Asynchronous training (ASP) lets workers proceed independently, improving throughput but introducing stale gradients that can hurt convergence.
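To make the two consistency models concrete, here is a minimal single-machine sketch of a toy parameter-server loop under BSP and ASP. It is illustrative only; the names (ParameterServer, worker_gradient, TARGET) and the threading setup are our own assumptions, not the system evaluated in the paper.

```python
import threading
import numpy as np

TARGET = np.ones(4)  # toy optimum the workers are trying to reach (assumption)

class ParameterServer:
    """Toy parameter server holding a single weight vector."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply(self, grad):
        # Atomic update of the shared parameters.
        with self.lock:
            self.w -= self.lr * grad

def worker_gradient(w, rng):
    # Stand-in for a stochastic forward/backward pass: gradient of
    # 0.5 * ||w - TARGET||^2 plus sampling noise.
    return (w - TARGET) + 0.1 * rng.normal(size=w.shape)

def train_bsp(server, n_workers, steps, seed=0):
    """Synchronous (BSP): gather every worker's gradient, then update once.
    The implicit barrier means the slowest worker gates each step."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grads = [worker_gradient(server.w, rng) for _ in range(n_workers)]
        server.apply(np.mean(grads, axis=0))

def train_asp(server, n_workers, steps, seed=0):
    """Asynchronous (ASP): each worker pushes its gradient as soon as it is
    ready, so gradients may have been computed against stale weights."""
    def run(worker_id):
        rng = np.random.default_rng(seed + worker_id)
        for _ in range(steps):
            g = worker_gradient(server.w.copy(), rng)  # snapshot may already be stale
            server.apply(g)                            # no barrier
    threads = [threading.Thread(target=run, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    bsp = ParameterServer(dim=4)
    train_bsp(bsp, n_workers=4, steps=100)
    asp = ParameterServer(dim=4)
    train_asp(asp, n_workers=4, steps=100)
    print("BSP weights:", bsp.w)
    print("ASP weights:", asp.w)
```

The only structural difference is the barrier: BSP averages all gradients before a single update, while ASP applies each gradient immediately, which is exactly where staleness enters.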
What We Measured
We deployed a real cluster of virtual machines and systematically varied how system resources were distributed, which distribution topology was used, and which model consistency approach was applied. By profiling runtime system utilization and tracking application-level activity, we quantified how these choices affect both training throughput and final model accuracy across different deep learning applications.
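For a sense of the instrumentation involved, the sketch below logs per-step throughput alongside host utilization during a training run. This is our own illustration, not the paper's tooling; the psutil dependency and the train_step callable are assumptions.

```python
import time
import psutil  # assumed available for sampling CPU and memory utilization

def profile_training(train_step, num_steps, batch_size, log_every=50):
    """Run `train_step` repeatedly and log throughput plus host utilization.

    `train_step` is assumed to process one batch (forward, backward, update).
    """
    samples_done = 0
    t0 = time.perf_counter()
    for step in range(1, num_steps + 1):
        train_step()
        samples_done += batch_size
        if step % log_every == 0:
            elapsed = time.perf_counter() - t0
            print(f"step {step:5d} | "
                  f"{samples_done / elapsed:8.1f} samples/s | "
                  f"cpu {psutil.cpu_percent():5.1f}% | "
                  f"mem {psutil.virtual_memory().percent:5.1f}%")
```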
Deployment Guidelines
The results revealed that the optimal configuration depends on the specific workload. Some models tolerate stale gradients well and benefit from asynchronous training, while others require strict synchronization. We provide practical guidelines for choosing a parameter distribution and consistency strategy based on model characteristics, cluster size, and accuracy requirements.
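As a purely hypothetical illustration of how such guidelines might be encoded, the rule of thumb below only restates the tradeoff described above; it is not the paper's specific recommendation, and the input flags are our own simplification.

```python
def choose_consistency_model(stale_gradient_tolerant: bool,
                             heterogeneous_cluster: bool,
                             strict_accuracy_target: bool) -> str:
    """Illustrative rule of thumb only; the actual guidelines are workload-specific.

    Mirrors the tradeoff described above: ASP trades gradient freshness for
    throughput, while BSP trades throughput (gated by the slowest worker)
    for consistent updates.
    """
    if strict_accuracy_target and not stale_gradient_tolerant:
        return "BSP"  # pay the synchronization cost to avoid stale gradients
    if heterogeneous_cluster and stale_gradient_tolerant:
        return "ASP"  # stragglers would otherwise stall every synchronous step
    return "BSP"      # default to the stricter consistency model
```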
Why It Matters
As organizations deploy distributed training at scale — often on heterogeneous, virtualized infrastructure — understanding these tradeoffs becomes critical. Choosing the wrong configuration can waste compute resources or produce suboptimal models. These guidelines help practitioners make informed decisions without extensive trial-and-error.
Published at IEEE IPCCC 2020 (39th International Performance Computing and Communications Conference). DOI: 10.1109/IPCCC50635.2020.9391566