Distributed DL Training Analysis

Performance and consistency analysis of distributed deep learning training methods across parameter server architectures, providing deployment guidelines for accuracy-throughput tradeoffs.

Topics: ai-systems · mxnet · python (archived)


This project investigated how different distributed deep neural network (DNN) training methods built on parameter server architectures affect both training accuracy and throughput. We studied the tradeoffs among parameter distribution strategies and provided guidelines for deploying applications under various configurations.
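To illustrate the kind of tradeoff studied here, the toy sketch below (not the project's actual code; all names and data are hypothetical) contrasts a synchronous update, where the server averages gradients from all workers before applying them, with an asynchronous one, where each worker's gradients are applied against a possibly stale parameter copy. Synchronous updates are more consistent but are gated by the slowest worker; asynchronous updates raise throughput at the cost of staleness noise.

```python
# Toy parameter-server sketch (illustration only): workers hold data
# shards and compute gradients of 1/2*(w*x - y)^2 on a shared scalar
# parameter w held by the server.

def grad(w, x, y):
    # Gradient of the squared error with respect to w.
    return (w * x - y) * x

def train(mode, data_shards, epochs=50, lr=0.1):
    w = 0.0  # server-held parameter
    for _ in range(epochs):
        if mode == "sync":
            # All workers push gradients, the server averages them, then
            # updates once: consistent, but throughput is limited by the
            # slowest worker (straggler effect).
            g = sum(grad(w, x, y) for shard in data_shards for x, y in shard)
            w -= lr * g / sum(len(s) for s in data_shards)
        elif mode == "async":
            # Each worker computes against the copy it pulled, which may
            # already be stale by the time its update lands: higher
            # throughput, noisier updates.
            for shard in data_shards:
                stale_w = w  # worker's pulled copy
                for x, y in shard:
                    w -= lr * grad(stale_w, x, y) / len(shard)
    return w

# Two worker shards drawn from y = 2*x, so both modes should recover w = 2.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
print(train("sync", shards), train("async", shards))
```

In this convex toy problem both modes converge to the same parameter; the accuracy gap the project measured arises in real non-convex DNN training, where staleness and update ordering do change the result.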

Key Results

  • Systematic comparison of distributed DNN training methods using parameter server architectures
  • Quantified accuracy-throughput tradeoffs across different parameter distribution strategies
  • Provided practical deployment guidelines for choosing training configurations
  • Published at IEEE IPCCC 2020 (39th International Performance Computing and Communications Conference)