...
The existing parameter server approach to distributed MXNet faces limitations in performance and feature completeness (e.g., tensor fusion, single-bit gradient compression, and the ability to use MPI and NCCL).
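To make the tensor fusion idea concrete, here is a purely illustrative sketch (not MXNet or Horovod code; all names are hypothetical): many small gradient tensors are packed into a single flat buffer so that one allreduce replaces many small ones, amortizing per-message latency.

```python
def fuse(tensors):
    """Concatenate a list of flat gradient tensors into one buffer."""
    buffer, shapes = [], [len(t) for t in tensors]
    for t in tensors:
        buffer.extend(t)
    return buffer, shapes

def unfuse(buffer, shapes):
    """Split a fused buffer back into the original tensors."""
    out, offset = [], 0
    for n in shapes:
        out.append(buffer[offset:offset + n])
        offset += n
    return out

def allreduce_sum(buffers):
    """Stand-in for an MPI/NCCL allreduce: elementwise sum across workers."""
    return [sum(vals) for vals in zip(*buffers)]

# Two workers, each holding two small gradient tensors.
worker_grads = [
    [[1.0, 2.0], [3.0]],
    [[0.5, 0.5], [1.0]],
]
fused = [fuse(grads) for grads in worker_grads]
reduced = allreduce_sum([buf for buf, _ in fused])
result = unfuse(reduced, fused[0][1])
# result == [[1.5, 2.5], [4.0]]
```

In practice the fused buffer lives in a preallocated region and the real win is issuing one large collective instead of one per tensor; this sketch only shows the pack/reduce/unpack flow.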
...
Usability - Users get the best performance out of the box, without having to experiment with the number of workers and servers.
Performance - Horovod + TensorFlow has shown 2x the performance of Distributed TensorFlow [1], so we expect Horovod + MXNet to show similar gains over the parameter server approach.
Cost savings - Parameter server instances are not needed when using Horovod.
Simplified architecture - Leverage battle-tested libraries such as MPI and NCCL, as well as network optimizations such as RDMA.
Profiler - Horovod has an excellent profiler for finding bottlenecks.
Online learning - Due to its MPI paradigm, Horovod can save checkpoints, which enables online learning and fine-tuning of your model. With a parameter server, saving the optimizer state located on the servers takes additional work, but with Horovod this feature comes for free. Note: this feature is currently not supported.
Community - Horovod is a way for MXNet to leverage the deep learning community's advancements in distributed training, and to increase MXNet's visibility.
Proposed Approach
User Interface
...
Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).
...
...
Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on the same nodes as workers), parameter server (2 servers : 1 worker), Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.
Addition of New APIs
We are introducing new MXWaitForHorovodAllreduce and MXWaitForHorovodBroadcast functions to the MXNet C API. These functions will take the form of:
...
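As a purely hypothetical sketch (the exact signatures are elided above, and the parameter names here are illustrative, not the finalized MXNet C API), such declarations could look like:

```
// Hypothetical sketch only: parameters are assumptions, not the final API.
// `input`/`output` are NDArray handles; `name` identifies the tensor so
// Horovod's coordination logic can match it across workers.
void MXWaitForHorovodAllreduce(NDArray* input, NDArray* output,
                               bool average, const char* name);
void MXWaitForHorovodBroadcast(NDArray* input, NDArray* output,
                               int root_rank, const char* name);
```

The "WaitFor" naming suggests each call blocks until Horovod's background thread has completed the corresponding collective operation on the named tensor.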
Oct. 5, 2018: Beta release of final API
References
[1] Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018). https://arxiv.org/pdf/1802.05799.pdf
...