...
(a) hvd.allreduce (b) hvd.broadcast_parameters
...
- In every iteration the DistributedOptimizer wrapper will insert an Allreduce of the gradients before the weight update is done.
- The wrapper does this by calling hvd.allreduce.
- hvd.allreduce calls down into the C API horovod_mxnet_allreduce_async.
- The C API calls a new method, MXWaitForHorovod, on the MXNet side. This is the only thing the MXNet library knows about Horovod. MXWaitForHorovod in turn calls MXNet's PushAsync, which creates a callback for Horovod to call upon completion of the Allreduce.
- After the Allreduce is complete, Optimizer's weight update is done.
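The ordering described above (Allreduce first, weight update second) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it does not use Horovod or MXNet, and the names allreduce_average, SGD, and ToyDistributedOptimizer are hypothetical stand-ins invented for illustration. The real path goes through hvd.allreduce, horovod_mxnet_allreduce_async, and MXNet's engine as listed in the bullets.

```python
from typing import Callable, List

Vector = List[float]


def allreduce_average(grads_per_worker: List[Vector]) -> Vector:
    """Element-wise average across workers (hypothetical stand-in for an Allreduce)."""
    num_workers = len(grads_per_worker)
    return [sum(values) / num_workers for values in zip(*grads_per_worker)]


class SGD:
    """Plain local optimizer: w <- w - lr * g."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def update(self, weights: Vector, grads: Vector) -> Vector:
        return [w - self.lr * g for w, g in zip(weights, grads)]


class ToyDistributedOptimizer:
    """Hypothetical wrapper: reduce the gradients first, then delegate the update."""

    def __init__(self, optimizer: SGD,
                 allreduce: Callable[[List[Vector]], Vector]) -> None:
        self._optimizer = optimizer
        self._allreduce = allreduce

    def update(self, weights: Vector, grads_per_worker: List[Vector]) -> Vector:
        # Step 1: combine every worker's gradients before touching the weights.
        averaged = self._allreduce(grads_per_worker)
        # Step 2: only after the reduction completes, run the wrapped optimizer.
        return self._optimizer.update(weights, averaged)


# Two simulated workers, each holding its own local gradients.
worker_grads = [[0.5, 1.0], [1.5, 3.0]]
optimizer = ToyDistributedOptimizer(SGD(lr=0.5), allreduce_average)
new_weights = optimizer.update([1.0, 2.0], worker_grads)
print(new_weights)
```

In this toy version all workers' gradients live in one process and the reduction is synchronous. In the design described above, each worker holds only its own gradients and the reduction is asynchronous, with completion signalled through the callback created by PushAsync.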
...
Easier integration with other MXNet bindings, because those bindings already support KVStore.
Users do not have to install another dependency in order to do distributed training, because the MXNet build tool includes the Horovod source code as a third-party dependency.
The trade-off is that the Horovod source code would then need to be maintained within MXNet to ensure there are no regressions.
Language bindings for languages other than Python are available without additional work.
Performance Benchmarks
Final API Benchmarks
Model: resnet-v1, 50 layers
Dataset: synthetic
Dtype: float32
Instance types: Horovod+X (32 p3.16xlarge), parameter server (32 p3.16xlarge)
Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers) and Horovod+MXNet
Prototype Benchmarks
Model: resnet-v1, 50 layers
Dataset: synthetic
Dtype: float32
Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).
Figure 5. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers), parameter server with 2 servers per worker, Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.
...