...
$ mpirun -np 8 --hostfile ~/hosts --bind-to none --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x MXNET_USE_OPERATOR_TUNING=0 -mca pml ob1 -mca btl ^openib python3 /home/ubuntu/master/3rdparty/horovod/examples/mxnet_imagenet_resnet50.py --benchmark 1 --batch-size=256 --network resnet-v1 --num-layers=50 --num-epochs 1 --kv-store horovod --dtype float16 --gpus 0
Note: This call is valid when using OpenMPI. To improve the user experience, this training script may be wrapped in a Bash script in the future.
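Until such a wrapper exists, the launch command can also be assembled programmatically. The sketch below is a hypothetical helper (the function name and defaults are illustrative, not a shipped tool); the flag values are taken verbatim from the example command above.

```python
# Hypothetical launcher sketch: builds the mpirun argv shown above so it can be
# reused from Python instead of a hand-typed shell line.
import subprocess

def build_mpirun_cmd(np=8, hostfile="~/hosts",
                     script="/home/ubuntu/master/3rdparty/horovod/examples/mxnet_imagenet_resnet50.py"):
    """Return the argv list for the Horovod/OpenMPI launch."""
    cmd = ["mpirun", "-np", str(np), "--hostfile", hostfile,
           "--bind-to", "none", "--map-by", "slot"]
    # Forward the environment variables Horovod, NCCL, and MXNet need on every rank.
    for env in ("NCCL_DEBUG=INFO", "NCCL_MIN_NRINGS=4", "LD_LIBRARY_PATH",
                "PATH", "MXNET_USE_OPERATOR_TUNING=0"):
        cmd += ["-x", env]
    cmd += ["-mca", "pml", "ob1", "-mca", "btl", "^openib",
            "python3", script,
            "--benchmark", "1", "--batch-size=256", "--network", "resnet-v1",
            "--num-layers=50", "--num-epochs", "1", "--kv-store", "horovod",
            "--dtype", "float16", "--gpus", "0"]
    return cmd

def launch():
    # subprocess.run would start the distributed job; shown here as a sketch.
    return subprocess.run(build_mpirun_cmd(), check=True)
```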
...
(a) hvd.allreduce (b) hvd.broadcast_parameters
...
- In every iteration the DistributedOptimizer wrapper will insert an Allreduce of the gradients before the weight update is done.
- This is done by calling hvd.allreduce.
- This calls down into the C API horovod_mxnet_allreduce_async, which in turn calls a new method, MXWaitForHorovod, on the MXNet side. This is the only information the MXNet library knows about Horovod.
- This calls MXNet's PushAsync, which creates a callback for Horovod to call upon completion of the Allreduce.
- After the Allreduce is complete, the optimizer's weight update is performed.
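The steps above can be sketched with a minimal, framework-free simulation of the callback handshake. Names such as push_async and allreduce_then_update are illustrative stand-ins, not the real MXNet or Horovod symbols; the point is only the ordering: the weight update runs in the completion callback, after the Allreduce.

```python
import threading

# Toy engine: runs a task on a worker thread, then invokes a completion
# callback, mimicking the shape of MXNet's PushAsync contract described above.
def push_async(task, on_complete):
    def run():
        task()
        on_complete()  # in the real flow, Horovod fires this when Allreduce ends
    threading.Thread(target=run).start()

def allreduce_then_update(grads, world, weight_update, done):
    # Stand-in for hvd.allreduce: average the same gradient across "ranks".
    def allreduce():
        grads[:] = [sum(g) / len(g) for g in zip(*world)]
    def on_complete():
        weight_update(grads)  # optimizer step runs only after the Allreduce
        done.set()
    push_async(allreduce, on_complete)
```

A short usage run with two simulated ranks: averaging their gradients first, then applying a plain SGD step.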
...
This behaviour matches pip install horovod with TensorFlow and PyTorch support: when those libraries are not present, the Horovod installation fails.
(a) design using callbacks (b) simplified design
Figure 2. How Horovod interacts with MXNet engine
...
- Easier integration with other MXNet bindings, because those bindings already support KVStore.
- The user does not have to install another dependency for distributed training, because the MXNet build includes the Horovod source code as a third-party dependency.
- However, there is a trade-off: the Horovod source code would then need to be maintained to ensure there are no regressions.
- Language bindings for languages other than Python are available without additional work.
Performance Benchmarks
Final API Benchmarks
Model: resnet-v1, 50 layers
Dataset: synthetic
Dtype: float32
Instance types: Horovod+X (32 p3.16xlarge), parameter server (32 p3.16xlarge)
Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers) and Horovod+MXNet
Prototype Benchmarks
Model: resnet-v1, 50 layers
Dataset: synthetic
Dtype: float32
Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).
Figure 5. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers), parameter server 2 servers:1 worker, Intel MPI+MXNet, Horovod+Tensorflow, and Horovod+MXNet.
...
Since CPU support and GPU fp16 support are currently experimental, we do not yet have performance numbers for them.
Addition of New APIs
We are introducing two new functions, MXWaitForHorovodAllreduce and MXWaitForHorovodBroadcast, to the MXNet C API. They take the form:
- void MXWaitForHorovodAllreduce(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*)(Engine*, void*)))
- void MXWaitForHorovodBroadcast(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*)(Engine*, void*)))
The parameters are:
...
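To make the nested function-pointer shape of these signatures concrete, here is a sketch that mirrors them with Python's ctypes. The NDArray and Engine handles are modeled as opaque void pointers; this is an illustration of the C types only, not a real binding shipped with MXNet or Horovod.

```python
import ctypes

# void (*)(Engine*, void*): the inner completion callback the engine invokes.
EngineCallback = ctypes.CFUNCTYPE(None, ctypes.c_void_p, ctypes.c_void_p)

# void (*func)(NDArray*, NDArray*, bool, char*, void (*)(Engine*, void*)):
# the outer function pointer passed to MXWaitForHorovodAllreduce/Broadcast.
HorovodFunc = ctypes.CFUNCTYPE(
    None,             # void return
    ctypes.c_void_p,  # NDArray* input
    ctypes.c_void_p,  # NDArray* output
    ctypes.c_bool,    # bool average
    ctypes.c_char_p,  # char* name
    EngineCallback,   # completion callback handed back to the engine
)
```

Wrapping a Python callable in HorovodFunc and invoking it shows the argument conversions (bool, C string, nested callback) in action.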
Test Plan
Functionality Tests
...