...
$ mpirun -np 8 --hostfile ~/hosts --bind-to none --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x MXNET_USE_OPERATOR_TUNING=0 -mca pml ob1 -mca btl ^openib python3 /home/ubuntu/master/3rdparty/horovod/examples/mxnet_imagenet_resnet50.py --benchmark 1 --batch-size=256 --network resnet-v1 --num-layers=50 --num-epochs 1 --kv-store horovod --dtype float16 --gpus 0
Note: This call is valid when using OpenMPI. To improve the user experience, this training script may be wrapped in a Bash script in the future.
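Until such a wrapper exists, the launch command can also be assembled programmatically. The sketch below is a hypothetical helper (the function name and defaults are illustrative, not a shipped tool); the flag values are taken verbatim from the example command above.

```python
# Hypothetical launcher sketch: builds the mpirun argv shown above so it can be
# reused from Python instead of a hand-typed shell line.
import subprocess

def build_mpirun_cmd(np=8, hostfile="~/hosts",
                     script="/home/ubuntu/master/3rdparty/horovod/examples/mxnet_imagenet_resnet50.py"):
    """Return the argv list for the Horovod/OpenMPI launch."""
    cmd = ["mpirun", "-np", str(np), "--hostfile", hostfile,
           "--bind-to", "none", "--map-by", "slot"]
    # Forward the environment variables Horovod, NCCL, and MXNet need on every rank.
    for env in ("NCCL_DEBUG=INFO", "NCCL_MIN_NRINGS=4", "LD_LIBRARY_PATH",
                "PATH", "MXNET_USE_OPERATOR_TUNING=0"):
        cmd += ["-x", env]
    cmd += ["-mca", "pml", "ob1", "-mca", "btl", "^openib",
            "python3", script,
            "--benchmark", "1", "--batch-size=256", "--network", "resnet-v1",
            "--num-layers=50", "--num-epochs", "1", "--kv-store", "horovod",
            "--dtype", "float16", "--gpus", "0"]
    return cmd

def launch():
    # subprocess.run would start the distributed job; shown here as a sketch.
    return subprocess.run(build_mpirun_cmd(), check=True)
```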
...
(a) hvd.allreduce (b) hvd.broadcast_parameters
...
- In every iteration the DistributedOptimizer wrapper will insert an Allreduce of the gradients before the weight update is done.
- This is done by calling hvd.allreduce.
- This calls down into the C API horovod_mxnet_allreduce_async, which in turn calls a new method, MXWaitForHorovod, on the MXNet side. This is the only information the MXNet library knows about Horovod.
- This calls MXNet's PushAsync, which creates a callback for Horovod to call upon completion of the Allreduce.
- After the Allreduce is complete, the optimizer's weight update is performed.
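The steps above can be sketched with a minimal, framework-free simulation of the callback handshake. Names such as push_async and allreduce_then_update are illustrative stand-ins, not the real MXNet or Horovod symbols; the point is only the ordering: the weight update runs in the completion callback, after the Allreduce.

```python
import threading

# Toy engine: runs a task on a worker thread, then invokes a completion
# callback, mimicking the shape of MXNet's PushAsync contract described above.
def push_async(task, on_complete):
    def run():
        task()
        on_complete()  # in the real flow, Horovod fires this when Allreduce ends
    threading.Thread(target=run).start()

def allreduce_then_update(grads, world, weight_update, done):
    # Stand-in for hvd.allreduce: average the same gradient across "ranks".
    def allreduce():
        grads[:] = [sum(g) / len(g) for g in zip(*world)]
    def on_complete():
        weight_update(grads)  # optimizer step runs only after the Allreduce
        done.set()
    push_async(allreduce, on_complete)
```

A short usage run with two simulated ranks: averaging their gradients first, then applying a plain SGD step.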
...
This behaviour matches pip install horovod with TensorFlow and PyTorch support: when those libraries are not present, the Horovod installation fails.
(a) design using callbacks (b) simplified design
Figure 2. How Horovod interacts with MXNet engine
...
- Easier integration with other MXNet bindings, because those bindings already support KVStore.
- The user does not have to install another dependency for distributed training, because the MXNet build includes the Horovod source code as a third-party dependency.
- However, there is a trade-off: the Horovod source code would then need to be maintained to ensure there are no regressions.
- Language bindings for languages other than Python are available without additional work.
Performance Benchmarks
Final API Benchmarks
Model: resnet-v1, 50 layers
Dataset: synthetic
Dtype: float32
Instance types: Horovod+X (32 p3.16xlarge), parameter server (32 p3.16xlarge)
Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers) and Horovod+MXNet
Prototype Benchmarks
Model: resnet-v1, 50 layers
Dataset: synthetic
Dtype: float32
Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).
Figure 5. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers), parameter server 2 servers:1 worker, Intel MPI+MXNet, Horovod+Tensorflow, and Horovod+MXNet.
...
Since CPU support and GPU fp16 support are currently experimental, we do not yet have performance numbers for them.
Addition of New APIs
We are introducing two new functions, MXWaitForHorovodAllreduce and MXWaitForHorovodBroadcast, to the MXNet C API. They take the form:
- void MXWaitForHorovodAllreduce(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*)(Engine*, void*)))
- void MXWaitForHorovodBroadcast(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*)(Engine*, void*)))
The parameters are:
...
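To make the nested function-pointer shape of these signatures concrete, here is a sketch that mirrors them with Python's ctypes. The NDArray and Engine handles are modeled as opaque void pointers; this is an illustration of the C types only, not a real binding shipped with MXNet or Horovod.

```python
import ctypes

# void (*)(Engine*, void*): the inner completion callback the engine invokes.
EngineCallback = ctypes.CFUNCTYPE(None, ctypes.c_void_p, ctypes.c_void_p)

# void (*func)(NDArray*, NDArray*, bool, char*, void (*)(Engine*, void*)):
# the outer function pointer passed to MXWaitForHorovodAllreduce/Broadcast.
HorovodFunc = ctypes.CFUNCTYPE(
    None,             # void return
    ctypes.c_void_p,  # NDArray* input
    ctypes.c_void_p,  # NDArray* output
    ctypes.c_bool,    # bool average
    ctypes.c_char_p,  # char* name
    EngineCallback,   # completion callback handed back to the engine
)
```

Wrapping a Python callable in HorovodFunc and invoking it shows the argument conversions (bool, C string, nested callback) in action.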
Test Plan
Functionality Tests
...