...

(a) hvd.allreduce (b) hvd.broadcast_parameters

...

In every iteration, the DistributedOptimizer wrapper inserts an Allreduce of the gradients before the weight update.
It does this by calling hvd.allreduce.
hvd.allreduce calls down into the C API horovod_mxnet_allreduce_async.
The C API calls a new method, MXWaitForHorovod, on the MXNet side. This method is the only thing the MXNet library knows about Horovod. It in turn calls MXNet's PushAsync, which creates a callback for Horovod to call upon completion of the Allreduce.
After the Allreduce completes, the Optimizer's weight update is performed.
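
The sequence above can be illustrated with a small pure-Python sketch. It is not Horovod or MXNet code: the names allreduce_average, allreduce_async, sgd_step and distributed_step are hypothetical, the reduction is assumed to be an element-wise average, and the optimizer is assumed to be plain SGD. The sketch only shows the ordering: gradients are reduced across workers, a completion callback fires, and only then is the weight update applied.

```python
from typing import Callable, Dict, List

Vector = List[float]


def allreduce_average(per_worker: List[Vector]) -> Vector:
    """Element-wise average across workers (assumed reduction; hypothetical helper)."""
    n = len(per_worker)
    return [sum(values) / n for values in zip(*per_worker)]


def allreduce_async(per_worker: List[Vector],
                    on_complete: Callable[[Vector], None]) -> None:
    """Toy stand-in for an asynchronous Allreduce.

    It computes the reduced gradient and then invokes the completion
    callback, mirroring the "callback upon completion" step in the text.
    A real engine would schedule this work instead of running it inline.
    """
    on_complete(allreduce_average(per_worker))


def sgd_step(weights: Vector, grad: Vector, lr: float) -> Vector:
    """Plain SGD update (assumed optimizer): w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grad)]


def distributed_step(weights_per_worker: List[Vector],
                     grads_per_worker: List[Vector],
                     lr: float) -> List[Vector]:
    """One iteration: Allreduce the gradients first, then update the weights."""
    result: Dict[str, Vector] = {}

    def on_complete(reduced: Vector) -> None:
        # Called once the Allreduce has finished.
        result["grad"] = reduced

    allreduce_async(grads_per_worker, on_complete)
    # The weight update runs only after the reduced gradient is available.
    return [sgd_step(w, result["grad"], lr) for w in weights_per_worker]


if __name__ == "__main__":
    weights = [[10.0, 20.0], [10.0, 20.0]]   # two workers, identical start
    grads = [[1.0, 2.0], [3.0, 4.0]]         # different local gradients
    print(distributed_step(weights, grads, lr=0.5))
```

Because every worker applies the same reduced gradient, workers that start from identical weights remain identical after the step.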

...

Easier integration with other MXNet bindings, because those bindings already support KVStore.
Users do not have to install another dependency to do distributed training, because the MXNet build tool includes the Horovod source code as a third-party dependency.

However, there is a trade-off: the Horovod source code would then need to be maintained to ensure there are no regressions.

Language bindings for languages other than Python are available without additional work.

Performance Benchmarks

Final API Benchmarks

Model: resnet-v1, 50 layers

Dataset: synthetic

Dtype: float32

Instance types: Horovod+X (32 p3.16xlarge), parameter server (32 p3.16xlarge)

Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers) and Horovod+MXNet

Prototype Benchmarks

Model: resnet-v1, 50 layers

Dataset: synthetic

Dtype: float32

Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).

Figure 5. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers), parameter server 2 servers:1 worker, Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.

...
