Note: Please feel free to comment on the Google Doc. We will merge the finalized proposal back here.


The ring Reduce communication pattern used by NCCL (Figure 1a) and Parameter server Reduce (Figure 1b) currently used in MXNet are not optimal for small batch sizes on p3.16xlarge instances with 8 GPUs.

Note: To clarify, usage of "parameter server" refers to single-machine communication rather than the distributed sense the term is usually used in. It is enabled by adding the flag "--kvstore device".

                  (a) Ring algorithm used by NCCL                                    (b) Parameter server algorithm

Figure 1. Existing Reduce and Broadcast algorithms currently in MXNet.

We run a benchmark to show this shortcoming. NCCL is clearly the best when only one Reduce and Broadcast is called (1 Push-Pull). However, in workloads that require many Reduce and Broadcast in a sequence, NCCL becomes slower than the parameter server implementation for small message sizes.

                                    (a) 1 Push-Pull before Wait                                                               (b) 50 Push-Pull before Wait                                                            (c) 150 Push-Pull before Wait

Figure 2. How long it takes for KVStore to do push-pull using each communication method. The flat region in each plot is the latency-bound region where latency dominates. The diagonal region is the bandwidth-bound region where bandwidth dominates.

Figure 2 explains the end-to-end performance results (see Figure 8) that show Parameter server is faster for networks that require many Push-Pulls over relatively small keys (e.g. ResNet-50, Inception-48 need over 157 Push-Pulls on keys not exceeding 2M floats in size), but NCCL ring Reduce is faster for networks that only need Push-Pull over few keys (e.g. VGG-16 and AlexNet only need fewer than 32 Push-Pulls on keys that exceed 10M in size).

Experimental Setup

Our experiments are performed using the p3.16xlarge instance. The link topology of p3.16xlarge is shown in Figure 3. It is composed of 3 types of links:

  1. Double NVLink connections (50 GB/s)
  2. Single NVLink connections (25 GB/s)
  3. PCI-E connection + CPU (10 GB/s)

                                                        (a) 2D view                                                                (b) 3D view (PCI-E not shown)

Figure 3. Link topology of p3.16xlarge instance.

Proposed Approach

Our approach is based on the idea of using trees to perform the Reduce and Broadcast. We can use the idea of minimum spanning trees to do a binary tree Reduce communication pattern to improve it following this paper by Wang, Li, Edo and Smola [1]. Our strategy will be to use:

  • a single tree (latency-optimal for small messages) to handle Reduce on small messages
  • multiple trees (bandwidth-optimal for large messages) to handle large messages.

Both methods are shown in Figure 4.

                          (a) Single tree algorithm                                             (b) Multiple tree algorithm

Figure 4. Proposed Reduce and Broadcast algorithms.

In addition, our proposed system ought to automatically detect how GPUs are connected in a single machine (topology-awareness), and use this knowledge to automatically generate the balanced binary trees of maximum weight. This way, the proposed system will work out-of-the-box for any network topology. The characteristics we want are:

  • Balanced, because the binary tree’s height determines the latency of the Reduce and Broadcast operations.
  • Maximum weight, because we want to maximize use of the highest bandwidth connections.


                                            (a) where it fits in KVStore                                                                        (b) InitMergeBuffersAndComm                          


                                   (c) Reduce                                                                                                                         (d) Broadcast
Figure 5. Block diagram of proposed addition. Changes to old initialization (InitMergeBuffersAndComm), Reduce and Broadcast are illustrated.

Note: Additional memory copies to temporary buffer (temp) is necessary following Reduce and Broadcast, because we do not know the final destination buffer dst at this time. However due to the interface of kvstore::Push() (i.e. the user-exposed method that calls Reduce), this information is not known until kvstore::Pull() (i.e. until Broadcast). This means that unless the API is changed to PushPull (which has both source and destination arguments), we will need to do extra write to temp buffer in Reduce, and an extra write to temp buffer in Broadcast. The good thing is that these writes are on the same GPU, so they do not take significant amount of time (less than 1% of the runtime). Interestingly, this is the same API change that the other All-Reduce related proposal is asking for.

Technical Challenges

As shown in Figure 5, the design can be broken into two tasks:

  1. Using binary tree topology_ to do a reduce
  2. Generating binary tree topology_

1. Use Binary Tree to do Reduce

First, given a binary tree we show that it gives us the precise sequence of GPU sends in order to perform a reduce. On the left will be the link topology in 3D view. On the right is the portion of the binary tree we are interested in.

(a) Step 1: GPU1 sends to GPU5, where the combined results of GPU1 and GPU5 are reduced on GPU5.

(b) Step 2: GPU7 sends to GPU5, where the combined results of GPU1, GPU3, GPU5 and GPU7 are reduced on GPU5.

(c) Step 3: GPU4 sends to GPU5, where the combined results of all 8 GPUs are reduced on GPU5.

(d) On link topology (left), the sequence of sends forms a spanning tree. The operations also correspond to a binary tree (right).

Figure 6. Reduce on a single binary tree, showing connection between reduce and binary tree.

2. Generate Binary Tree

We looked at using two methods to generate the binary tree. Since we are looking for a balanced binary tree instead of a maximum spanning tree, we cannot use polynomial time algorithms such as Kruskal's or Boruvka's MST algorithm. Using the link topology saved in an adjacency matrix as input, we used:

  • Kernighan-Lin algorithm: This is a popular hierarchical clustering method used to generate graph clusterings based on the Kernighan-Lin algorithm [4]. It is part of the popular METIS package [5].
  • Exhaustive search: We observed that in practice, Kernighan-Lin would get stuck due to our modification to the algorithm. In such cases, we resort to exhaustive search.

Since we did not want to introduce additional dependencies to MXNet, we implemented our own version of the Kernighan-Lin algorithm. We modify the Kernighan-Lin algorithm for a purpose it was not intended for, i.e. to find a binary tree embedding on the link topology we are interested in. Our algorithm works as follows:

1. Add the vertex we want to be root to the set of root nodes. Begin with entire graph in one cluster.

2. While at least one cluster has at least 2 vertices:

a) Apply Kernighan-Lin heuristic to discover a clustering of the graph into two clusters.

b) Look for the edge that satisfies the following conditions:

      • Is one of the edges that crosses between the two clusters
      • One vertex of the edge is one of the rootnodes
      • Has the highest weight of all edges that satisfy the above conditions

c) If such an edge is found, add both vertices to our binary tree discovered so far. Also, add both vertices to the set of root nodes for the next iteration.

    If no such edge is found, use exhaustive search to find a binary tree from this root.

3. Save the binary tree found in order to do reduction.

This worked well most of the time. However, when trying to find such a tree for 6 GPUs, we notice that sometimes this gets stuck and an edge cannot be found to link two such clusters. In such cases, we resorted to exhaustive search.

Link usage penalty

Trees are generated in such a sequential fashion described above. To discourage later trees from using previously used links, we apply a multiplicative penalty term MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY (default = 0.7) whenever a link has been used. This is multiplied to the initial link topology adjacency matrix where 3 represents double NVLink connection and 2 represents single NVLink connection.

When to switch between Single and Multiple tree

(a) Parameter sweep of MXNET_KVSTORE_TREE_ARRAY_BOUND                                      (b) 1 Push-Pull before Wait                                               (c) 150 Push-Pulls before Wait
Figure 7. VGG-16 performance as function of MXNET_KVSTORE_TREE_ARRAY_BOUND using batch size 4 per GPU. These figures show that beyond 1M-10M float32's, multi-tree begins to do better than a single tree.

Alternative Approaches considered

Other communication frameworks such as Horovod [2] built atop Tensorflow and Synkhronos [3] built atop Theano have also noticed the problem described earlier. They try to solve the problem of NCCL not overlapping multiple Reduces well by batching tensors (keys) together. However, this incurs overhead of memory copies within the same GPU.

Instead of this approach, we follow the approach outline above of using the asynchronous MXNet engine to build our own single-machine multi-GPU communication system. This has the advantage of not incurring memory copy overhead, reducing a dependency from MXNet, and potentially getting better performance.

End-to-End Results

Table 1. Peak speed-up on small batch sizes.

Vs. Parameter Server (in comm.h)Vs. NCCL (in kvstore_nccl.h)

Figure 8. End-to-end training results on synthetic data showing speed-up vs. NCCL on fp32 and fp16.


April 30, 2018: Begin work

June 14, 2018: Update Design Proposal

June 19, 2018: Submit Pull Request

Multi-machine Design Proposal

There is another All-Reduce related proposal on this wiki. The problem they are trying to solve differs from ours in two ways:

  1. Theirs is multimachine, but ours is single-machine.
  2. Theirs only supports CPU, but ours only supports GPU.

In light of these two differences, ideally we would be able to combine our efforts by first using single-machine communication (this proposal) to reduce keys within a machine, and use GPU-aware MPI to all-reduce keys between multiple machines (their proposal). Going from MPI for CPU to GPU-aware MPI is not much work, because it only involves choosing an MPI framework that supports (i) GPU-aware communication and (ii) change MPI send and receive buffers from CPU buffer to GPU buffer.


  1. Leyuan Wang and Mu Li, Edo Liberty, Alex J. Smola. 2018. Optimal Message Scheduling for Aggregation. In Proceedings of ACM Conference on Systems and Machine Learning (SysML’18). ACM, New York, NY, USA, Article 4, 4 pages.
  2. Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018).
  3. Stooke, Adam, and Pieter Abbeel. "Synkhronos: a Multi-GPU Theano Extension for Data Parallelism." arXiv preprint arXiv:1710.04162 (2017).
  4. B.W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graph,s", Bell System Tech. Journ al, vol. 49, Feb. 1970, pp. 291-307.
  5. Karypis, George, and Vipin Kumar. "METIS--unstructured graph partitioning and sparse matrix ordering system, version 2.0." (1995).
  • No labels