BytePS-MXNet Integration

Introduction
Value proposition
Proposed approach
Performance benchmark
Limitation
When should you use BytePS

Introduction

BytePS is a high-performance, cross-framework architecture for distributed training. In various hardware settings, it has shown performance advantages compared with Horovod+NCCL. In addition, it supports asynchronous training that Horovod (and all other all-reduce based distributed framework) cannot support. It works for MXNet, TensorFlow and PyTorch, and can run on TCP and RDMA network. The code repository is at https://github.com/bytedance/byteps.

BytePS implementation is a close relative to MXNet. It largely reuses ps-lite and MXNet kvstore_dist_server implementation for network communication and summing up gradients. However, it also adds its own optimization, which is not backward compatible with upstream MXNet’s ps-lite and kvstore implementation.

Consequently, to use MXNet+BytePS, although users are free to choose any MXNet versions as MXNet workers, they have to use a dedicated MXNet version (provided by BytePS team) as BytePS servers. This often causes confusion because sometimes users have to install two versions of MXNet to use MXNet+BytePS.

~~Hence, the BytePS team proposes to integrate the changes BytePS made to ps-lite and kvstore_dist_server into upstream MXNet. Ideally, upstream MXNet can work as BytePS servers directly.~~

BytePS now relies on ps-lite as its third party communication library.

Value proposition

Below are the benefits of this proposed integration.

Better MXNet+BytePS integration will largely ease MXNet+BytePS deployment
Unified launcher: potentially, BytePS can reuse the same launcher as MXNet because BytePS reuses DMLC_* environment variables and logic in DMLC core (via ps-lite). This can help users get familiar with MXNet bootstrap process and maybe cluster orchestrator (like K8S) operators.
Increasing the number of MXNet users: BytePS is cross-framework, so non-MXNet users may start using MXNet via using BytePS.

For example, right now the possible setups are like this:

MXNet workers + BytePS servers/scheduler
TensorFlow workers + BytePS servers/scheduler
PyTorch workers + BytePS servers/scheduler

After the integration, the setups will be:

MXNet workers + MXNet servers/scheduler
TensorFlow workers + MXNet servers/scheduler
PyTorch workers + MXNet servers/scheduler

All BytePS users will be converted into (partly) MXNet users.

Below are the benefits of using BytePS. Proposed integration will make these benefits more accessible.

Performance: BytePS shows higher performance than Horovod+NCCL (see appendix)
Asynchronous training: BytePS supports asynchronous training that Horovod does not support
Usability: BytePS interface highly aligns with Horovod, and does not require additional modification on the MXNet worker side
Cloud-friendly: BytePS shows even larger performance advantages when running on cloud (<3% extra monetary investment for 2x performance)

Proposed approach

For now, we have a dedicated MXNet (for BytePS servers only) and ps-lite (for the networking functionalities of workers/servers/scheduler) for BytePS. They basically reuse the interfaces of the original code. Specifically, we reuse the MXNet KVstore interfaces to implement gradient summation on BytePS servers, and reuse the ps-lite interfaces to implement RDMA and optimized-TCP (based on ZeroMQ). However, the original functionalities of MXNet-KVstore and ps-lite are broken.

Therefore, we plan to make the following code changes to adapt to existing ~~MXNet and~~ ps-lite repositories:

~~MXNet KVstore~~

Implement a new subclass of the KVstoreDistServer class, and provide a knob (e.g., an environment variable) for MXNet users to switch between vanilla KVstoreDistServer and BytePS-KVstoreDistServer. The major modifications would be in “incubator-mxnet/src/kvstore/kvstore_dist_server.h” file.

ps-lite

Implement two subclasses of the Van class for BytePS-TCP and BytePS-RDMA, respectively. In fact, existing ps-lite already incorporates ZMQVan and IBVerbsVan. We can follow their approaches similarly and add two more subclasses, i.e., BytepsTcpVan and BytepsRdmaVan. We plan to add two header files for these two subclasses, respectively. At run-time, we can determine the actual Van implementation via environment variables or other methods.
To facilitate high performance RDMA, we also need to add several new fields for the protobuf data format of ps-lite. However, the change is incremental and will not affect the original ps-lite.

There is no additional 3rd party dependency required.

Performance Benchmark

TCP/IP Network Benchmark

To demo MXNet+BytePS performance, we test two models: VGG16 (communication-intensive) and Resnet50 (computation-intensive). Both models are trained using fp32.

We use Tesla V100 16GB GPUs and set batch size equal to 64 per GPU. The machines are in fact VMs on a popular public cloud. Each machine has 8 V100 GPUs with NVLink-enabled. Machines are inter-connected with 20 Gbps TCP/IP network.

BytePS outperforms Horovod (NCCL) by 44% for Resnet50, and 100% for VGG16.

RDMA Network Benchmark

We also test the BERT-large model using fp16 on with RDMA network. The model is implemented using the gluon-nlp toolkit.

We use Tesla V100 32GB GPUs and set batch size equal to 64 per GPU. Each machine has 8 V100 GPUs with NVLink-enabled. Machines are inter-connected with 100 Gbps RoCEv2 network.

BytePS achieves ~90% scaling efficiency for BERT-large with 256 GPUs. As a comparison, Horovod+NCCL has only ~70% scaling efficiency even after expert parameter tunning.

Limitation

BytePS currently has the following limitations:

It only supports data-parallelism. That means your model should be able to fit into a single GPU. Fortunately, many popular DNN models match the size requirement (model size usually several hundred MB or a few GB).
It is not friendly to sparse models. BytePS focus on training dense DNN models. It assumes that all parameters need to be optimized in each update. Meanwhile, some sparse models might not follow this assumption. Moreover, a sparse model is usually very large and cannot fit into a single GPU’s memory, which falls back to limitation #1.
It does not have fault-tolerance yet. Users may need to manually add parameter checkpointing in their own code to recover from failure.

When should you use BytePS?

BytePS can help you train DNN models in distributed manner (i.e., with at least 2 GPUs) with higher performance than any other existing architectures (native PS or Horovod). If you find your distributed training does not give you satisfying performance, then BytePS is your best choice. That said, there are some cases where you may not see evident improvement with BytePS:

You only use one GPU. There is no communication for single GPU training, so BytePS does not give you any benefit.
Your distributed training already achieves (near-)linear scaling. That means your task is not bottlenecked by communication (but by computation instead). BytePS only optimizes the communication.

The rationale of BytePS

This page explains why BytePS outperforms Horovod (and other existing Allreduce or PS based frameworks) in details: https://github.com/bytedance/byteps/blob/master/docs/rationale.md

Contact

For the above, "we" stand for the BytePS team. The primary developers of BytePS team now are Yibo Zhu, Yimin Jiang and Chang Lan. They can be reached via the following email address. We also thank other developers as well.

zhuyibo@bytedance.com

jiangyimin@bytedance.com

lanchang@bytedance.com