Bring your own Accelerator

Link to dev List discussion

Once you have the design proposal written, please send an email to dev@mxnet.apache.org and provide a link below for reference. You can find the "Permalink" of your email at https://lists.apache.org/list.html?dev@mxnet.apache.org. For an example, see the link below:

https://lists.apache.org/thread.html/464712f0136fb51916ca9f1b702b99847e108dbdbd0b6a2b73fc91f1@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

A volunteer is needed to help shepherd this feature.

Problem

MXNet is a high performance machine learning framework that leverages many high performance tools and libraries in the backend, such as MKLDNN, cuDNN, TensorRT, and nGraph. Some recent backend additions to MXNet are TensorRT (subgraph) and Elastic Inference. Adding each of these backends required modifying the MXNet source code, deep knowledge of how MXNet works, and months of time working with the community to add custom processor-specific changes.

However, adding support for a new backend does not change MXNet itself, and running MXNet on a new processor should not require community approval. This proposal adds APIs that enable MXNet to run anywhere, on any custom chip or backend library, without requiring the backend code to be committed to MXNet and without forcing developers to unnecessarily open-source their architecture-specific code/routines.

Proposed Approach

“Bring your own Accelerator” is a set of accelerator APIs that allow MXNet to interface with any custom ML accelerator chip or ML library. It will give MXNet a differentiator that other ML frameworks lack.

The main problem with adding new backends to MXNet is adding the new functionality to the MXNet code base, recompiling MXNet, and upstreaming the changes (which requires community support/approval). The library approach we present will enable new backends to be compiled separately from the MXNet code base, without linking against all of MXNet's 3rd party dependencies (e.g. TVM, NNVM). A single header file, mxnet_acc.h, will define the APIs between MXNet and accelerator libraries.

The accelerator library will be loaded dynamically in the MXNet backend via dlopen, and the APIs will be located in the library using dlsym (standard POSIX functions from dlfcn.h); similar functions exist on Windows (LoadLibrary and GetProcAddress). We will use plain C types/structs to avoid compiler version/compatibility issues. This removes any requirement for new backends to be compiled or linked against MXNet, or even built with the same compiler.
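
For illustration only, the snippet below sketches the same load-and-lookup pattern from Python using ctypes (which wraps dlopen/dlsym on Linux). The MXNet backend would do the equivalent in C++; the accelerator library path and symbol name in the final comment are hypothetical.

import ctypes

#load a shared library at runtime (ctypes uses dlopen under the hood; libm is just a stand-in)
libm = ctypes.CDLL("libm.so.6")

#look up a symbol in the loaded library (the dlsym equivalent) and describe its signature
cos = libm.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]
print(cos(0.0))  # prints 1.0

#MXNet would do the analogous lookups on an accelerator library, e.g.
#lib = dlopen("/path/to/libmyacc.so"); dlsym(lib, "supportedOps")  (hypothetical symbol name)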

In terms of operator coverage, we cannot expect an accelerator to support every operator that MXNet has. Instead, we will follow the same subgraphing/partitioning scheme that MXNet already supports, where the CPU context is used for any operators not supported by the accelerator.

In this project, we will create a set of abstractions through an API that allows accelerator vendors to create external libraries to interface their custom hardware to MXNet without modifying the MXNet code base. We'll streamline how MXNet interacts with processors, and create a user-facing API to dynamically load accelerator libraries at runtime. This will allow accelerator vendors to distribute their library separately from MXNet, decoupling the release of MXNet versions from accelerator library versions. 

User experience for backend library creators:

There are two ways that ML chips/libraries can be implemented:

  1. Symbolic: whole models or subgraphs are handed off to the accelerator to execute (load/unload model, infer).
  2. Imperative: individual operators are executed on the accelerator one at a time (fcompute-style).

In this proposal, we will focus on the symbolic mode.

User experience for ML/DL scientists:

We expect users (data scientists) to treat accelerators like any other context, as they would normally in MXNet. The only things they need to be aware of are how to load an accelerator library and how to create an accelerator context.

Below is an example code snippet for using the Accelerator APIs:

from collections import namedtuple
import mxnet as mx

#load accelerator library, returns a context with device id 0
ctx = mx.load_acc("/path/to/libmyacc.so")

#after loading library, accel context can also be created by
ctx = mx.acc()
ctx = mx.acc(0)

#can also list the available accelerators just like
#mx.test_utils.list_gpus(), returns [0, 1, ...]
ctx_list = []
acc_list = mx.test_utils.list_acc(mx.acc())
for i in acc_list:
    ctx_list.append(mx.acc(i))

#bind model
sym, arg_params, aux_params = mx.model.load_checkpoint(NAME, EPOCH)
mod = mx.mod.Module(symbol=sym, context=ctx)
mod.bind(data_shapes=[('data', (1,3,224,224))], label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

#forward pass
mx_img = mx.nd.array(IMG, ctx=ctx)
Batch = namedtuple('Batch', ['data'])
data = Batch([mx_img])
mod.forward(data)

Loading Accelerator Libraries

We will provide users the simplest and most familiar ways to use accelerator libraries.

User-specified

Users can load accelerator libraries through the load_acc API by specifying the path to the library. This enables users to quickly write some code and try things out without much setup or configuration.

Bundled

MXNet can bundle libraries with its installation (pip, jar, etc.) and find those libraries during the init process (i.e. import mxnet). This creates a better user experience that “just works” for specific use cases like EIA.

Environment Variable

Users can point to a directory of accelerator libraries by setting the MXNET_ACC_LIBRARIES environment variable. This will make it easier for users to generalize their MXNet code by removing environment-specific paths. This variable will be checked during MXNet's initialization process.
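
For example, since the variable is checked during MXNet's initialization, it would have to be set before MXNet is imported. A minimal sketch (the variable name comes from this proposal and is not yet implemented; the directory path is hypothetical):

import os

#must be set before importing mxnet so it is visible during initialization
os.environ["MXNET_ACC_LIBRARIES"] = "/opt/myacc/lib"

import mxnet as mx
#accelerator libraries found in that directory are loaded at import time,
#and accelerator contexts can then be created as usual, e.g. mx.acc(0)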


Accelerator APIs

The main APIs that will be defined in mxnet_acc.h are categorized and described below. These APIs use only C (no C++) to avoid potential problems from using different compilers/STL implementations.

Future Proofing APIs

We are future-proofing the accelerator library APIs by providing generic interfaces for interacting with the accelerator library. The configure function takes a set of keyword args (inputs) and returns a set of keyword args (outputs). This API can be called multiple times, with different behavior each time depending on the inputs, so it can represent any set of additional APIs that an accelerator might need.
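
As a rough illustration of this pattern, the sketch below is written in Python for readability with a hypothetical configure entry point; the real API will be plain C using string key/value pairs.

#hypothetical sketch of the generic configure entry point:
#a map of string keyword args in, a map of string keyword args out
def configure(inputs):
    outputs = {}
    if inputs.get("action") == "set_batch_size":
        #accelerator-specific behavior is selected by the input keys
        outputs["status"] = "ok"
    elif inputs.get("action") == "query_memory":
        outputs["free_bytes"] = "1073741824"
    else:
        outputs["status"] = "unsupported"
    return outputs

#the same entry point can stand in for many accelerator-specific APIs
print(configure({"action": "query_memory"}))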

Other API concerns

Some accelerators perform special handling of the weights/params to optimize execution by placing them in special on-chip/high-speed memories. In the LoadModel API, we need to clearly identify which MXTensors are weights/params and which are input data (e.g. images or text fed to the model).
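
One way MXNet can already make this distinction is by comparing a symbol's argument list against the declared data input names. A minimal sketch using the standard Symbol API:

import mxnet as mx

#a small example network
data = mx.sym.Variable('data')
sym = mx.sym.FullyConnected(data, num_hidden=10)

data_names = ['data']
#everything in the argument list that is not a data input is a weight/param
param_names = [name for name in sym.list_arguments() if name not in data_names]
print(param_names)  # e.g. ['fullyconnected0_weight', 'fullyconnected0_bias']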

Backward compatibility

No issues; this is new functionality. Existing custom hardware backends for MKL/MKL-DNN/cuDNN/TensorRT will continue working.

Performance Considerations

We will analyze the performance overhead introduced by using a dynamically loaded library by creating a test accelerator library that simply reuses the existing CPU and GPU operator implementations. Then we'll compare these "accelerators" against the current CPU and GPU contexts.

Test Plan

We will create a test accelerator library that simply reuses the existing CPU and GPU operator implementations and run all existing unit tests against it.

Implementation plan

  1. Implement a PR with basic symbolic flow: supported ops, load/unload model, infer
    Link to WIP PR: https://github.com/apache/incubator-mxnet/pull/15489
  2. Implement a follow-up PR with the imperative accelerator flow (fcompute, storage, copy, etc.)

Alternative Approaches

Currently, custom accelerators like TensorRT must be implemented by modifying the MXNet backend and learning how MXNet works at the lowest level. The team that implemented TensorRT support in MXNet ran into many hurdles, and the lessons learned from that effort are being applied in this proposal.

Technical Challenges 

We'll need to version the MXNet operators against accelerator libraries so that, as operator implementations change, we catch mismatches with older accelerator libraries.
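
For example, the library could report the MXNet version (or per-operator versions) it was built against at load time, and MXNet could warn or refuse to load on a mismatch. A hypothetical sketch, with the reported version hard-coded for illustration:

import mxnet as mx

#hypothetical: version string reported by the accelerator library at load time
lib_built_against = "1.5.0"

if lib_built_against != mx.__version__:
    print("Warning: accelerator library was built against MXNet %s "
          "but MXNet %s is running; operator behavior may differ"
          % (lib_built_against, mx.__version__))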

Milestones

TBD

References