This document was originally written by Lin Yuan


The aim of this document is to help developers quickly understand the overall design architecture as well as the code structure of the MXNet backend. It targets readers who have programming experience in C/C++ and would like to contribute to the MXNet backend. We assume that the reader has already perused the 60min Gluon Crash Course, and followed the Handwritten Digit Recognition Tutorial to completion.

This guide is not a replacement for the existing How To Implement an Operator tutorial. Rather, it aims to provide readers a comprehensive view of the backend architecture and an understanding of how the individual components are linked together. It is organized to give a complete view at different levels of detail and is not intended to dive straight into a particular area. This guide also provides further references should readers want more details on an individual component.

MXNet System Overview

MXNet can be many things to different people, but the underlying novelty is that it is a heterogeneous parallel processing framework with a sophisticated runtime engine that dynamically manages data dependencies, memory, data transfers, and much more. The figure below gives a high-level overview of the user's interaction with MXNet. At the most basic level, users write an app that calls MXNet APIs and defines a context within which their app will be executed. The context defines the number and type of processors (currently CPUs & GPUs are supported). The app defines what computations will be performed. Since MXNet is for deep learning applications, it has APIs for computations like convolution, pooling, and gradient calculation. But it also has APIs for general numerical computation on arrays such as transpose, inner product, diagonal, etc. Each computation is referred to as an “Operator”. MXNet has an extensive library of operators that are accessible through either an imperative or symbolic execution model. In either model, the execution engine always represents the set of work as a computational graph, which may be optimized to combine operators and improve performance. And that is where the magic happens.

Figure 1: System overview

MXNet Engine Architecture

The MXNet engine takes a computation graph and operator library and manages execution as defined by the dependencies and other constraints (like memory, data transfer, etc.). The computation graph represents the work that the user defined in their application (the parts/layers of their neural network). The engine hands out work to the set of processors defined in the context according to the dependencies in the graph. The operator implementation defines the low level implementation detail of different operators on various computing platforms.

At a high-level, there are two execution flows based on the MXNet execution model: Symbolic and Imperative. They both create a computation graph, allocate required memory, set up the context and invoke the execution engine through the computation graph. Figure 2 shows the sequence of execution in these two modes:
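The contrast between the two flows can be sketched in plain Python. This is a toy illustration, not MXNet's actual API: in the imperative style each call computes its result immediately, while in the symbolic style calls only record a graph that a separate executor evaluates later (MXNet's bind/forward pattern).

```python
# Toy illustration (not MXNet's actual API): the same computation in an
# "imperative" style (each call executes immediately) and a "symbolic"
# style (calls only build a graph, which a separate step executes).

def imperative_add(a, b):
    # Imperative: the result is computed as soon as the call returns.
    return [x + y for x, y in zip(a, b)]

class Node:
    """Symbolic placeholder: records the operation instead of running it."""
    def __init__(self, op=None, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def __add__(self, other):
        return Node(op="add", inputs=(self, other))

def execute(node, bindings):
    # Executor: walks the recorded graph and evaluates it against inputs.
    if node.op is None:
        return bindings[node.name]
    a, b = (execute(i, bindings) for i in node.inputs)
    return [x + y for x, y in zip(a, b)]

# Imperative mode: computed right away.
eager = imperative_add([1, 2], [3, 4])

# Symbolic mode: build first, execute later.
a, b = Node(name="a"), Node(name="b")
graph = a + b
deferred = execute(graph, {"a": [1, 2], "b": [3, 4]})
```

Both styles produce the same result; the difference is that the symbolic graph exists as an object the engine can inspect and optimize before anything runs.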

Figure 2: Execution flow

Computation Graph

The computation graph models the dependencies between operators at a high level. Nodes can represent inputs, outputs and operators; edges represent the data dependencies between nodes. Figure 3 below shows an example of a computation graph for an operation containing only a single Softmax operator. Notice that the input data is operated on by the softmax node (this is the forward pass) and results in output 1. Then this output 1 is passed to the backward_softmax operator along with ograd (which is the gradient from any downstream operators, though there aren't any in this example) and results in output 3. This final output is the igrad, or input gradient, that would be passed upstream to the next operator (if there were any).
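The forward and backward nodes in Figure 3 can be reproduced numerically in plain Python. This is a hedged sketch of the standard softmax math, not MXNet's operator implementation: the forward node produces the output, and the backward node combines that output with ograd to produce igrad.

```python
import math

# Numeric sketch of Figure 3's graph (plain Python, not MXNet's kernels).

def softmax(x):
    # Forward pass: subtract the max for numerical stability, then normalize.
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def backward_softmax(y, ograd):
    # Backward pass: igrad_i = y_i * (ograd_i - sum_j(ograd_j * y_j)).
    dot = sum(o * v for o, v in zip(ograd, y))
    return [v * (o - dot) for v, o in zip(y, ograd)]

data = [1.0, 2.0, 3.0]
out = softmax(data)                              # "output 1" in Figure 3
igrad = backward_softmax(out, [1.0, 0.0, 0.0])   # "output 3", the igrad
```

Note that the backward node needs both ograd and the forward output, which is exactly why Figure 3 draws an edge from output 1 into backward_softmax.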

Figure 3: Computation graph of softmax

In the code, the Graph data structure is defined in the nnvm package from the DMLC/tvm project. The goal of the Neural Network Virtual Machine (nnvm) is to take a graph representation of workloads from different frameworks and translate the high-level graphs into execution graphs.

The symbolic and imperative execution modes have different routines to create and execute the graph. In symbolic mode, the graph is created and executed using routines defined in the GraphExecutor class. In imperative mode, the routine is defined in the Imperative namespace.

Furthermore, the graph is manipulated and transformed through a series of passes defined in nnvm/pass_functions.h. The following figure shows an overview of this process.
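The core idea of a pass can be sketched in a few lines of Python. These names and data structures are illustrative, not nnvm's actual API: a pass is simply a function that takes a graph and returns a new (annotated or transformed) graph, and passes compose by running one after another.

```python
# Toy sketch of the nnvm-style "pass" idea (illustrative names, not nnvm's API).
# A graph is a dict mapping each node to the tuple of nodes it reads from.

def topo_order(graph, outputs):
    """Pass 1: return nodes in topological order (inputs before consumers)."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for dep in graph.get(n, ()):
            visit(dep)
        order.append(n)
    for out in outputs:
        visit(out)
    return order

def prune_unused(graph, outputs):
    """Pass 2: drop nodes the outputs never reach (dead-code elimination)."""
    live = set(topo_order(graph, outputs))
    return {n: deps for n, deps in graph.items() if n in live}

# A graph with one node that no output depends on.
g = {"mul1": ("a", "b"), "mul2": ("c", "d"), "add": ("mul1", "mul2"),
     "unused": ("a",)}
order = topo_order(g, ["add"])
pruned = prune_unused(g, ["add"])
```

Real nnvm passes (shape inference, memory planning, operator fusion) follow this same graph-in, graph-out shape, which is what lets them be chained freely.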

Figure 4: Graph pass function call

The graph pass functions are defined in tvm/nnvm/src/pass.

Execution Engine

When the engine executes the graph, it invokes the appropriate operators from the library according to their dependencies. The core interface for the execution engine is:

virtual void PushSync(SyncFn exec_fn, Context exec_ctx,
                      std::vector<VarHandle> const& const_vars,
                      std::vector<VarHandle> const& mutable_vars,
                      FnProperty prop = FnProperty::kNormal,
                      int priority = 0,
                      const char* opr_name = nullptr);
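The contract behind this interface can be sketched with a toy sequential scheduler. This is an illustration only, with made-up names: each pushed function declares the variables it reads (const_vars) and the variables it writes (mutable_vars), and the engine runs it only once the variables it reads have been produced. The real engine does this concurrently across threads and devices; the sketch just resolves dependencies and runs in a valid order.

```python
# Toy sequential sketch of the PushSync contract (illustrative, not MXNet).

class ToyEngine:
    def __init__(self):
        self.pending = []  # list of (fn, read-set, write-set)

    def push_sync(self, fn, const_vars, mutable_vars):
        # Record the work and its declared dependencies; do not run yet.
        self.pending.append((fn, set(const_vars), set(mutable_vars)))

    def run_all(self):
        done_writes = set()
        remaining = list(self.pending)
        while remaining:
            for item in remaining:
                fn, reads, writes = item
                # Ready when every variable this op reads has been produced.
                if reads <= done_writes:
                    fn()
                    done_writes |= writes
                    remaining.remove(item)
                    break
            else:
                raise RuntimeError("unsatisfiable dependency")

log = []
eng = ToyEngine()
# Pushed out of order on purpose: "add" reads the variable "load" writes,
# so the engine must run "load" first regardless of push order.
eng.push_sync(lambda: log.append("add"), const_vars={"x"}, mutable_vars={"y"})
eng.push_sync(lambda: log.append("load"), const_vars=set(), mutable_vars={"x"})
eng.run_all()
```

The key point mirrored from PushSync is that callers describe *what* each operation touches, and ordering is derived by the engine rather than written by the user.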

The engine invokes operators by calling the exec_fn. Each operator's implementation is defined for both CPU and GPU under src/operator. A detailed example of implementing an operator can be found in the How To Implement an Operator tutorial mentioned above.

Other Important Components

Higher Level Language Interfaces

The MXNet backend is written in high performance C/C++. This lends itself well to performance and maintainability, and the backend can be bound to higher-level languages through extensions or wrappers. For example, Python supports C/C++ extensions that are Python functions but implemented in C/C++. MXNet provides native Python support by generating custom wrappers for the various supported operators.

Python wrapper generation

In this section we describe the implementation of the imperative NDArray use model of MXNet in Python. As mentioned before, operators are implemented in C/C++ and registered through mechanisms defined in the dmlc-core package. DMLC-core is a common bricks library for building scalable and portable distributed machine learning. It defines a common way to register operators, their input arguments, output arguments, memory requirements, etc. so that they can all be integrated into a common library. This provides the engine a streamlined interface for executing the operators.
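The registry idea can be sketched in Python. All names here are illustrative, not dmlc-core's actual API: operators register themselves with metadata in a global table, and the engine and front ends look them up by name instead of linking to each operator directly.

```python
# Toy sketch of a dmlc-core-style operator registry (illustrative names).

OPERATOR_REGISTRY = {}

def register_op(name, num_inputs, num_outputs):
    """Decorator that records an operator and its metadata in a global table."""
    def decorator(fn):
        OPERATOR_REGISTRY[name] = {
            "fn": fn, "num_inputs": num_inputs, "num_outputs": num_outputs}
        return fn
    return decorator

@register_op("elemwise_add", num_inputs=2, num_outputs=1)
def elemwise_add(a, b):
    return [x + y for x, y in zip(a, b)]

# A front end can now discover and invoke the operator by name alone,
# which is what lets wrappers be generated without per-operator glue code.
op = OPERATOR_REGISTRY["elemwise_add"]
result = op["fn"]([1, 2], [3, 4])
```

The metadata (argument counts, and in the real system argument types, memory requirements, and documentation) is exactly what the wrapper generator described below consumes.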

At runtime, when the mxnet package is imported in a user's Python program (i.e. 'import mxnet as mx'), it calls some initialization routines that create Python wrappers for the C/C++ operators on the fly from the registration information described above. This has a few benefits, such as minimal code overhead and reducing the number of places where changes have to be made. The _make_ndarray_function routine creates the Python code text on the fly when the mxnet module is loaded.

If you're not familiar with this concept, it can seem quite magical initially. But when you remember that Python is a dynamic, interpreted language, you realize that it's not so magical: rather than compiling all the code ahead of time, it is parsed as the program executes. This allows code to be created as needed (after the package is imported with “import mxnet” but before you call any functions like “mxnet.nd.conv(a)”).
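The underlying mechanism is ordinary Python: build a function definition as a string, compile it with exec, and install it in a namespace. The sketch below uses made-up names (make_wrapper, pack3) purely to show the technique; it is not MXNet's actual generator.

```python
# Minimal illustration of runtime code generation (not MXNet's generator).

def make_wrapper(func_name, params):
    # Build Python source text for a function that returns its args as a list.
    src = "def %s(%s):\n    return [%s]" % (
        func_name, ", ".join(params), ", ".join(params))
    namespace = {}
    exec(src, namespace)          # parse and define the function right now
    return namespace[func_name]

# "pack3" never appears in any source file; it is created from text at runtime,
# just as MXNet's operator wrappers come to exist only after "import mxnet".
pack3 = make_wrapper("pack3", ["a", "b", "c"])
```

Once created, such a function is indistinguishable from one written by hand, which is why generated wrappers can carry real signatures and docstrings.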

The Python wrapper is literally generated from scratch on the fly; it does not exist anywhere on disk in the MXNet sources. A Python function registers the new function in the global namespace by getting the info from the operator registration in C via the getSymbolInfo function. The code snippet below shows a string (“def %s(%s):”) being created that looks oddly like a Python function definition. That's because it is! Based on the operator registration info, the Python wrapper for that operator is generated by appending text together.

code.append("""
def %s(%s):""" % (func_name, ", ".join(signature)))
if not signature_only:
    code.append("""
    ndargs = []
    keys = list(kwargs.keys())
    vals = list(kwargs.values())""")
    for name in ndarg_names:
        code.append("""
    if {name} is not None:
        assert isinstance({name}, NDArrayBase), 'Argument {name} must have NDArray type, but got %s' % str({name})""".format(name=name))

Attribute Inference

Another convenient benefit of defining a neural network in MXNet is that the user does not need to specify data types and/or shapes for all of its inputs and outputs. The MXNet engine will deduce the properties of NDArrays in the network based on user-provided information. This process is called attribute inference. The following code snippet shows an example of MXNet performing shape inference for the rest of the arguments based on the shape specified by only one of the inputs: a.

>>> import mxnet as mx
>>> a = mx.sym.Variable('a', shape=(2, 3), dtype='float64')
>>> b = mx.sym.Variable('b')
>>> c = mx.sym.Variable('c')
>>> d = mx.sym.Variable('d')
>>> e = a * b + c * d
>>> print(e.infer_shape())
([(2, 3), (2, 3), (2, 3), (2, 3)], [(2, 3)], [])
>>> print(e.infer_type())
([<class 'numpy.float64'>, <class 'numpy.float64'>, <class 'numpy.float64'>, <class 'numpy.float64'>], [<class 'numpy.float64'>], [])

The shape inference in the engine is done through multiple topological traversals of the computation graph as shown below.

Figure 5: Attribute inference in computation graph

In each traversal, the shape and data type of each node are inferred from its input and/or output nodes; if a node's attributes cannot be inferred during a pass, they are left as unknown for the time being. The graph traversal then repeats until either all nodes' attributes have been inferred, or the number of unknown attributes cannot be reduced any further, in which case an exception is thrown. Consider the example in Figure 5. This computation graph defines a linear algebra operation:

e = a * b + c * d

In this operation, the user specifies the shape and data type only for input a, and the MXNet engine derives the shape and data type for all the remaining nodes in the computation graph before performing the required computation.

In the first pass, the shape and data type of node b can be inferred from the shape and data type of node a based on the operator MUL1. The output of MUL1 can be inferred based on nodes a and b. Subsequently the output node e of operator ADD can also be inferred. After this forward pass, however, the shape and data type of nodes c and d are still unknown, because based on operator MUL2 alone no information can be inferred for either of them; nor can the output of MUL2 be inferred. In the second pass, the graph is traversed from the output to its input. During this backward traversal, the attributes of the output of MUL2 (which is an input to ADD) can be inferred from the shape and data type of node e, and subsequently the attributes of nodes c and d can be inferred. After these two passes, the shape and data type of all the nodes in the graph are known. The attribute inference function is defined in src/executor/
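The two-pass process above can be sketched for this exact graph in plain Python. This is an illustrative toy, not MXNet's InferShape code: since elementwise mul/add require all operands and the output to share one shape, each operator simply ties a group of names together, and repeated traversals spread the one known shape until nothing new can be inferred.

```python
# Toy shape inference for e = a*b + c*d (illustrative, not MXNet's code).

def propagate(shapes, groups):
    """One traversal: within each same-shape group, spread any known shape."""
    changed = False
    for group in groups:
        known = next((shapes[n] for n in group if shapes[n] is not None), None)
        if known is None:
            continue  # nothing known in this group yet; try again next pass
        for n in group:
            if shapes[n] is None:
                shapes[n] = known
                changed = True
    return changed

# Same-shape constraints from the graph: MUL1 ties (a, b, mul1_out),
# MUL2 ties (c, d, mul2_out), ADD ties (mul1_out, mul2_out, e).
groups = [("a", "b", "mul1_out"), ("c", "d", "mul2_out"),
          ("mul1_out", "mul2_out", "e")]
shapes = {n: None for n in ("a", "b", "c", "d", "mul1_out", "mul2_out", "e")}
shapes["a"] = (2, 3)   # the only user-provided shape

# Repeat traversals until no pass can infer anything new (cf. Figure 5).
while propagate(shapes, groups):
    pass
```

The first traversal fills in b and the outputs of MUL1 and ADD; c and d only resolve on the next traversal, once MUL2's output is known, mirroring the forward-then-backward description above.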
