This document proposes a new implementation of the operator graph for MXNet that will supersede the current NNVM-based implementation. It will provide a more powerful and extensible implementation that is easier to maintain and extend, while cutting an external dependency from the project (one to which we are thankful and have contributed). The operator graph of a deep learning framework is akin to the abstract syntax tree of a compiler: it is the main data structure holding the computations that need to be performed and scheduled.

Current status:

MXNet, like other deep learning frameworks, uses a graph of variables and operators to schedule computations, track variables and data flow, perform back-propagation, and calculate gradients.

Currently MXNet uses NNVM, a module imported through a git submodule.

NNVM was originally a separate library and is now part of the TVM project.

NNVM has served MXNet well, but it is now time to evolve how we process the graph: to make it easier to maintain, powerful enough that more complex optimizations such as operator fusion can be performed, and to enable future applications such as synthesizing graph operations into compiled code or hardware description languages.

NNVM Graph limitations that will be addressed by these changes:

  • External project / subrepository, not part of the core MXNet source.
  • Heavy use of type erasure via strings [1], [2] and dmlc::any, a C++ class that can hold any type. Heavy use of any makes code very difficult to debug, as there is no type information available to the debugger, and data is laid out on the heap or the stack as opaque binary.
  • Multiple implementations representing Nodes [1], [2], with the same name but different semantics.
  • Unclear data-ownership semantics and heavy use of std::shared_ptr [1], which has memory overhead compared to unique_ptr and performance impacts from the atomic reference-counting mechanism in parallel applications (contention). This also causes difficult-to-diagnose nested data dependencies between Nodes, NDArray, Engine Variables, NodePtr, etc. There is great potential for simplification and streamlining.
  • Untyped fields such as "info", used by Autograd, have neither type nor semantics. The same applies to the untyped node attributes.
  • Additional fields and structures can't be added to the graph; it is not extensible, for the reasons above.
  • The duality between the nnvm indexed graph (indexed by integer key) and the nnvm graph that uses memory pointers to maintain the graph structure could be improved. On top of that, there are two additional representations covering NDArrays, autograd info, and the variables belonging to the graph, all of which could be simplified and coalesced.
  • The node index is hardcoded to uint32_t. While this seems inoffensive, many algorithms iterate over vector sizes, which are of type size_t. These narrowing conversions from unsigned 64-bit to unsigned 32-bit integers are mostly unchecked for overflow. While we are usually far below 2^32 nodes, abstracting the type of the indexed-graph key allows an overflow check in a single location and also gives flexibility in the size type, so we can choose narrower or wider types depending on the application; in our case uint16_t could be enough, and with the proposed changes this would be a single change in one file.
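To illustrate the last point, here is a minimal sketch (the function name is hypothetical, chosen for illustration) of what a single checked conversion point for node keys could look like. The NodeKey type is a template parameter, so switching to uint16_t or uint64_t is a one-line change elsewhere, while the overflow check lives in exactly one place:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical sketch: one checked narrowing point for graph node keys.
// Every conversion from size_t loop indices to the graph key type goes
// through here, so overflow is caught in a single location.
template <typename NodeKey>
NodeKey ToNodeKey(std::size_t i) {
  if (i > static_cast<std::size_t>(std::numeric_limits<NodeKey>::max())) {
    throw std::overflow_error("graph node index overflows NodeKey");
  }
  return static_cast<NodeKey>(i);
}
```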

Project goals:

  • Have an internalized graph library maintained as part of MXNet. This makes contributions easier and helps evolve the graph without potential design conflicts with a library that is now part of another project (TVM) with different goals and use cases than MXNet. This way we can evolve the library and fit it to MXNet's use cases.
  • Use a class passed as a template argument to the graph for node and edge attributes, so that graph attributes are stored in a maintainable, type-safe, compile-time-checked way.
  • The nodes of the graph also take templated node data and attribute classes, which decouples the implementation of generic graph algorithms (traversal, etc.) from their usage. This allows additional data fields, structured data, and classes to be added to the graph nodes and edges in a proper way.
  • The graph is type safe, extensible, reusable, and debugging friendly.
  • Developers can easily extend the graph, implementing logic to dump and process it in a generic way without having to deal with too many implementation details.
  • Generic facilities can be added, such as serialization, optimization passes, and operator-fusion passes, which will add tremendous value to MXNet as a flexible framework for the next steps of deep learning development.
  • AGInfo is simplified; such a vital part can live in the node attributes class (MXNetNodeData in the example below).
  • This change would help support higher-order gradient calculations, since these require passes on the graph that recursively generate additional backward nodes.
  • Better framework for advanced graph processing. (See other proposals affecting graph processing in this wiki node.)
  • Eliminate the need for an additional indexed graph data structure, saving memory, CPU cost, and complexity. Eliminate the need for equivalent data structures such as Node, NodePtr, and NodeEntry, coalescing them into a single one representing either a variable or an op.
  • Pave the way for powerful autodifferentiation and hybridization of tensor operations, using MXNet from statically compiled languages and further improving autograd in dynamic languages such as Python.
  • Pave the way for powerful graph optimization and operator fusion, such as what is done in TensorFlow XLA.

Examples of design changes:

A graph is a data structure that holds relationships between nodes connected by edges. There are different ways to abstract and code such a data structure so that the generic structure can be put to specific purposes, just as a linked list implemented with pointers to the next item can be exposed to users either as a templated class such as std::list, or with the pointer embedded in the user's own data structure, as is commonly done in C.

In our case we need to abstract the data fields away from the nodes, which represent variables and operators. We can call this abstraction "NodeData": a class defined by the user and passed as a template argument.

This can be done in different ways; one is to pass the NodeData class to the Graph class as a template parameter.

template<class NodeData, class EdgeData, typename NodeKey, typename EdgeKey> class Graph;

When we instantiate the graph, we define a node data structure which holds the structured data fields needed for a node, for example:

class MXNetNodeData {
 public:
  vector<NDArray*> inputs;
  vector<NDArray*> outputs;
  Autograd autograd;
  Operator* op;
  string name;
  bool is_operator();
  bool is_variable();
  bool has_autograd();
};

class EdgeData {};

Graph<MXNetNodeData, EdgeData, size_t, uint16_t> g;

g.DFSVisit([](const MXNetNodeData& x) { ... });

Differentiation and optimization would be implemented as modular, flexible graph passes, which will be encapsulated and maintainable.
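As a sketch of what such encapsulation could look like (all names here are hypothetical, not an existing MXNet API), a pass could be a small named unit run in sequence over the graph:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: a pass is a named, encapsulated transformation of the
// graph. Differentiation, shape inference and fusion would each be one pass,
// instead of one monolithic function that reaches into graph internals.
template <typename GraphT>
struct Pass {
  std::string name;
  std::function<void(GraphT*)> run;  // mutates the graph in place
};

template <typename GraphT>
void RunPasses(GraphT* g, const std::vector<Pass<GraphT>>& passes) {
  for (const auto& p : passes) {
    p.run(g);  // each pass sees the result of the previous one
  }
}
```

With this shape, a backward pass that adds gradient nodes and a later fusion pass compose without sharing internal state beyond the graph itself.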

For example, a piece of code that would be greatly improved by this is Imperative::Backward. Today this code breaks the encapsulation of the classes it uses, holding indices into the graph's internal data structures as well as node ids, pointers to arrays, and variables. It is extremely complex to follow and reason about. It can be broken into different passes that augment or mutate the graph when doing shape inference, adding backward nodes, etc.

While for correctness one would ideally prefer immutable data structures, which make code much easier to reason about and verify, this is not always possible given performance considerations. It remains a good principle to keep in mind when laying out the design, whenever possible.

In the same fashion, common functions such as DFSVisit would be part of the graph interface and take the NodeData class as the argument of the visiting function. This is similar to how it works now and would not require any major change.
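A minimal sketch of what such a DFSVisit could look like (a simplified, assumed interface for illustration, not the actual NNVM implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Simplified sketch: the graph stores the user's NodeData per node plus an
// adjacency list; DFSVisit walks it iteratively and hands each node's data
// to the visitor, as in g.DFSVisit([](const MXNetNodeData& x) { ... }).
template <typename NodeData, typename NodeKey = std::size_t>
struct Graph {
  std::vector<NodeData> data;             // one payload per node
  std::vector<std::vector<NodeKey>> adj;  // outgoing edges per node

  void DFSVisit(NodeKey root,
                const std::function<void(const NodeData&)>& visit) const {
    std::vector<bool> seen(data.size(), false);
    std::vector<NodeKey> stack{root};
    while (!stack.empty()) {
      NodeKey k = stack.back();
      stack.pop_back();
      if (seen[k]) continue;
      seen[k] = true;
      visit(data[k]);
      for (NodeKey n : adj[k]) {
        if (!seen[n]) stack.push_back(n);
      }
    }
  }
};
```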

Internally, the Graph class extends NodeData to add the NodeKey index used to maintain the adjacency list. There are different implementations for the adjacency list, such as edge buckets, depending on whether we plan to do frequent additions or deletions of edges. In our case we create the graph and then modify it by adding backward nodes during gradient calculation, but this is unlikely to be a bottleneck for our use case, since the number of nodes is not high and the bulk of the computation time is spent running the graph, not maintaining the graph data structure itself. So our main concerns are maintainability, type safety, semantics, and extensibility. This approach would still be as efficient as needed, depending on the internal implementation of the adjacency list.
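One possible internal layout, sketched as an assumption for illustration (the type and member names are hypothetical): the graph wraps the user's NodeData with the NodeKey-based adjacency bookkeeping, so generic algorithms never depend on the user's fields, and the user's class never sees keys or pointers:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical internal representation: the user's NodeData is stored
// alongside adjacency lists keyed by NodeKey, which only the Graph maintains.
template <typename NodeData, typename NodeKey>
struct InternalNode {
  NodeData data;                 // user-supplied, type-safe payload
  std::vector<NodeKey> inputs;   // incoming edges, by key
  std::vector<NodeKey> outputs;  // outgoing edges, by key
};

template <typename NodeData, typename NodeKey = std::uint32_t>
class Graph {
 public:
  NodeKey AddNode(NodeData d) {
    nodes_.push_back({std::move(d), {}, {}});
    return static_cast<NodeKey>(nodes_.size() - 1);
  }
  void AddEdge(NodeKey from, NodeKey to) {
    nodes_[from].outputs.push_back(to);
    nodes_[to].inputs.push_back(from);
  }
  const InternalNode<NodeData, NodeKey>& node(NodeKey k) const {
    return nodes_[k];
  }
  std::size_t size() const { return nodes_.size(); }

 private:
  std::vector<InternalNode<NodeData, NodeKey>> nodes_;
};
```

Because edges are plain vectors of keys, appending backward nodes during gradient calculation is cheap, matching the expected usage pattern described above.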

Additional implementation details of the graph can be discussed further, either by extending this document or by additional more detailed ones. 
