Link to dev List discussion
Currently only a very limited number of operators (such as exp) support second or higher order gradient calculation. For other operators, if users try to get the second order gradient of an operator, MXNet would issue an error message such as "mxnet.base.MXNetError: [23:15:34] src/pass/gradient.cc:192: Operator _backward_XXX is non-differentiable because it didn't register FGradient attribute." This is because MXNet backend does not implement the FGradient function for the backward node of the operator and therefore cannot support second (and higher) order of gradient. See the example of sin operator below. The gradient function is registered through a hidden operator _backward_sin. However, the _backward_sin operator does not register FGradient function.
Higher order gradient calculation is required for many applications such as adaptive learning rate optimization , WGAN-GP network , neural architecture search , etc. Implementing higher order gradient can help unlock these applications and improve the usability and popularity of Apache MXNet framework.
We will support second order gradient calculation in the autograd package as shown in the example below:
import mxnet.ndarray as nd
from mxnet import autograd
x = nd.array([1, 2, 3])
y = nd.sin(x)
# y_grad is first order gradient of y and should be cos(x)
y_grad = autograd.grad(y, x, create_graph=True, retain_graph=True)
# this call should calculate the second order of y w.r.t x which should be -sin(x)
print(x.grad) # Should be -sin(x)
This project will implement a set of operators to support adaptive learning rate optimization proposed by  . The following operators will be supported in this project. We will also keep an up-to-date document of the supported operators in MXNet repo.
|Operator||Second order support?||ETA|
Operators in MXNet backend are implemented using C++ with various libraries such as mshadow, nnvm, cblas, mkldnn etc. For each operator registered in MXNet, the gradient calculation is performed through a callback function registered as FGradient. Depending on the type of operators, the FGradient function is implemented differently. To support higher order gradient, we need to analytically sketch out the derivative of the operator and update its FGradient function individually. The FGradient function operates on the current node on the graph which contains the operator inputs and takes the head gradients as inputs.
Case I: Operators with simple FGradient function: many simple tensor arithmetic operators are implemented using simple backward functions. E.g. the sin(x) operator shown below
To support higher order gradient, we will replace the “_backward_sin” operator with an explicit cos(x) operator. Because cos(x) operator will have its own FGradient function, by calling the FGradient function recursively we can achieve many order of derivative of the sin(x) operator. A preliminary implementation as proof-of-concept is here.
Case II: Operators with complex FGradient function: in some other operators the derivative is not so straightforward or can be easily represented analytically. E.g. the softmax operator. The FGradient function is registered using a class:
For this type of operators, we have two options.
- Replace the hidden “_backward_xxx” with a simple function with other forward operators just like we did with sin(x) operator. This needs efficient representation of the analytical derivative function for each operator and requires a lot of expertise on a per operator base.
- Register a FGradient function inside the “_backward_xxx” operator. This can be simpler in terms of implementation and less destructive to current code design pattern. However, it limits the gradient to second order and cannot automatically extend to higher order of gradients.
Case III: Operators implemented using third party library such as cuDNN and MKLDNN: many neural network specific operators are using implementations directly from cuDNN and/or MKLDNN libraries if available, such as batchnorm, convolution, pooling etc. An example of convolution operator using MKLDNN library is shown below:
In such cases, we will have to depend on the third party library to provide higher order gradient support for these operators if users install MXNet library with other third party packages. For the first phase of this project, since we are dealing with multilayer perceptron network, we do not have to implement higher order derivative for these operators. We will survey the related third party packages and incorporate their higher order gradient implementation in MXNet.
Addition of New APIs
No new APIs needed
This should not affect the current first order gradient calculation in autograd
There should be no performance impact. We will track the performance through nightly performance test.
Add unit tests for each operator that have the second order gradient support
1) Use finite-difference approximation to calculate second order gradients. The pro is its generic for almost all operators and developers do not need to implement second order gradient calculation one by one. The con is picking the finite-difference value becomes another hyperparameter to tune and requires expertise from users.
2) Implement a code generation mechanism which will make it easier to create the differentiable backward operators. Currently Pytorch is using a code generation mechanism in which backward pass is implemented with Variables and defined in a yaml file, thus fully differentiable. We need to think big and provide a scalable implementation that allows for easy extensibility and external contributions. Moreover, if implementation of a differentiable backward pass is easy to code and review, it will multiply developer productivity and maintainability.
Coding NNVM graphs in C++ is extremely tedious, and we are not sure if we will run into limitations for some more complex operators like softmax or others where the gradient is more complex or has control flow.
From this regard this approach doesn't scale well as it requires modifications to every operator we need to support. We want to study how this is done in Pytorch to get ideas if it would be possible to build a more generic approach, and if this would be possible with the current MXNet architecture.
1) Support a few simple operators in Case I as proof of concept
2) Support a subset of operators that are needed by the adaptive learning rate optimization algorithm in a multi-level perceptron network 
3) Support more complex operators in Case II
4) Working with library vendors to support higher order gradient for operators using third party libraries such as MKLDNN, cuDNN etc.