Problem Statement
RNN layers are widely used in neural networks for NLP and sequence-to-sequence (Seq2Seq) learning because of their ability to model temporal dependencies. MXNet already provides feature-rich and flexible RNN layers with which end users can easily build their NLP and Seq2Seq models. In addition, MXNet provides a fused RNN operator for users who care more about performance or whose RNN architectures are relatively fixed. Unfortunately, the fused RNN operator is implemented only for GPU, through the CuDNN interfaces. This causes several problems for MXNet users and developers:
- Models that use the fused RNN operator cannot run on CPU at all. This severely hinders users who want to migrate their models from GPU to CPU or deploy their pre-trained models on CPUs.
- The functional disparity between CPU and GPU makes MXNet code and models harder to maintain. Developers need to design code and unit tests carefully to handle this disparity.
- The existing CPU RNN layers are composed of FC operators and activation operators. These sequential GEMMs and tanh/sigmoid computations cannot fully utilize modern CPU resources and become the performance bottleneck for most RNN models.
Strengths of the Fused RNN Operator for CPU
Optimization
To maximize the computational efficiency of RNN layers, several optimization techniques are applied in the fused RNN operator for CPU:
(1) Different GEMM modes
In the design of RNN variants, one or several gates keep and propagate the dependencies among inputs at different time steps. These gates are implemented by applying fully connected layers to the inputs, hidden states or cell memories. Take LSTM as an example: below are the definitions of its four gates for one time step, namely the forget gate, input gate, update gate and output gate. From the LSTM formulas, there are 8 general matrix multiplications (GEMMs) per time step. Translating an 80-word sentence with an LSTM model therefore takes 640 matrix multiplications for a single LSTM layer, and these GEMMs are the main computation in most RNN models.
* Formula picture is from Wikipedia: Long short-term memory
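For reference, a standard per-time-step LSTM formulation (consistent with the Wikipedia article referenced above; σ(·) is the logistic sigmoid and ∘ denotes the elementwise product) is:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(update / candidate cell)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
```

The eight GEMMs per time step are the four input projections (W*x_t) and the four hidden-state projections (U*h_{t-1}).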
Fortunately, most of these GEMMs are independent of each other and can be parallelized with several optimization techniques:
- Combine independent small GEMMs into one big GEMM: a big GEMM generally achieves better efficiency than many small GEMMs, since the larger amount of computation can hide memory access latency and make better use of the concurrent threads offered by modern multi-core architectures (see the sketch after this list). However, if the inputs or weights of the RNN layer are not contiguous in memory, combining GEMMs requires allocating extra memory and copying tensors between buffers, which can be a significant overhead in some cases.
- Batch GEMM: some high-efficiency BLAS libraries, such as Intel MKL, provide a feature called "Batch GEMM". With it, users can specify multiple independent GEMM operations, possibly with different matrix sizes, different parameters and different memory addresses, through a single call to the "Batch GEMM" API. At runtime, Intel MKL BLAS intelligently executes all of these GEMMs so as to optimize overall performance. However, "Batch GEMM" is not well supported by all BLAS libraries, which limits its usage.
- Pack GEMM: for one gate in an RNN layer, the weight matrix is shared across all time steps, so it is a common factor in many GEMMs. In most GEMM implementations, input matrices are converted into an internal packed format for better performance. Intel MKL also provides packed GEMM interfaces that allow users to explicitly transform such a common matrix into the internal packed format once and pass the packed matrix to multiple GEMM calls. With this approach, the packing cost is amortized over the GEMM calls that reuse the packed matrix. As with "Batch GEMM", "Pack GEMM" is not provided by every BLAS library.
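As an illustration of the first technique, here is a minimal NumPy sketch (sizes and variable names are hypothetical) that merges the four per-gate input projections of an LSTM, across all time steps, into one big GEMM. The hidden-state projections still have to be computed step by step because of the dependency on h_{t-1}, which is where "Batch GEMM" and "Pack GEMM" help:

```python
import numpy as np

# Hypothetical sizes: T time steps, N batch size, I input size, H hidden size.
T, N, I, H = 80, 20, 512, 512
rng = np.random.default_rng(0)

x  = rng.standard_normal((T, N, I)).astype(np.float32)   # inputs for all time steps
Wx = rng.standard_normal((4 * H, I)).astype(np.float32)  # 4 gate weight matrices, stacked

# Naive: 4 small GEMMs per time step -> 4*T GEMM calls for the input projections.
naive = np.stack([np.concatenate([x[t] @ Wx[g * H:(g + 1) * H].T for g in range(4)], axis=1)
                  for t in range(T)])

# Fused: all gates and all time steps in one big GEMM, since the input
# projections have no dependency across time steps.
fused = (x.reshape(T * N, I) @ Wx.T).reshape(T, N, 4 * H)

assert np.allclose(naive, fused, atol=1e-3)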
(2) Vectorization
Besides the GEMM computations, RNN layers contain many elementwise operations, such as elementwise addition, elementwise multiplication and the sigmoid/tanh activation functions. These elementwise operations can be vectorized effectively with modern CPU instruction sets such as AVX2 and AVX512. Intel MKL also provides a set of highly optimized vector-math functions (arithmetic, power, trigonometric, exponential, hyperbolic, special and rounding) that operate on every element of an input vector.
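The elementwise stage of one LSTM time step, sketched below in NumPy with hypothetical names, shows the operations this applies to; in the C++ operator each of these lines corresponds to a vectorized loop (AVX2/AVX512) or an Intel MKL vector-math call:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical pre-activations for one time step, laid out as [i | f | c~ | o]
# blocks of width H each: gates = x_t @ Wx.T + h_prev @ Wh.T + b
N, H = 20, 800
rng = np.random.default_rng(0)
gates  = rng.standard_normal((N, 4 * H)).astype(np.float32)
c_prev = rng.standard_normal((N, H)).astype(np.float32)

i = sigmoid(gates[:, 0 * H:1 * H])   # input gate
f = sigmoid(gates[:, 1 * H:2 * H])   # forget gate
g = np.tanh(gates[:, 2 * H:3 * H])   # candidate cell state
o = sigmoid(gates[:, 3 * H:4 * H])   # output gate

c = f * c_prev + i * g               # elementwise multiply / add
h = o * np.tanh(c)                   # new hidden state
```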
(3) Intermediate result reuse
For RNN training, many intermediate results of the forward computation can be reused in the backward computation. We can keep a workspace (reserved memory) and share it between the forward and backward passes to improve backward performance. There are two ways to keep such a workspace: first, save the intermediate results in an operator state, making the operator stateful; second, emit them as an additional output of the forward pass and feed that output into the backward function.
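As a small illustration of this idea (not the actual operator code; names are hypothetical), consider the output stage h_t = o_t * tanh(c_t): if the forward pass records o_t and tanh(c_t) in the workspace, the backward pass can form its gradients directly instead of recomputing the tanh:

```python
import numpy as np

# Forward output stage: record the intermediates that backward will need.
def output_stage_forward(o_t, c_t, workspace):
    tanh_c_t = np.tanh(c_t)
    workspace.append((o_t, tanh_c_t))   # kept in the reserved buffer
    return o_t * tanh_c_t               # h_t

# Backward output stage: reuse them instead of recomputing tanh(c_t).
def output_stage_backward(dh_t, saved):
    o_t, tanh_c_t = saved
    do_t = dh_t * tanh_c_t                      # dL/do_t
    dc_t = dh_t * o_t * (1.0 - tanh_c_t ** 2)   # dL/dc_t
    return do_t, dc_t
```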
Usability
The fused RNN operator for CPU shares the same interface and parameter definitions as the existing `sym.RNN` operator for GPU, so end users can migrate their models from GPU to CPU by changing only the running context. Note that some features of the CuDNN interfaces, such as dropout and linear input, are not supported by the CPU implementation yet; these cases either fall back to the FC-layer-based RNN cells or raise a proper error message.
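A minimal migration sketch, assuming the existing `sym.RNN` parameter names (`data`, `parameters`, `state`, `state_cell`, `state_size`, `num_layers`, `mode`); the flattened parameter blob and exact shapes are filled in by the operator's shape inference:

```python
import mxnet as mx

seq_len, batch, input_size, hidden_size, num_layers = 300, 20, 800, 800, 1

data   = mx.sym.Variable('data')        # shape: (seq_len, batch, input_size)
params = mx.sym.Variable('rnn_params')  # single flattened weight/bias blob
init_h = mx.sym.Variable('init_h')
init_c = mx.sym.Variable('init_c')

rnn = mx.sym.RNN(data=data, parameters=params, state=init_h, state_cell=init_c,
                 state_size=hidden_size, num_layers=num_layers,
                 mode='lstm', state_outputs=True, name='lstm')

# The same symbol runs on CPU or GPU; only the binding context changes.
ctx = mx.cpu()   # or mx.gpu(0)
exe = rnn.simple_bind(ctx=ctx,
                      data=(seq_len, batch, input_size),
                      init_h=(num_layers, batch, hidden_size),
                      init_c=(num_layers, batch, hidden_size))
```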
Scalability
Many RNN variants exist to solve different problems in the NLP and Seq2Seq learning fields; vanilla RNN, GRU and LSTM are the three most popular. A given RNN cell can also be extended to bidirectional and multi-layer scenarios. We have implemented a unified user interface and architecture for the fused RNN operator that can easily be extended to other RNN variants. GRU and LSTM are already supported under this design for both unidirectional and bidirectional computation, and multi-layer GRU and LSTM are provided for users who want to build deep RNN models.
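Building on the sketch above, other variants and deeper or bidirectional configurations only change a few parameters, for example a hypothetical 5-layer bidirectional GRU:

```python
import mxnet as mx

data   = mx.sym.Variable('data')
params = mx.sym.Variable('rnn_params')
init_h = mx.sym.Variable('init_h')   # shape: (num_layers * 2, batch, hidden_size)

gru = mx.sym.RNN(data=data, parameters=params, state=init_h,
                 state_size=800, num_layers=5, bidirectional=True,
                 mode='gru', state_outputs=True, name='bigru')
```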
Operator Design
Operator Execution Flow
The graph below demonstrates the design and execution flow of the fused RNN operator in MXNet. Green blocks are already implemented for NVIDIA GPU through the CuDNN interfaces. Yellow blocks were recently integrated by PR#9977, for the LSTM inference path only. Blue blocks will be added to extend PR#9977 to training and to other RNN variants. Currently, PR#10104 is submitted for the fused LSTM implementation and PR#10311 for the fused GRU implementation. Vanilla RNN support is planned and will be provided in the future.
Operator Registration
Currently, `sym.RNN` is registered in MXNet through the legacy DMLC interfaces. We are refactoring this part of the code with the NNVM interfaces. In the NNVM registration design, operator creation, caching and workspace sharing need to be redesigned so that this information can be passed between the forward and backward passes and across iterations.
(1) Operator caching
A static thread-local hash map is defined and cached in each operator computing thread. The key of the hash map is generated from the RNNOp parameters and input shapes, and the value is a shared pointer to an RNNOp instance. With this mechanism, we remove the overhead of creating operator instances and share them across all iterations.
(2) Workspace sharing
As described above, reusing the forward intermediate results during the backward computation reduces the amount of computation and greatly improves backward performance. A reserved workspace buffer is defined as a private member of the RNNOp class. This buffer stores the intermediate results during the forward computation and is reused during the backward computation. If the operator instance is cached and reused in later iterations, the workspace buffer is reused as well; it is released when the operator instance is destructed. A conceptual sketch of both mechanisms follows.
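This Python sketch is only conceptual (the real implementation lives in the C++ RNNOp; all names here are hypothetical), but it shows how the thread-local cache and the per-instance workspace fit together:

```python
import threading

_tls = threading.local()   # one operator cache per computing thread

class RNNOpInstance:
    """Stands in for the C++ RNNOp; owns a reserved workspace buffer."""
    def __init__(self, key):
        self.key = key
        self.workspace = None   # allocated lazily in forward, read again in backward,
                                # and reused by later iterations that hit this instance

def get_cached_op(mode, num_layers, bidirectional, state_size, data_shape):
    # Key built from the operator parameters and the input shapes.
    key = (mode, num_layers, bidirectional, state_size, tuple(data_shape))
    cache = getattr(_tls, 'ops', None)
    if cache is None:
        cache = _tls.ops = {}
    if key not in cache:
        cache[key] = RNNOpInstance(key)   # created once, then shared across iterations
    return cache[key]
```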
Performance
Single-layer, unidirectional test results are shown below:
Configuration: T, N, I, H = 300, 20, 800, 800; layers = 1; bidirectional = False; throughput in samples/sec on SKX8180.

| | Non-FusedRNN | FusedRNN | Speedup (FusedRNN / Non-FusedRNN) |
| --- | --- | --- | --- |
| LSTM-Inference | 187.09 | 394.73 | 210.98% |
| LSTM-Training (fwd+bwd) | 73.23 | 153.53 | 209.65% |
| GRU-Inference | 128.21 | 392.16 | 305.87% |
| GRU-Training (fwd+bwd) | 80.32 | 171.91 | 214.03% |
| vRNN(Relu)-Inference | 518.13 | 1538.46 | 296.92% |
| vRNN(Relu)-Training (fwd+bwd) | 202.02 | 357.14 | 176.79% |
| vRNN(Tanh)-Inference | 492.61 | 952.38 | 193.33% |
| vRNN(Tanh)-Training (fwd+bwd) | 198.02 | 318.98 | 161.08% |
Five-layer bidirectional (vRNN/LSTM/GRU) test results are shown below:
Configuration: T, N, I, H = 300, 20, 800, 800; layers = 5; bidirectional = True; throughput in samples/sec on SKX8180.

| | Non-FusedRNN | FusedRNN | Speedup (FusedRNN / Non-FusedRNN) |
| --- | --- | --- | --- |
| LSTM-Inference | 37.24 | 107.13 | 287.67% |
| LSTM-Training (fwd+bwd) | 12.93 | 32.29 | 249.73% |
| GRU-Inference | 26.67 | 88.9 | 333.33% |
| GRU-Training (fwd+bwd) | 15.04 | 39.2 | 260.64% |
| vRNN(Relu)-Inference | 40.73 | 134.23 | 329.53% |
| vRNN(Relu)-Training (fwd+bwd) | 22.60 | 35.97 | 159.17% |
| vRNN(Tanh)-Inference | 38.91 | 104.17 | 267.71% |
| vRNN(Tanh)-Training (fwd+bwd) | 22.73 | 34.01 | 149.66% |
Upstream
MKL-DNN Integration
Intel MKL-DNN is an open-source performance library for deep learning applications that accelerates deep learning frameworks on Intel architecture. Recently, MKL-DNN added RNN primitives to its master branch on GitHub. These primitives are still experimental and their performance is not yet good enough. The MKL-DNN team is collecting user feedback and continues to improve them. Currently, vanilla RNN, LSTM and GRU, as well as their bidirectional and multi-layer variants, are supported by MKL-DNN.
We will integrate the MKL-DNN RNN primitives into MXNet once they are mature and their application programming interfaces have settled. After that integration, the fused RNN operators for CPU described in the sections above will remain in MXNet as a reference CPU implementation: if MKL-DNN is not enabled at compile time, RNN layers in a user's model can still run through our fused RNN operator with good performance.
We will keep working on functional parity and consistency among our fused RNN operator, the MKL-DNN primitives and the CuDNN implementation.