This document provides a detailed description of the MXNet-TensorRT runtime integration feature. It covers advanced techniques, includes a roadmap reflecting the current state of the feature and its future directions, and contains up-to-date benchmarks. If you'd like a quick overview of the feature, with a tutorial describing a simple use case, please refer to the MXNet-hosted tutorial. For more information you may also visit the original design proposal page.
Why is TensorRT integration useful?
TensorRT can greatly speed up inference of deep learning models. One experiment on a Titan V (V100) GPU shows that with MXNet 1.2, we can get an approximately 3x speed-up when running inference of the ResNet-50 model on the CIFAR-10 dataset in single precision (fp32). As batch sizes and image sizes go up (for CNN inference), the benefit may shrink, but in general TensorRT helps most in the following cases:
- Models with many bandwidth-bound or latency-bound layers (e.g. pointwise operations) that benefit from GPU kernel fusion.
- Inference use cases which have tight latency requirements and where the client application can't wait for large batches to be queued up.
- Embedded systems, where memory constraints are tighter than on servers.
- Inference in reduced precision, especially integer (e.g. int8) inference.
In the past, the main hindrance for users wishing to benefit from TensorRT was that the model first had to be exported from the framework. Once the model was exported by some means (NNVM to TensorRT graph rewrite, via ONNX, etc.), one then had to write a TensorRT client application to feed data into the TensorRT engine. At that point the model was independent of the original framework, and TensorRT could only compute the neural network layers, so users had to bring their own data pipeline. This increased the burden on the user and reduced the likelihood of reproducibility (e.g. different frameworks may have slightly different data pipelines, or differ in the flexibility of data pipeline operation ordering). Moreover, since frameworks typically support more operators than TensorRT, one might have to resort to TensorRT plugins for operations that aren't already available via the TensorRT graph API.
The current experimental runtime integration of TensorRT with MXNet resolves the above concerns by ensuring that:
- The graph is still executed by MXNet.
- The MXNet data pipeline is preserved.
- The TensorRT runtime integration logic partitions the graph into subgraphs that are either TensorRT-compatible or TensorRT-incompatible.
- The graph partitioner collects the TensorRT-compatible subgraphs, hands them over to TensorRT, and substitutes each TensorRT-compatible subgraph with a TensorRT library call, represented as a TensorRT node in NNVM.
- If a node is not TensorRT-compatible, it is not extracted and substituted with a TensorRT call, and it still executes within MXNet.
The above points ensure that we find a compromise between the flexibility of MXNet, and fast inference in TensorRT. We do this with no additional burden to the user. Users do not need to learn how TensorRT APIs work, and do not need to write their own client application or data pipeline.
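The partitioning idea described above can be sketched in a few lines of plain Python. This is a toy illustration only, not MXNet's actual partitioner: it treats the graph as a linear list of operator names, whereas a real NNVM graph is a DAG. The function name `partition` and the contents of `TRT_COMPATIBLE` are assumptions made for this sketch (only `Activation` is listed as supported elsewhere in this document, and `Deconvolution` is listed as not yet supported).

```python
# Toy sketch: op names in this set are assumptions for illustration,
# not the real list of operators supported by the integration.
TRT_COMPATIBLE = {"Convolution", "BatchNorm", "Activation", "Pooling"}


def partition(ops, compatible=TRT_COMPATIBLE):
    """Group consecutive compatible ops of a linear graph into single
    'TensorRT' nodes; every other op stays as its own 'MXNet' node."""
    result = []
    run = []  # current run of consecutive TensorRT-compatible ops
    for op in ops:
        if op in compatible:
            run.append(op)
            continue
        if run:
            # Close the current compatible subgraph before the incompatible op.
            result.append(("TensorRT", tuple(run)))
            run = []
        result.append(("MXNet", (op,)))
    if run:
        result.append(("TensorRT", tuple(run)))
    return result


graph = ["Convolution", "BatchNorm", "Activation",
         "Deconvolution", "Pooling", "Convolution"]
for executor, ops in partition(graph):
    print(executor, ops)
```

In this sketch the first three operators collapse into one TensorRT node, `Deconvolution` stays in MXNet, and the trailing two operators form a second TensorRT node, which mirrors the "extract what is compatible, leave the rest" behaviour described in the list above.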
How do I use TensorRT integration?
A full tutorial is provided here but we'll summarize for a simple use case below.
Installing MXNet with TensorRT integration is an easy process. First ensure that you are running Ubuntu 16.04, that you have updated your video drivers, and that you have installed CUDA 9.0 or 9.2. You'll need a Pascal or newer generation NVIDIA GPU. You'll also have to download and install the TensorRT libraries (instructions here). Once these prerequisites are installed and up-to-date, you can install a special build of MXNet with TensorRT support enabled via PyPI and pip. Install the appropriate version by running:
To install with CUDA 9.0:
To install with CUDA 9.2:
If you are running an operating system other than Ubuntu 16.04, or just prefer to use a docker image with all prerequisites installed you can instead run:
Baseline MXNet Network Performance
TensorRT Integrated Network Performance
The output should be the same both when using an MXNet executor and when using a TensorRT executor. The performance speedup should be roughly 1.8x depending on the hardware and libraries used.
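A relative speedup like the one quoted above is the ratio of mean baseline latency to mean optimized latency. The following is a minimal, framework-independent sketch of how such a ratio could be measured. The helper names `mean_latency` and `relative_speedup` are hypothetical, and the two lambda workloads are stand-ins for a forward pass through each executor. When timing a real MXNet executor, the callable should block until the output is actually available (MXNet executes asynchronously), otherwise only the time needed to enqueue the work is measured.

```python
import time


def mean_latency(run_once, warmup=2, iters=10, clock=time.perf_counter):
    """Return the mean wall-clock time of one call to run_once().

    Warm-up calls are excluded so that one-off costs (e.g. engine building
    or auto-tuning) do not distort the steady-state measurement.
    """
    for _ in range(warmup):
        run_once()
    start = clock()
    for _ in range(iters):
        run_once()
    return (clock() - start) / iters


def relative_speedup(baseline_latency, optimized_latency):
    """Speedup of the optimized run relative to the baseline run."""
    if optimized_latency <= 0:
        raise ValueError("optimized latency must be positive")
    return baseline_latency / optimized_latency


# Stand-in workloads; replace with real forward passes when benchmarking.
baseline = mean_latency(lambda: sum(range(20000)))
optimized = mean_latency(lambda: sum(range(10000)))
print(f"relative speedup: {relative_speedup(baseline, optimized):.2f}x")
```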
Initial integration has been completed and launched as of MXNet 1.3. We've tested this integration against a variety of models, including all the GluonCV models, WaveNet and some custom computer vision models. Performance is roughly in line with expectations, but we're seeing a few regressions relative to earlier measurements that require investigation.
Continuous Integration support is enabled and running continually for all active PRs opened with MXNet.
PIP packages and Docker images have been published along with the MXNet 1.3 release.
The current integration of TensorRT into MXNet supports only FP32 float values for tensors. Allowing FP16 values would enable many further optimizations on Jetson and Volta devices.
The new subgraph API is a natural fit for TensorRT. To help make the codebase consistent, we'd like to port the current TensorRT integration to the new API. The experimental integration into MXNet requires us to use contrib API calls. Once the integration has moved to the subgraph API, users will be able to use TensorRT with a consistent API. Porting should also enable acceleration of Gluon- and Module-based models.
Conditional Checkout and Compilation of Dependencies
TensorRT integration required us to add a number of third-party code sub-repositories to the project. This is not ideal for users who would like to check out and build MXNet without using the TensorRT feature. In the future we should migrate the feature to be CMake-only and check out these sub-repositories at pre-compilation time, to avoid forcing all users to check them out. We can also model these dependencies using CMake such that they're automatically built and linked against when required, which would make building from scratch easier for those who do want to use TensorRT integration.
Make use of Cached TRT Engines
We've received requests from users to cache compiled TensorRT engines, similar to the cuDNN auto-tuning feature, so that we avoid the delay of building the engine each time the process starts.
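One possible shape for such a cache is an on-disk store keyed by a hash of everything that influences the built engine. The sketch below is an assumption about how this could look, not an existing MXNet feature: `cache_key`, `load_or_build`, and the `build_engine` callback are hypothetical names, and the engine is treated as an opaque byte string. A real key would likely also need the TensorRT version and GPU model, since serialized engines are generally not portable across them.

```python
import hashlib
import os
import tempfile


def cache_key(graph_json, input_shape, precision="fp32"):
    """Hash the inputs that determine the engine (hypothetical key layout)."""
    h = hashlib.sha256()
    h.update(graph_json.encode("utf-8"))
    h.update(repr(tuple(input_shape)).encode("utf-8"))
    h.update(precision.encode("utf-8"))
    return h.hexdigest()


def load_or_build(graph_json, input_shape, build_engine, cache_dir):
    """Return cached engine bytes if present, otherwise build and store them."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(graph_json, input_shape) + ".engine")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    engine_bytes = build_engine(graph_json, input_shape)
    # Write to a temporary file first so a crash never leaves a partial engine.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(engine_bytes)
    os.replace(tmp_path, path)
    return engine_bytes


# Usage with a fake builder that records how often it is invoked.
calls = []


def fake_build(graph_json, input_shape):
    calls.append(input_shape)
    return b"serialized-engine"


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_build('{"nodes": []}', (1, 3, 224, 224), fake_build, cache_dir)
    second = load_or_build('{"nodes": []}', (1, 3, 224, 224), fake_build, cache_dir)
    print("engine builds:", len(calls))
```

The second call is served from disk, so the (slow) build step runs only once per distinct graph and input shape.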
Increased Operator (/Layer) Coverage
The current operator coverage is fairly limited. We'd like to enable all models that TensorRT is able to work with.
Decouple NNVM to ONNX from NNVM to TensorRT in MXNet
The current nnvm_to_onnx classes are tightly coupled to TensorRT. We could extract all of the TensorRT-specific functionality and have a proper separation between nnvm_to_onnx and onnx_to_tensorrt. When structuring nnvm_to_onnx we should make use of an object hierarchy to convert to specific ONNX opsets, which would help us maintain compatibility with different toolsets. We should create a base class that performs generic ONNX conversions. We should then create specialized objects that inherit from the base ONNX class and take care of the differences between opsets. We should also create unit tests on a per-op basis to make sure we're not introducing regressions.
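The proposed hierarchy could look roughly like the Python sketch below (the real nnvm_to_onnx code is C++; Python is used here only for brevity). All class names are hypothetical. The `Activation` mapping uses the relu, tanh and sigmoid variants listed as supported in this document, while the `Softmax` difference between "opset A" and "opset B" is invented purely to show how a subclass overrides one conversion; it is not a statement about any real ONNX opset.

```python
class BaseOnnxConverter:
    """Generic conversions shared by every opset (hypothetical design)."""

    opset = None

    def convert(self, op_name, attrs):
        # Dispatch to a method named convert_<op>, if this opset provides one.
        handler = getattr(self, "convert_" + op_name.lower(), None)
        if handler is None:
            raise NotImplementedError(
                f"{op_name} is not supported for opset {self.opset}")
        return handler(attrs)

    def convert_activation(self, attrs):
        mapping = {"relu": "Relu", "tanh": "Tanh", "sigmoid": "Sigmoid"}
        return {"op_type": mapping[attrs["act_type"]], "attrs": {}}


class OpsetAConverter(BaseOnnxConverter):
    """Illustrative opset in which Softmax takes no attributes."""

    opset = "A"

    def convert_softmax(self, attrs):
        return {"op_type": "Softmax", "attrs": {}}


class OpsetBConverter(OpsetAConverter):
    """Illustrative opset that differs from A only in how Softmax is emitted."""

    opset = "B"

    def convert_softmax(self, attrs):
        return {"op_type": "Softmax", "attrs": {"axis": attrs.get("axis", -1)}}


for converter in (OpsetAConverter(), OpsetBConverter()):
    print(converter.opset, converter.convert("softmax", {"axis": 1}))
```

Because each operator has its own `convert_<op>` method, per-op unit tests fall out naturally: one assertion per operator and opset, as suggested above.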
Currently supported operators:
|Operator Name||Operator Description||Status|
|Activation||relu, tanh, sigmoid||Complete|
Operators to be added:
|Operator Name||Operator Description||Status|
|Deconvolution Op||Required for several Computer Vision models.||In Progress|
|elemwise_div||Required for some WaveNet implementations.||In Progress|
TensorRT is still an experimental feature, so benchmarks are likely to improve over time. As of Oct 11, 2018, we've measured the following improvements, all of which were obtained with networks using FP32 weights.
|Model Name||Relative TensorRT Speedup||Hardware|
|ResNet 18||1.8x||Titan V|
|ResNet 18||1.54x||Jetson TX1|
|ResNet 50||1.76x||Titan V|
|ResNet 101||1.99x||Titan V|