This document provides a detailed description of the MXNet-TensorRT runtime integration feature. It covers advanced techniques, includes a roadmap reflecting the current state of the feature and future directions, and provides up-to-date benchmarks. If you'd like a quick overview of the feature with a tutorial describing a simple use case, please refer to the MXNet hosted tutorial. For more information you may also visit the original design proposal page.
Table of Contents
Why is TensorRT integration useful?
...
A full tutorial is provided here, but we'll summarize a simple use case below.
Installation
Installing MXNet with TensorRT integration is an easy process. First ensure that you are running Ubuntu 16.04, that you have updated your video drivers, and that you have installed CUDA 9.0 or 9.2. You'll need a Pascal or newer generation NVIDIA GPU. You'll also need to download and install the TensorRT libraries (instructions here). Once these prerequisites are installed and up to date, you can install a special build of MXNet with TensorRT support enabled via PyPI and pip. Install the appropriate version by running:
...
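The package name depends on your CUDA version. As an example, a likely invocation assuming the CUDA 9.2 build (check PyPI for the exact package name matching your setup):

```
pip install mxnet-tensorrt-cu92
```

Alternatively, you can pull a prebuilt Docker image with TensorRT support enabled: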
```
nvidia-docker run -ti mxnet/tensorrt bash
```
Model Initialization
```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision
import time
import os

batch_shape = (1, 3, 224, 224)

# Download a pretrained ResNet-18 from the Gluon model zoo, hybridize it,
# and run one forward pass so the graph can be exported
resnet18 = vision.resnet18_v2(pretrained=True)
resnet18.hybridize()
resnet18.forward(mx.nd.zeros(batch_shape))

# Export writes resnet18_v2-symbol.json and resnet18_v2-0000.params,
# which load_checkpoint reads back as a symbol and parameter dicts
resnet18.export('resnet18_v2')
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)
```
Baseline MXNet Network Performance
```python
# Create sample input
input = mx.nd.zeros(batch_shape)

# Execute with MXNet
os.environ['MXNET_USE_TENSORRT'] = '0'
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

# Warmup
print('Warming up MXNet')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()

# Timing
print('Starting MXNet timed run')
start = time.time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
end = time.time()
print(end - start)
```
TensorRT Integrated Network Performance
```python
# Execute with TensorRT
print('Building TensorRT engine')
os.environ['MXNET_USE_TENSORRT'] = '1'
arg_params.update(aux_params)
all_params = dict([(k, v.as_in_context(mx.gpu(0))) for k, v in arg_params.items()])
executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0), all_params=all_params,
                                             data=batch_shape, grad_req='null', force_rebind=True)

# Warmup
print('Warming up TensorRT')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()

# Timing
print('Starting TensorRT timed run')
start = time.time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
end = time.time()
print(end - start)
```
The output should be the same whether you run the network with the MXNet executor or the TensorRT executor. The speedup should be roughly 1.8x, depending on the hardware and libraries used.
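To sanity-check that the two executors agree numerically, you can compare their outputs directly. A minimal sketch reusing the objects defined above; the tolerances here are assumptions and may need adjusting for deeper networks:

```python
import numpy as np

# Run the baseline MXNet executor (TensorRT disabled)
os.environ['MXNET_USE_TENSORRT'] = '0'
mx_executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape,
                              grad_req='null', force_rebind=True)
mx_executor.copy_params_from(arg_params, aux_params)
mx_out = mx_executor.forward(is_train=False, data=input)[0].asnumpy()

# Run the TensorRT executor
os.environ['MXNET_USE_TENSORRT'] = '1'
trt_executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0),
                                                 all_params=all_params,
                                                 data=batch_shape,
                                                 grad_req='null',
                                                 force_rebind=True)
trt_out = trt_executor.forward(is_train=False, data=input)[0].asnumpy()

# FP32 results should agree to within a small tolerance
assert np.allclose(mx_out, trt_out, rtol=1e-3, atol=1e-4)
```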
Roadmap
Finished Items
Initial integration was completed and launched with MXNet 1.3. We've tested the integration against a variety of models, including all of the GluonCV models, WaveNet, and some custom computer vision models. Performance is roughly in line with expectations, but we're seeing a few regressions relative to earlier measurements that require investigation.
Continuous integration support is enabled and runs for all active PRs opened against MXNet.
PIP packages and Docker images have been published along with the MXNet 1.3 release.
Future work
FP16 Integration
The current integration of TensorRT into MXNet supports only FP32 float values for tensors. Allowing FP16 values would enable many further optimizations on Jetson and Volta devices.
https://jira.apache.org/jira/browse/MXNET-1084
Subgraph Integration
The new subgraph API is a natural fit for TensorRT. To keep the codebase consistent, we'd like to port the current TensorRT integration to the new API. The current experimental integration requires users to make contrib API calls; once the integration has moved to the subgraph API, users will be able to use TensorRT through a consistent interface. Porting should also enable acceleration of Gluon and Module based models.
https://jira.apache.org/jira/browse/MXNET-1085
Conditional Checkout and Compilation of Dependencies
TensorRT integration required us to add a number of third-party code sub-repositories to the project. This is not ideal for users who would like to check out and build MXNet without the TensorRT feature. In the future we should make the feature CMake-only and check out these dependencies at pre-compilation time, so that users who don't need TensorRT aren't forced to fetch the subrepos. We can also model these dependencies in CMake so that they're automatically built and linked against when required, which would make building from scratch easier for those who do want TensorRT integration.
Make use of Cached TRT Engines
Similar to the cuDNN auto-tuning feature, we've received requests from users to cache compiled TensorRT engines so that we avoid the delay of rebuilding the engine every time a process starts.
https://jira.apache.org/jira/browse/MXNET-1152
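As an illustration of the requested behavior (this is not an existing MXNet API), a file-based cache might key serialized engines on a hash of the graph definition. In this sketch, `get_or_build_engine` and its caller-supplied `build_engine` function are hypothetical:

```python
import hashlib
import os

def get_or_build_engine(serialized_symbol, build_engine, cache_dir='.trt_cache'):
    """Hypothetical sketch: reuse a serialized TensorRT engine if one was
    already built for this graph, otherwise build and cache it."""
    os.makedirs(cache_dir, exist_ok=True)
    # Key the cache on a hash of the serialized graph definition
    key = hashlib.sha256(serialized_symbol.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, key + '.engine')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return f.read()  # previously serialized engine bytes
    engine_bytes = build_engine(serialized_symbol)  # caller-supplied builder
    with open(path, 'wb') as f:
        f.write(engine_bytes)
    return engine_bytes
```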
Increased Operator (/Layer) Coverage
The current operator coverage is fairly limited. We'd like to support all models that TensorRT is able to accelerate.
https://jira.apache.org/jira/browse/MXNET-1086
Decouple NNVM to ONNX from NNVM to TensorRT in MXNet
https://jira.apache.org/jira/browse/MXNET-1252
The current nnvm_to_onnx classes are tightly coupled to TensorRT. We could extract all of the TensorRT-specific functionality and establish a proper separation between nnvm_to_onnx and onnx_to_tensorrt. When restructuring nnvm_to_onnx we should use an object hierarchy that targets specific ONNX opsets, which would help us maintain compatibility with different toolsets: a base class that performs generic ONNX conversions, plus specialized classes that inherit from it and handle the differences between opsets. We should also create unit tests on a per-op basis to make sure we're not introducing regressions.
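A minimal sketch of the proposed hierarchy; all class and method names here are illustrative, not existing MXNet code:

```python
class NNVMToONNXConverter:
    """Illustrative base class performing generic NNVM-to-ONNX conversion."""
    opset_version = None

    def convert_graph(self, nodes):
        # Dispatch each (op_name, node) pair to a per-op convert_* method
        return [getattr(self, 'convert_' + op.lower())(node) for op, node in nodes]

    def convert_convolution(self, node):
        # Generic Conv conversion shared across opsets (placeholder)
        return ('Conv', node)


class Opset7Converter(NNVMToONNXConverter):
    """Overrides only the conversions that differ in ONNX opset 7."""
    opset_version = 7

    def convert_batchnorm(self, node):
        # Opset-7-specific BatchNormalization handling (placeholder)
        return ('BatchNormalization', node)
```

Per-op unit tests could then instantiate each converter class and check the emitted ONNX nodes against known-good outputs.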
Currently supported operators:
Operator Name | Operator Description | Status |
---|---|---|
Convolution | | Complete |
BatchNorm | | Complete |
elemwise_add | | Complete |
elemwise_sub | | Complete |
elemwise_mul | | Complete |
rsqrt | | Complete |
Pad | | Complete |
mean | | Complete |
FullyConnected | | Complete |
Flatten | | Complete |
SoftmaxOutput | | Complete |
Activation | relu, tanh, sigmoid | Complete |
Operators to be added:
Operator Name | Operator Description | Status |
---|---|---|
Deconvolution Op | Required for several Computer Vision models. | In Progress |
elemwise_div | Required for some Wavenet implementations. | In Progress |
Benchmarks
TensorRT integration is still an experimental feature, so benchmarks are likely to improve over time. As of Oct 11, 2018 we've measured the following improvements, all run with FP32 weights.
Model Name | Relative TensorRT Speedup | Hardware |
---|---|---|
cifar_resnet20_v2 | 1.21x | Titan V |
cifar_resnext29_16x64d | 1.26x | Titan V |
ResNet-18 | 1.8x | Titan V |
ResNet-18 | 1.54x | Jetson TX1 |
ResNet-50 | 1.76x | Titan V |
ResNet-101 | 1.99x | Titan V |
AlexNet | 1.4x | Titan V |
Related articles
https://mxnet.incubator.apache.org/tutorials/tensorrt/inference_with_trt.html
Runtime Integration with TensorRT
...