...
A full tutorial is provided here, but we'll summarize a simple use case below.
Installation
Installing MXNet with TensorRT integration is an easy process. First ensure that you are running Ubuntu 16.04, that you have updated your video drivers, and that you have installed CUDA 9.0 or 9.2. You'll need a Pascal or newer generation NVIDIA GPU. You'll also have to download and install the TensorRT libraries; instructions are available here. Once these prerequisites are installed and up-to-date, you can install a special build of MXNet with TensorRT support enabled via PyPI and pip. Install the appropriate version by running:
...
```bash
nvidia-docker run -ti mxnet/tensorrt bash
```
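Once inside the container (or after a local pip install), a quick sanity check confirms that MXNet can see the GPU. This is a minimal sketch; the version string and device count will vary with your setup:

```python
import mxnet as mx

# Report the MXNet version shipped in the container / pip package
print('MXNet version:', mx.__version__)

# Confirm that at least one CUDA device is visible to MXNet
print('GPUs detected:', mx.context.num_gpus())

# A tiny allocation on the GPU fails fast if drivers or CUDA are misconfigured
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
print(x.asnumpy())
```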
Model Initialization
```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision
import time
import os

batch_shape = (1, 3, 224, 224)
resnet18 = vision.resnet18_v2(pretrained=True)
resnet18.hybridize()
resnet18.forward(mx.nd.zeros(batch_shape))
resnet18.export('resnet18_v2')
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)
```
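The export call above writes `resnet18_v2-symbol.json` and `resnet18_v2-0000.params` to the working directory, which is what `load_checkpoint` reads back. A quick check along these lines (illustrative only, not part of the original script) confirms the checkpoint round-tripped correctly:

```python
import os

# Verify the files written by resnet18.export('resnet18_v2') exist
assert os.path.exists('resnet18_v2-symbol.json')
assert os.path.exists('resnet18_v2-0000.params')

# The loaded symbol and parameter dictionaries should be non-empty
print('Symbol outputs:', sym.list_outputs())
print('Argument parameters:', len(arg_params))
print('Auxiliary parameters:', len(aux_params))
```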
Baseline MXNet Network Performance
```python
# Create sample input
input = mx.nd.zeros(batch_shape)

# Execute with MXNet
os.environ['MXNET_USE_TENSORRT'] = '0'
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape,
                           grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

# Warmup
print('Warming up MXNet')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()

# Timing
print('Starting MXNet timed run')
start = time.process_time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
end = time.process_time()
print(end - start)
```
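The raw elapsed time printed above can be hard to interpret on its own. A small helper like the following (an illustrative addition, not part of the original script) converts it into average latency and throughput for the batch size defined earlier:

```python
# Illustrative helper: turn a timed run into latency/throughput numbers
def report(elapsed_seconds, iterations, batch_size):
    latency_ms = elapsed_seconds / iterations * 1000.0
    throughput = iterations * batch_size / elapsed_seconds
    print('Average latency: %.3f ms, throughput: %.1f images/sec'
          % (latency_ms, throughput))

# Reuse the timing variables from the run above
report(end - start, 10000, batch_shape[0])
```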
TensorRT Integrated Network Performance
```python
# Execute with TensorRT
print('Building TensorRT engine')
os.environ['MXNET_USE_TENSORRT'] = '1'
arg_params.update(aux_params)
all_params = dict([(k, v.as_in_context(mx.gpu(0))) for k, v in arg_params.items()])
executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0), all_params=all_params,
                                             data=batch_shape, grad_req='null',
                                             force_rebind=True)

# Warmup
print('Warming up TensorRT')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()

# Timing
print('Starting TensorRT timed run')
start = time.process_time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
end = time.process_time()
print(end - start)
```
Benchmarking
The output should be the same when using the MXNet executor and when using the TensorRT executor. The performance speedup should be roughly 1.8x, depending on the hardware and libraries used.
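A direct numerical comparison makes that check concrete. The sketch below assumes the baseline output is captured before rebinding with TensorRT; `executor_mxnet` and `executor_trt` stand for the two executors bound above (in the scripts as written both are simply named `executor`), and the tolerances are illustrative:

```python
import numpy as np

# Run the same input through both executors and compare the results.
mxnet_out = executor_mxnet.forward(is_train=False, data=input)[0].asnumpy()
trt_out = executor_trt.forward(is_train=False, data=input)[0].asnumpy()

# Tolerances are illustrative; FP32 results should agree closely
np.testing.assert_allclose(trt_out, mxnet_out, rtol=1e-2, atol=1e-2)
print('MXNet and TensorRT outputs match within tolerance')
```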
Roadmap
Finished Items
Initial Integration
Future Work
FP16 Integration
The current integration of TensorRT into MXNet supports only FP32 float values for tensors. Allowing FP16 values would enable many further optimizations on Jetson and Volta devices.
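For context, this is roughly what an FP16 workflow looks like in MXNet today outside of the TensorRT path; it is purely an illustration of the kind of usage future support would enable, not something the current integration accepts:

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Illustration only: cast a Gluon model and its input to float16.
# The current MXNet-TensorRT integration still requires FP32 tensors.
net = vision.resnet18_v2(pretrained=True, ctx=mx.gpu(0))
net.cast('float16')
net.hybridize()

data = mx.nd.zeros((1, 3, 224, 224), ctx=mx.gpu(0), dtype='float16')
out = net(data)
print(out.dtype)  # float16
```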
https://jira.apache.org/jira/browse/MXNET-1084
Subgraph Integration
The new subgraph API is a natural fit for TensorRT. To keep the codebase consistent, we'd like to port the current TensorRT integration to the new API. The experimental integration currently requires users to call contrib APIs; once the integration has moved to the subgraph API, users will be able to use TensorRT through a consistent interface. Porting should also enable acceleration of Gluon and Module-based models.
https://jira.apache.org/jira/browse/MXNET-1085
Increased Operator (/Layer) Coverage
The current operator coverage is fairly limited. We'd like to enable all models that TensorRT is able to work with.
https://jira.apache.org/jira/browse/MXNET-1086
Currently supported operators:
Operator Name | Operator Description | Status |
---|---|---|
Convolution | | Complete |
BatchNorm | | Complete |
elemwise_add | | Complete |
elemwise_sub | | Complete |
elemwise_mul | | Complete |
rsqrt | | Complete |
Pad | | Complete |
mean | | Complete |
FullyConnected | | Complete |
Flatten | | Complete |
SoftmaxOutput | | Complete |
Activation | relu, tanh, sigmoid | Complete |
Operators to be added:
Operator Name | Operator Description | Status |
---|---|---|
Deconvolution Op | Required for several Computer Vision models. | In Progress |
elemwise_div | Required for some Wavenet implementations. | In Progress |
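To make the coverage concrete, here is a minimal symbolic network built only from operators in the supported list above; layer sizes are arbitrary and the network is for illustration, but a graph like this can be handed to the TensorRT-enabled bind call shown earlier:

```python
import mxnet as mx

# A small graph composed entirely of TensorRT-supported operators:
# Convolution, BatchNorm, Activation, Flatten, FullyConnected, SoftmaxOutput.
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data=data, num_filter=16, kernel=(3, 3), pad=(1, 1), name='conv0')
net = mx.sym.BatchNorm(data=net, name='bn0')
net = mx.sym.Activation(data=net, act_type='relu', name='relu0')
net = mx.sym.Flatten(data=net, name='flatten0')
net = mx.sym.FullyConnected(data=net, num_hidden=10, name='fc0')
net = mx.sym.SoftmaxOutput(data=net, name='softmax')

print(net.list_arguments())
```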
Benchmarks
TensorRT support is still an experimental feature, so benchmarks are likely to improve over time. As of Oct 11, 2018, we've measured the following improvements, all run with FP32 weights.
...
https://mxnet.incubator.apache.org/tutorials/tensorrt/inference_with_trt.html
Runtime Integration with TensorRT
...