Step-by-step guide
MXNet-TensorRT Runtime Integration (authored by Kellen Sunderland)

What is this?

This document describes how to use the MXNet-TensorRT runtime integration to accelerate model inference.

Why is TensorRT integration useful?

TensorRT can greatly speed up inference of deep learning models. One experiment on a Titan V (V100) GPU shows that with MXNet 1.2, we can get an approximately 3x speed-up when running inference of the ResNet-50 model on the CIFAR-10 dataset in single precision (fp32). As batch sizes and image sizes go up (for CNN inference), the benefit may be less, but in general TensorRT helps most with small batches and low-latency inference.
In the past, the main hindrance for a user wishing to benefit from TensorRT was that the model first needed to be exported from the framework. Once the model was exported through some means (an NNVM-to-TensorRT graph rewrite, via ONNX, etc.), one then had to write a TensorRT client application to feed the data into the TensorRT engine. At that point the model was independent of the original framework, and since TensorRT could only compute the neural network layers while the user had to bring their own data pipeline, this increased the burden on the user and reduced the likelihood of reproducibility (e.g. different frameworks may have slightly different data pipelines, or differ in the flexibility of data pipeline operation ordering). Moreover, since frameworks typically support more operators than TensorRT, one might have had to resort to TensorRT plugins for operations not already available via the TensorRT graph API. The current experimental runtime integration of TensorRT with MXNet resolves the above concerns.
The above points let us strike a compromise between the flexibility of MXNet and fast inference in TensorRT, with no additional burden on the user. Users do not need to learn how the TensorRT APIs work, and do not need to write their own client application or data pipeline.

How do I build MXNet with TensorRT integration?

Building MXNet together with TensorRT is somewhat complex. The recipe will hopefully be simplified in the near future, but for now it's easiest to build a Docker container with an Ubuntu 16.04 base. The Dockerfile can be found under the ci subdirectory of the MXNet repository. You can build the container as follows:
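A build command along these lines should work from the root of an MXNet checkout. Note that the Dockerfile path and the image tag below are illustrative, not authoritative — check the ci subdirectory of your checkout for the actual TensorRT Dockerfile name:

```shell
# Build the TensorRT-enabled MXNet container from the MXNet source root.
# The Dockerfile name under ci/ is illustrative -- verify it in your checkout.
docker build -t mxnet/tensorrt -f ci/docker/Dockerfile.build.ubuntu_gpu_tensorrt .
```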
Next, we can run this container as follows (don't forget to install nvidia-docker):
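A typical invocation might look like the following (the image tag mxnet/tensorrt is the illustrative one used in the build step above; adjust to whatever you tagged your image):

```shell
# Run the container with GPU access via nvidia-docker.
# --rm removes the container on exit; -ti gives an interactive shell.
nvidia-docker run -ti --rm mxnet/tensorrt bash
```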
After starting the container, you will find yourself in the /opt/mxnet directory by default.

Running a "hello, world" model / unit test (LeNet-5 on MNIST)

You can then run the LeNet-5 unit test, which trains LeNet-5 on MNIST using the symbolic API. The test then runs inference in MXNet both with and without the MXNet-TensorRT runtime integration. Finally, the test displays a comparison of both runtimes' accuracy scores. The test can be run as follows:
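For example, from inside the container (working directory /opt/mxnet), something along these lines should run the test. The test file path below is a hypothetical illustration — locate the actual TensorRT LeNet-5 test under the tests directory of your checkout:

```shell
# Illustrative path -- substitute the actual TensorRT LeNet-5 test file.
python tests/python/tensorrt/test_tensorrt_lenet5.py
```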
You should get a result similar to the following:
Running more complex models

The unit test directory also provides a way to run models from the Gluon model zoo after slight modifications. The tested models are CNN classification models from the Gluon zoo; they are mostly based on ResNet, but include ResNeXt as well.
Please note that even though these examples are based on CIFAR-10 (chosen because the dataset can be accessed without formal registration or preprocessing), everything should work fine with models trained on ImageNet as well, using MXNet's ImageNet iterators based on the RecordIO representation of the ImageNet dataset. The script can be run simply as follows:
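An invocation along these lines should work from inside the container. The script name below is a hypothetical placeholder — substitute the actual model-zoo test script from the TensorRT unit-test directory of your checkout:

```shell
# Illustrative path -- substitute the actual Gluon model-zoo test script.
python tests/python/tensorrt/test_cifar10_models.py
```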
Here's some sample output for inference with batch size 16 (TensorRT is especially useful for small batches for low-latency production inference):
As you can see, the speed-up varies by model. ResNet-110 has more layers that can be fused than ResNet-56, hence the speed-up is greater.

Running TensorRT with your own models with the symbolic API

When building your own models, feel free to use the above ResNet-50 model as an example. Here, we highlight a small number of issues that need to be taken into account.
```python
# Bind for inference: no gradients, weights supplied via shared_buffer.
executor = sym.simple_bind(ctx=ctx, data=data_shape, softmax_label=sm_shape,
                           grad_req='null', shared_buffer=all_params,
                           force_rebind=True)
```
The weights passed as the shared_buffer can be assembled by loading a checkpoint and merging the argument and auxiliary parameters into a single dictionary:

```python
def merge_dicts(*dict_args):
    """Merge any number of dictionaries into a single new dictionary."""
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

sym, arg_params, aux_params = mx.model.load_checkpoint(model_prefix, epoch)
all_params = merge_dicts(arg_params, aux_params)
```

Running inference over the test iterator then looks as follows:

```python
for idx, dbatch in enumerate(test_iter):
    data = dbatch.data[0]
    executor.arg_dict["data"][:] = data
    executor.forward(is_train=False)
    preds = executor.outputs[0].asnumpy()
    top1 = np.argmax(preds, axis=1)
```
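From the per-batch argmax predictions, overall top-1 accuracy can be computed with plain NumPy. This is a generic sketch, independent of MXNet, for readers assembling their own evaluation loop:

```python
import numpy as np

def top1_accuracy(preds, labels):
    """Fraction of samples whose highest-scoring class matches the label.

    preds: (N, num_classes) array of scores; labels: (N,) array of ints.
    """
    top1 = np.argmax(preds, axis=1)
    return float(np.mean(top1 == labels))
```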
The MXNET_USE_TENSORRT environment variable controls whether inference goes through TensorRT. A small helper makes toggling it easy:

```python
def set_use_tensorrt(status=False):
    os.environ["MXNET_USE_TENSORRT"] = str(int(status))
```

Now, assuming that the logic to bind a symbol and run inference in batches is wrapped in a run_inference function, we can compare both backends:

```python
print("Running inference in MXNet")
set_use_tensorrt(False)
mx_pct = run_inference(sym, arg_params, aux_params, mnist,
                       all_test_labels, batch_size=batch_size)

print("Running inference in MXNet-TensorRT")
set_use_tensorrt(True)
trt_pct = run_inference(sym, arg_params, aux_params, mnist,
                        all_test_labels, batch_size=batch_size)
```

Simply switching the flag allows us to go back and forth between MXNet and MXNet-TensorRT inference. See the details in the unit test.

Running TensorRT with your own models with the Gluon API

Note: Please first read the previous section titled "Running TensorRT with your own models with the symbolic API" - it contains information that will also be useful for Gluon users.

Note: If you wish to use the Gluon vision models, it's necessary to install the gluoncv package.
The above package is based on a separate repository. For Gluon models specifically, we need to add a data symbol to the model to load the data, and to apply the softmax layer ourselves, because the Gluon models only expose the logits. This is shown below:

```python
net = gluoncv.model_zoo.get_model(model_name, pretrained=True)
data = mx.sym.var('data')
out = net(data)
softmax = mx.sym.SoftmaxOutput(out, name='softmax')
```

As in the symbolic API case, we need to provide the weights during the bind call:

```python
net = gluoncv.model_zoo.get_model(model_name, pretrained=True)
all_params = dict([(k, v.data()) for k, v in net.collect_params().items()])
executor = softmax.simple_bind(ctx=ctx, data=(batch_size, 3, 32, 32),
                               softmax_label=(batch_size,), grad_req='null',
                               shared_buffer=all_params, force_rebind=True)
```

Note that for Gluon-trained models, we should use Gluon's data pipeline to replicate the behavior of the pipeline that was used for training (e.g. using the same data scaling). Here's how to get the Gluon data iterator for the CIFAR-10 examples:

```python
gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=False).transform_first(transform_test),
    batch_size=batch_size, shuffle=False, num_workers=num_workers)
```

For more details, see the unit test examples.

Examples

The sections above describe how to launch unit tests on pre-trained models as examples. For cross-reference, the launch shell scripts have also been added here.