Introduction

This page details benchmark results comparing MXNet 1.3.0 with and without MKL-DNN (integration proposal). The results clearly show that MKL-DNN boosts inference throughput by 6x to 37x and reduces latency by 2x to 41x, while accuracy is equivalent up to an epsilon of 1e-8.

Inference Performance

These performance results were gathered on an AWS EC2 c5.18xlarge instance using 1 socket (1 processor).

Using 2 sockets can provide about a 2x speedup in throughput, while latency stays constant.

Performance on Intel CPU with Intel MKL-DNN backend in release 1.3

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ export OMP_NUM_THREADS=18

$ numactl --physcpubind=0-17 --membind=0 python …
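The three settings above are tied together by the core count: one OpenMP thread per physical core, with both threads and memory pinned to a single socket. A minimal Python sketch of how such a launch line could be assembled is given here. The helper name `pinned_launch` and the `benchmark.py` placeholder are hypothetical (the original command is elided after `python`), and the sketch assumes the CPU ids on the chosen NUMA node are contiguous, as the `0-17` range above suggests.

```python
import os
import subprocess  # only needed if the commented-out run() call is enabled

def pinned_launch(num_cores, command, first_cpu=0, numa_node=0):
    """Build (env, argv) that pin `command` to `num_cores` CPUs of one NUMA node.

    Assumes CPU ids first_cpu .. first_cpu + num_cores - 1 all belong to numa_node.
    """
    env = dict(os.environ)
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    env["OMP_NUM_THREADS"] = str(num_cores)  # one OpenMP thread per bound core
    last_cpu = first_cpu + num_cores - 1
    argv = [
        "numactl",
        f"--physcpubind={first_cpu}-{last_cpu}",
        f"--membind={numa_node}",
    ] + list(command)
    return env, argv

# benchmark.py is a placeholder, not the script used for the results on this page.
env, argv = pinned_launch(18, ["python", "benchmark.py"])
print(" ".join(argv))

# To actually launch it (requires numactl to be installed):
# subprocess.run(argv, env=env, check=True)
```

Keeping the thread count and the CPU range derived from the same `num_cores` value avoids oversubscribing the bound cores when the settings are changed for a different instance type.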


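Latency is measured at batchsize=1 (ms, smaller is better). Throughput is measured at batchsize=128 unless noted otherwise (fps, bigger is better).

| Category | Model | Latency w/o MKL-DNN | Latency w/ MKL-DNN | Latency speedup | Throughput w/o MKL-DNN | Throughput w/ MKL-DNN | Throughput speedup |
|---|---|---|---|---|---|---|---|
| CNN/classification | ResNet-50 v1 | 97.19 | 13.04 | 7.45 | 10.29 | 163.52 | 15.90 |
| | ResNet-50 v2 | 98.69 | 13.02 | 7.58 | 9.94 | 154.17 | 15.51 |
| | Inception v3 | 175.17 | 16.77 | 10.44 | 5.74 | 135.33 | 23.57 |
| | Inception v4 | 330.93 | 31.40 | 10.54 | 3.04 | 69.60 | 22.87 |
| | DenseNet | 111.66 | 18.90 | 5.91 | 8.52 | 149.88 | 17.60 |
| | MobileNet | 38.56 | 4.42 | 8.73 | 24.87 | 512.25 | 20.60 |
| | VGG16 | 406.50 | 20.07 | 20.25 | 2.91 | 70.84 | 24.31 |
| | AlexNet | 64.60 | 3.80 | 17.00 | 26.58 | 965.20 | 36.32 |
| | inception-resnet v2 | 181.10 | 49.40 | 3.67 | 5.48 | 82.97 | 15.14 |
| CNN/object detection | Faster R-CNN | 1175.74 | 118.62 | 9.91 | 0.85 | 8.57 | 10.08 |
| | SSD-VGG16 | 721.03 | 47.62 | 15.14 | 1.43 (batchsize=224) | 28.90 (batchsize=224) | 19.13 |
| | SSD-MobileNet | 239.40 | 28.33 | 8.45 | 4.07 (batchsize=256) | 69.97 (batchsize=256) | 14.18 |
| RNN | GNMT | 683.43 | 94.00 | 7.27 | 1.46 (batchsize=64) | 10.63 (batchsize=64) | 6.83 |
| GAN | DCGAN | 8.94 | 0.24 | 37.85 | 109.13 | 4249.36 | 38.94 |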
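For readers who want to collect comparable numbers, the following is a minimal sketch of how latency (batchsize=1) and throughput (larger batch) could be timed. It is not the script used for the results above: the `measure` helper and the `dummy_model` workload are hypothetical stand-ins. With MXNet's asynchronous execution, a real forward pass would need to block until its outputs are ready (for example by calling `mx.nd.waitall()`) before the timer is stopped.

```python
import time

def measure(run_batch, batch_size, iterations=20, warmup=5):
    """Return (latency_ms, throughput_fps) for repeated calls of run_batch(batch_size)."""
    for _ in range(warmup):              # warm-up calls are excluded from timing
        run_batch(batch_size)
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iterations * 1000.0           # average time per call
    throughput_fps = batch_size * iterations / elapsed   # samples per second
    return latency_ms, throughput_fps

def dummy_model(batch_size):
    # Stand-in workload whose cost grows with the batch size.
    # Replace with a real, blocking forward pass.
    return sum(i * i for i in range(100 * batch_size))

latency_ms, _ = measure(dummy_model, batch_size=1)
_, throughput_fps = measure(dummy_model, batch_size=128)
print(f"latency: {latency_ms:.3f} ms, throughput: {throughput_fps:.1f} fps")
```

Latency and throughput are reported from separate runs because a batch size of 1 minimizes per-request time, while a large batch keeps all cores busy.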

Performance on AMD CPU with Intel MKL-DNN backend in release 1.3

The m5a.24xlarge instance offers 96 vCPUs using AMD EPYC processors (AVX2).


Throughput is measured at batchsize=32 (fps, bigger is better).

| Category | Model | w/o MKL-DNN | w/ MKL-DNN | speedup |
|---|---|---|---|---|
| CNN/classification | ResNet-50 v1 | 2.44 | 38.57 | x15.8 |
| | MobileNet | 5.03 | 194.7 | x38.7 |
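The speedup columns appear to be simple ratios: for throughput, the value with MKL-DNN divided by the value without; for latency, the value without MKL-DNN divided by the value with. A short sketch under that assumption is given here, using figures from the tables on this page; the function names are illustrative only.

```python
def throughput_speedup(fps_without, fps_with):
    """Higher fps is better, so the speedup is with / without."""
    return fps_with / fps_without

def latency_speedup(ms_without, ms_with):
    """Lower latency is better, so the speedup is without / with."""
    return ms_without / ms_with

# AMD throughput figures (fps at batchsize=32): (w/o MKL-DNN, w/ MKL-DNN)
amd_throughput = {
    "ResNet-50 v1": (2.44, 38.57),
    "MobileNet": (5.03, 194.7),
}
for model, (without, with_mkldnn) in amd_throughput.items():
    print(f"{model}: x{throughput_speedup(without, with_mkldnn):.1f}")

# Intel latency figures for ResNet-50 v1 (ms at batchsize=1)
print(f"ResNet-50 v1 latency: {latency_speedup(97.19, 13.04):.2f}")
```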

Inference Accuracy

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

The models are taken from the Gluon model zoo with pre-trained parameters. Top-1 and top-5 accuracy were verified with the MKL-DNN backend.

As the table below shows, MXNet 1.3 produces the same accuracy with and without MKL-DNN, to within 10e-8.

Note: The dataset used is ImageNet1k; the valdata/ directory is generated by imagenet1k-val.sh.

Inference Accuracy Comparison
| Alias | Network | CPU (without MKL-DNN) top1 | CPU (without MKL-DNN) top5 | CPU (with MKL-DNN backend) top1 | CPU (with MKL-DNN backend) top5 | Delta top1 | Delta top5 |
|---|---|---|---|---|---|---|---|
| alexnet | AlexNet | 0.56312500 | 0.78992188 | 0.56312500 | 0.78992188 | 0.00000000 | 0.00000000 |
| densenet121 | DenseNet-121 | 0.74203125 | 0.91929688 | 0.74203125 | 0.91929688 | 0.00000000 | 0.00000000 |
| densenet161 | DenseNet-161 | 0.77195313 | 0.93390625 | 0.77195313 | 0.93390625 | 0.00000000 | 0.00000000 |
| densenet169 | DenseNet-169 | 0.75710938 | 0.92828125 | 0.75710938 | 0.92828125 | 0.00000000 | 0.00000000 |
| densenet201 | DenseNet-201 | 0.76906250 | 0.93093750 | 0.76906250 | 0.93093750 | 0.00000000 | 0.00000000 |
| inceptionv3 | Inception V3 299x299 | 0.77609375 | 0.93664063 | 0.77609375 | 0.93664063 | 0.00000000 | 0.00000000 |
| mobilenet0.25 | MobileNet 0.25 | 0.51039063 | 0.75687500 | 0.51039063 | 0.75687500 | 0.00000000 | 0.00000000 |
| mobilenet0.5 | MobileNet 0.5 | 0.61851563 | 0.83789063 | 0.61851563 | 0.83789063 | 0.00000000 | 0.00000000 |
| mobilenet0.75 | MobileNet 0.75 | 0.66546875 | 0.87070313 | 0.66546875 | 0.87070313 | 0.00000000 | 0.00000000 |
| mobilenet1.0 | MobileNet 1.0 | 0.70093750 | 0.89109375 | 0.70093750 | 0.89109375 | 0.00000000 | 0.00000000 |
| mobilenetv2_1.0 | MobileNetV2 1.0 | 0.69976563 | 0.89281250 | 0.69976563 | 0.89281250 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.75 | MobileNetV2 0.75 | 0.68210938 | 0.88007813 | 0.68210938 | 0.88007813 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.5 | MobileNetV2 0.5 | 0.64453125 | 0.84929688 | 0.64453125 | 0.84929688 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.25 | MobileNetV2 0.25 | 0.50890625 | 0.74546875 | 0.50890625 | 0.74546875 | 0.00000000 | 0.00000000 |
| resnet18_v1 | ResNet-18 V1 | 0.70812500 | 0.89453125 | 0.70812500 | 0.89453125 | 0.00000000 | 0.00000000 |
| resnet34_v1 | ResNet-34 V1 | 0.73960938 | 0.91609375 | 0.73960938 | 0.91609375 | 0.00000000 | 0.00000000 |
| resnet50_v1 | ResNet-50 V1 | 0.76062500 | 0.93046875 | 0.76062500 | 0.93046875 | 0.00000000 | 0.00000000 |
| resnet101_v1 | ResNet-101 V1 | 0.77937500 | 0.93617188 | 0.77937500 | 0.93617188 | 0.00000000 | 0.00000000 |
| resnet152_v1 | ResNet-152 V1 | 0.78320313 | 0.93867188 | 0.78320313 | 0.93867188 | 0.00000000 | 0.00000000 |
| resnet18_v2 | ResNet-18 V2 | 0.71046875 | 0.89671875 | 0.71046875 | 0.89671875 | 0.00000000 | 0.00000000 |
| resnet34_v2 | ResNet-34 V2 | 0.74085938 | 0.91578125 | 0.74085938 | 0.91578125 | 0.00000000 | 0.00000000 |
| resnet50_v2 | ResNet-50 V2 | 0.76750000 | 0.93187500 | 0.76750000 | 0.93187500 | 0.00000000 | 0.00000000 |
| resnet101_v2 | ResNet-101 V2 | 0.78125000 | 0.94015625 | 0.78125000 | 0.94015625 | 0.00000000 | 0.00000000 |
| resnet152_v2 | ResNet-152 V2 | 0.78554688 | 0.94140625 | 0.78554688 | 0.94140625 | 0.00000000 | 0.00000000 |
| squeezenet1.0 | SqueezeNet 1.0 | 0.57273438 | 0.79554688 | 0.57273438 | 0.79554688 | 0.00000000 | 0.00000000 |
| squeezenet1.1 | SqueezeNet 1.1 | 0.57023438 | 0.79601563 | 0.57023438 | 0.79601563 | 0.00000000 | 0.00000000 |
| vgg11 | VGG-11 | 0.67062500 | 0.87531250 | 0.67062500 | 0.87531250 | 0.00000000 | 0.00000000 |
| vgg13 | VGG-13 | 0.68132813 | 0.87984375 | 0.68132813 | 0.87984375 | 0.00000000 | 0.00000000 |
| vgg16 | VGG-16 | 0.72062500 | 0.90585938 | 0.72062500 | 0.90585938 | 0.00000000 | 0.00000000 |
| vgg19 | VGG-19 | 0.73468750 | 0.91000000 | 0.73468750 | 0.91000000 | 0.00000000 | 0.00000000 |
| vgg11_bn | VGG-11 with batch normalization | 0.68953125 | 0.88882813 | 0.68953125 | 0.88882813 | 0.00000000 | 0.00000000 |
| vgg13_bn | VGG-13 with batch normalization | 0.69835938 | 0.88953125 | 0.69835938 | 0.88953125 | 0.00000000 | 0.00000000 |
| vgg16_bn | VGG-16 with batch normalization | 0.72226563 | 0.90390625 | 0.72226563 | 0.90390625 | 0.00000000 | 0.00000000 |
| vgg19_bn | VGG-19 with batch normalization | 0.72992188 | 0.90992188 | 0.72992188 | 0.90992188 | 0.00000000 | 0.00000000 |
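As an illustration of what the top1, top5, and Delta columns represent, here is a minimal sketch of computing top-k accuracy for two backends and checking that they agree within a tolerance. The helper names and the toy scores are made up for this example and are not benchmark data; the default tolerance of 1e-8 is taken from the introduction, and the actual evaluation script may work differently.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

def same_accuracy(acc_a, acc_b, eps=1e-8):
    """True if two accuracy values differ by at most eps."""
    return abs(acc_a - acc_b) <= eps

# Toy scores for 4 samples over 3 classes (illustrative values only).
scores_ref = [
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
# A second backend whose raw scores differ by tiny numerical noise.
scores_alt = [[s + 1e-9 for s in row] for row in scores_ref]
labels = [1, 0, 1, 2]

top1_ref = topk_accuracy(scores_ref, labels, k=1)
top1_alt = topk_accuracy(scores_alt, labels, k=1)
print(top1_ref, top1_alt, same_accuracy(top1_ref, top1_alt))
```

Small differences in raw scores leave the accuracy unchanged as long as the class ranking is preserved, which is consistent with the all-zero Delta columns.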


Commands for Reproducing the Results

Please access the script and model from the link below.

https://drive.google.com/open?id=17JenLnZKsmPoZIIyktINFfMjZtDY2Ehc 

(Note: select the parent folder and click download in the drop-down menu)

Refer to launch_benchmark_aws.sh to reproduce the results.
