Introduction
This page details benchmark results comparing MXNet 1.3.0 with and without the MKL-DNN backend (integration proposal). The results clearly show that MKL-DNN boosts inference throughput by 6x to 37x and reduces latency by 2x to 41x, while accuracy is equivalent to within an epsilon of 1e-8.
Inference Performance
These performance results were gathered on an AWS EC2 C5.18xlarge instance using a single socket (one processor).
For throughput, using both sockets provides about a 2x speedup, while latency stays roughly constant.
Performance on Intel CPU with Intel MKL-DNN backend in release 1.3
The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.
$ export KMP_AFFINITY=granularity=fine,compact,1,0
$ export OMP_NUM_THREADS=18
$ numactl --physcpubind=0-17 --membind=0 python …
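For reference, latency and throughput can be timed along the following lines. This is only a minimal sketch (assuming ResNet-50 v1 from the Gluon model zoo and synthetic input data), not the actual benchmark script used for the numbers below:

```python
# Minimal timing sketch (not the actual benchmark script): ResNet-50 v1 from
# the Gluon model zoo with synthetic input, run on CPU.
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()
net = vision.resnet50_v1(pretrained=True, ctx=ctx)
net.hybridize()

def measure(batch_size, iterations=100, warmup=10):
    data = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=ctx)
    # Warm-up iterations exclude graph construction / MKL-DNN primitive setup.
    for _ in range(warmup):
        net(data).wait_to_read()
    start = time.time()
    for _ in range(iterations):
        net(data).wait_to_read()
    elapsed = time.time() - start
    latency_ms = elapsed / iterations * 1000.0
    throughput_fps = batch_size * iterations / elapsed
    return latency_ms, throughput_fps

print("latency (batch size=1): %.2f ms" % measure(1)[0])
print("throughput (batch size=128): %.2f fps" % measure(128)[1])
```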
Latency is measured at batch size 1 (ms, lower is better) and throughput at batch size 128 (fps, higher is better), unless a different batch size is noted.

| Category | Model | Latency w/o MKL-DNN (ms) | Latency w/ MKL-DNN (ms) | Latency speedup | Throughput w/o MKL-DNN (fps) | Throughput w/ MKL-DNN (fps) | Throughput speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN/classification | ResNet-50 v1 | 97.19 | 13.04 | 7.45 | 10.29 | 163.52 | 15.90 |
| | ResNet-50 v2 | 98.69 | 13.02 | 7.58 | 9.94 | 154.17 | 15.51 |
| | Inception v3 | 175.17 | 16.77 | 10.44 | 5.74 | 135.33 | 23.57 |
| | Inception v4 | 330.93 | 31.40 | 10.54 | 3.04 | 69.60 | 22.87 |
| | DenseNet | 111.66 | 18.90 | 5.91 | 8.52 | 149.88 | 17.60 |
| | MobileNet | 38.56 | 4.42 | 8.73 | 24.87 | 512.25 | 20.60 |
| | VGG16 | 406.50 | 20.07 | 20.25 | 2.91 | 70.84 | 24.31 |
| | AlexNet | 64.60 | 3.80 | 17.00 | 26.58 | 965.20 | 36.32 |
| | Inception-ResNet v2 | 181.10 | 49.40 | 3.67 | 5.48 | 82.97 | 15.14 |
| CNN/object detection | Faster R-CNN | 1175.74 | 118.62 | 9.91 | 0.85 | 8.57 | 10.08 |
| | SSD-VGG16 | 721.03 | 47.62 | 15.14 | 1.43 (batch size=224) | 28.90 (batch size=224) | 19.13 |
| | SSD-MobileNet | 239.40 | 28.33 | 8.45 | 4.07 (batch size=256) | 69.97 (batch size=256) | 14.18 |
| RNN | GNMT | 683.43 | 94.00 | 7.27 | 1.46 (batch size=64) | 10.63 (batch size=64) | 6.83 |
| GAN | DCGAN | 8.94 | 0.24 | 37.85 | 109.13 | 4249.36 | 38.94 |
Performance on AMD CPU with Intel MKL-DNN backend in release 1.3
The m5a.24xlarge instance offers 96 vCPUs based on AMD EPYC processors (AVX2).
Throughput is measured at batch size 32 (fps, higher is better).

| Category | Model | Throughput w/o MKL-DNN (fps) | Throughput w/ MKL-DNN (fps) | Speedup |
| --- | --- | --- | --- | --- |
| CNN/classification | ResNet-50 v1 | 2.44 | 38.57 | 15.8x |
| | MobileNet | 5.03 | 194.7 | 38.7x |
Inference Accuracy
The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.
The models come from the Gluon model zoo with pre-trained parameters. The top-1 and top-5 accuracy are verified with the MKL-DNN backend.
As the table below shows, the accuracy from MXNet 1.3 without and with MKL-DNN is exactly the same to within 1e-8.
Note: the ImageNet1k validation dataset (valdata/) is generated by imagenet1k-val.sh.
Inference Accuracy Comparison

Alias | Network | Top-1 (w/o MKL-DNN) | Top-5 (w/o MKL-DNN) | Top-1 (w/ MKL-DNN) | Top-5 (w/ MKL-DNN) | Top-1 delta | Top-5 delta |
---|---|---|---|---|---|---|---|
alexnet | AlexNet | 0.56312500 | 0.78992188 | 0.56312500 | 0.78992188 | 0.00000000 | 0.00000000 |
densenet121 | DenseNet-121 | 0.74203125 | 0.91929688 | 0.74203125 | 0.91929688 | 0.00000000 | 0.00000000 |
densenet161 | DenseNet-161 | 0.77195313 | 0.93390625 | 0.77195313 | 0.93390625 | 0.00000000 | 0.00000000 |
densenet169 | DenseNet-169 | 0.75710938 | 0.92828125 | 0.75710938 | 0.92828125 | 0.00000000 | 0.00000000 |
densenet201 | DenseNet-201 | 0.76906250 | 0.93093750 | 0.76906250 | 0.93093750 | 0.00000000 | 0.00000000 |
inceptionv3 | Inception V3 299x299 | 0.77609375 | 0.93664063 | 0.77609375 | 0.93664063 | 0.00000000 | 0.00000000 |
mobilenet0.25 | MobileNet 0.25 | 0.51039063 | 0.75687500 | 0.51039063 | 0.75687500 | 0.00000000 | 0.00000000 |
mobilenet0.5 | MobileNet 0.5 | 0.61851563 | 0.83789063 | 0.61851563 | 0.83789063 | 0.00000000 | 0.00000000 |
mobilenet0.75 | MobileNet 0.75 | 0.66546875 | 0.87070313 | 0.66546875 | 0.87070313 | 0.00000000 | 0.00000000 |
mobilenet1.0 | MobileNet 1.0 | 0.70093750 | 0.89109375 | 0.70093750 | 0.89109375 | 0.00000000 | 0.00000000 |
mobilenetv2_1.0 | MobileNetV2 1.0 | 0.69976563 | 0.89281250 | 0.69976563 | 0.89281250 | 0.00000000 | 0.00000000 |
mobilenetv2_0.75 | MobileNetV2 0.75 | 0.68210938 | 0.88007813 | 0.68210938 | 0.88007813 | 0.00000000 | 0.00000000 |
mobilenetv2_0.5 | MobileNetV2 0.5 | 0.64453125 | 0.84929688 | 0.64453125 | 0.84929688 | 0.00000000 | 0.00000000 |
mobilenetv2_0.25 | MobileNetV2 0.25 | 0.50890625 | 0.74546875 | 0.50890625 | 0.74546875 | 0.00000000 | 0.00000000 |
resnet18_v1 | ResNet-18 V1 | 0.70812500 | 0.89453125 | 0.70812500 | 0.89453125 | 0.00000000 | 0.00000000 |
resnet34_v1 | ResNet-34 V1 | 0.73960938 | 0.91609375 | 0.73960938 | 0.91609375 | 0.00000000 | 0.00000000 |
resnet50_v1 | ResNet-50 V1 | 0.76062500 | 0.93046875 | 0.76062500 | 0.93046875 | 0.00000000 | 0.00000000 |
resnet101_v1 | ResNet-101 V1 | 0.77937500 | 0.93617188 | 0.77937500 | 0.93617188 | 0.00000000 | 0.00000000 |
resnet152_v1 | ResNet-152 V1 | 0.78320313 | 0.93867188 | 0.78320313 | 0.93867188 | 0.00000000 | 0.00000000 |
resnet18_v2 | ResNet-18 V2 | 0.71046875 | 0.89671875 | 0.71046875 | 0.89671875 | 0.00000000 | 0.00000000 |
resnet34_v2 | ResNet-34 V2 | 0.74085938 | 0.91578125 | 0.74085938 | 0.91578125 | 0.00000000 | 0.00000000 |
resnet50_v2 | ResNet-50 V2 | 0.76750000 | 0.93187500 | 0.76750000 | 0.93187500 | 0.00000000 | 0.00000000 |
resnet101_v2 | ResNet-101 V2 | 0.78125000 | 0.94015625 | 0.78125000 | 0.94015625 | 0.00000000 | 0.00000000 |
resnet152_v2 | ResNet-152 V2 | 0.78554688 | 0.94140625 | 0.78554688 | 0.94140625 | 0.00000000 | 0.00000000 |
squeezenet1.0 | SqueezeNet 1.0 | 0.57273438 | 0.79554688 | 0.57273438 | 0.79554688 | 0.00000000 | 0.00000000 |
squeezenet1.1 | SqueezeNet 1.1 | 0.57023438 | 0.79601563 | 0.57023438 | 0.79601563 | 0.00000000 | 0.00000000 |
vgg11 | VGG-11 | 0.67062500 | 0.87531250 | 0.67062500 | 0.87531250 | 0.00000000 | 0.00000000 |
vgg13 | VGG-13 | 0.68132813 | 0.87984375 | 0.68132813 | 0.87984375 | 0.00000000 | 0.00000000 |
vgg16 | VGG-16 | 0.72062500 | 0.90585938 | 0.72062500 | 0.90585938 | 0.00000000 | 0.00000000 |
vgg19 | VGG-19 | 0.73468750 | 0.91000000 | 0.73468750 | 0.91000000 | 0.00000000 | 0.00000000 |
vgg11_bn | VGG-11 with batch normalization | 0.68953125 | 0.88882813 | 0.68953125 | 0.88882813 | 0.00000000 | 0.00000000 |
vgg13_bn | VGG-13 with batch normalization | 0.69835938 | 0.88953125 | 0.69835938 | 0.88953125 | 0.00000000 | 0.00000000 |
vgg16_bn | VGG-16 with batch normalization | 0.72226563 | 0.90390625 | 0.72226563 | 0.90390625 | 0.00000000 | 0.00000000 |
vgg19_bn | VGG-19 with batch normalization | 0.72992188 | 0.90992188 | 0.72992188 | 0.90992188 | 0.00000000 | 0.00000000 |
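The numbers above can be checked with a short Gluon script along these lines. This is a minimal sketch, assuming the validation record file produced by imagenet1k-val.sh is available as valdata/val.rec; the path and the exact preprocessing settings shown here are illustrative assumptions, not taken from the benchmark scripts:

```python
# Minimal top-1/top-5 accuracy sketch (illustrative only): the path to the
# validation record file and the preprocessing values are assumptions.
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()
net = vision.resnet50_v1(pretrained=True, ctx=ctx)
net.hybridize()

val_data = mx.io.ImageRecordIter(
    path_imgrec='valdata/val.rec',        # hypothetical path to the ImageNet1k val record file
    batch_size=64,
    data_shape=(3, 224, 224),
    resize=256,                           # resize shorter edge, then center crop to 224
    mean_r=123.68, mean_g=116.779, mean_b=103.939,
    std_r=58.393, std_g=57.12, std_b=57.375)

top1 = mx.metric.Accuracy()
top5 = mx.metric.TopKAccuracy(top_k=5)
for batch in val_data:
    out = net(batch.data[0].as_in_context(ctx))
    top1.update([batch.label[0]], [out])
    top5.update([batch.label[0]], [out])
print(top1.get(), top5.get())
```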
Commands for Reproducing the Results
Please access the scripts and models from the link below.
https://drive.google.com/open?id=17JenLnZKsmPoZIIyktINFfMjZtDY2Ehc
(Note: select the parent folder and click download in the drop-down menu)
You can refer to launch_benchmark_aws.sh to reproduce the results.
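As a quick sanity check of the numerical-equivalence claim in the accuracy section, one can also save a forward pass from each build and compare the raw outputs. The sketch below is not part of launch_benchmark_aws.sh; the WITH_MKLDNN tag and file names are just illustrative conventions:

```python
# Sanity-check sketch for the "identical within 1e-8" claim (not part of
# launch_benchmark_aws.sh). Run once per build; the WITH_MKLDNN environment
# variable is only a hypothetical way to name the output file.
import os
import numpy as np
import mxnet as mx
from mxnet.gluon.model_zoo import vision

net = vision.resnet50_v1(pretrained=True)
net.hybridize()

data = mx.nd.ones(shape=(1, 3, 224, 224))     # fixed input so both runs see identical data
out = net(data).asnumpy()

tag = 'mkldnn' if os.environ.get('WITH_MKLDNN') == '1' else 'plain'
np.save('out_%s.npy' % tag, out)

# After running under both builds:
#   a, b = np.load('out_plain.npy'), np.load('out_mkldnn.npy')
#   print(np.allclose(a, b, atol=1e-8))
```

Run it once with the plain MXNet build and once with the MKL-DNN build, then compare the two saved arrays as shown in the trailing comment.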
3 Comments

Anton Chernov: Patric Zhao, could you add information about the tooling you were using?

Patric Zhao: Anton Chernov, sure, I have added a section for reproducing the results.

Anton Chernov: Great, thank you!