Benchmarking MXNet with different OpenMP implementations

Summary

We have gathered data on how MXNet performance changes when it is linked against different OpenMP implementations and compiled with different compilers.

Primary goal: Provide performance data points on different OpenMP implementations.

Secondary goal: Compare performance when compiled with different compilers.

TL;DR

The difference between compilers is insignificant. Native OpenMP implementations, whether recent or older, perform about equally (<5% difference). For "narrow" networks like AlexNet, small batch sizes during inference can indeed make a difference, but it is still minor (~5%).

Current state

LLVM OpenMP explicitly used in all builds

At the time of writing, MXNet bundles LLVM OpenMP as a submodule, pinned to a revision from 11/2017. The library is built from that revision and MXNet is explicitly linked against it. The OpenMP library that the compiler links by default is not removed. When MXNet is built with MKLML, the Intel OpenMP library is explicitly removed from the linked libraries.

Thus, an application can end up with multiple OpenMP implementations: the one that is explicitly built and linked, the one linked implicitly by the compiler, and the one provided with mklml_intel.
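To check whether a given process is affected, one can look at which OpenMP runtimes are actually mapped into it. The following is a minimal sketch, not part of the MXNet tooling: the function names are hypothetical, the library file names (libomp, libgomp, libiomp5) are the usual ones for the LLVM, GCC and Intel runtimes, and reading /proc/self/maps only works on Linux.

```python
import re

# Usual shared-library names of the three OpenMP runtimes discussed here.
OPENMP_RUNTIMES = {
    "libomp": "LLVM OpenMP",
    "libgomp": "GCC OpenMP (GOMP)",
    "libiomp5": "Intel OpenMP (IOMP)",
}

_LIB_RE = re.compile(r"(libomp|libgomp|libiomp5)\.so")


def find_openmp_runtimes(library_paths):
    """Return the sorted names of the OpenMP runtimes found in the paths."""
    found = set()
    for path in library_paths:
        match = _LIB_RE.search(path)
        if match:
            found.add(OPENMP_RUNTIMES[match.group(1)])
    return sorted(found)


def loaded_libraries(maps_file="/proc/self/maps"):
    """List the shared objects mapped into the current process (Linux only)."""
    with open(maps_file) as handle:
        return [line.split()[-1] for line in handle if ".so" in line]
```

Calling find_openmp_runtimes(loaded_libraries()) after importing mxnet lists the runtimes present in the process; more than one entry corresponds to the situation described above.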

As stated here:

Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.

A discussion has been started on the dev list to review a possible solution to the problem.

Currently, we assume these issues might be related:

Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
https://github.com/apache/incubator-mxnet/issues/10856
libomp.so dependency (need REAL fix)
https://github.com/apache/incubator-mxnet/issues/11417
mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
https://github.com/apache/incubator-mxnet/issues/8532
Performance regression when OMP_NUM_THREADS environment variable is not set
https://github.com/apache/incubator-mxnet/issues/9744
Poor concat CPU performance on CUDA builds
https://github.com/apache/incubator-mxnet/issues/11905
Poor performance of the libmxnet if OMP_PLACES environment variable is present
https://github.com/apache/incubator-mxnet/issues/14087

Although this setup is reproducible, we lose the benefits of newer OpenMP versions, which ship with the latest compiler releases.

Make vs CMake

As of now (1/2019) we have two build systems: Make and CMake. Current production binaries are built with Make, whose compiler optimization flags are more aggressive. The CMake build is still under development, and some of its settings lag behind those of Make (for example, SSE2 vs. SSE3). Moreover, the current CMake build produces significantly slower binaries. See:

!!!GPU performance of Cmake built mxnet is worse than Make built one
https://github.com/apache/incubator-mxnet/issues/6685

One of the reasons for the CPU (i.e. non-CUDA) build is that OpenBLAS precedes MKLML in the linker command. See:

Default cmake build uses openblas instead of MKL
https://github.com/apache/incubator-mxnet/issues/14085

The pip-distributed version contains mklml_intel.so and is built with Make.

Intel Compiler Issues

Currently, there are several problems when compiling MXNet with ICC (Intel C++ Compiler). See:

Intel Compiler fails to build mxnet
https://github.com/apache/incubator-mxnet/issues/14086

Experiment setup

We have measured performance under the following conditions:

Hardware

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs. We have not limited the usage of the cores/sockets.

Build

We use the current CMake build and consider most of the flags that deviate from the Make build (like the SSE level or explicit loop unrolling) to be insignificant for our experiments.

cmake \
    -DUSE_CUDA=OFF \
    -DWITH_TESTS=OFF \
    -DWITH_EXAMPLES=OFF \
    -DCMAKE_CXX_COMPILER=$CXXCOMP \
    -DCMAKE_C_COMPILER=$CCOMP \
    -DMKLDNN_THREADING=$THREADING \
    $LD_ARG ..

See details in the attached benchmark.sh file.
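For scripting the builds, the same invocation can be assembled per configuration. The sketch below only mirrors the flags of the command above; the function name, the example compiler names and the example threading value are illustrative assumptions, and the actual values are the ones set in benchmark.sh.

```python
def cmake_command(cxx_compiler, c_compiler, threading, extra_args=()):
    """Build the cmake argument list for one compiler/OpenMP configuration."""
    return [
        "cmake",
        "-DUSE_CUDA=OFF",
        "-DWITH_TESTS=OFF",
        "-DWITH_EXAMPLES=OFF",
        f"-DCMAKE_CXX_COMPILER={cxx_compiler}",
        f"-DCMAKE_C_COMPILER={c_compiler}",
        f"-DMKLDNN_THREADING={threading}",
        *extra_args,  # stands in for $LD_ARG; empty when nothing is passed
        "..",
    ]


# Hypothetical example values, not taken from benchmark.sh.
example = cmake_command("clang++", "clang", "OMP:COMP")
```

Passing the resulting list to subprocess.run(..., cwd=<build directory>, check=True) avoids shell quoting problems such as a stray space after a line-continuation backslash.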

MXNet source code

We branched from MXNet 35c33832.
We have fixed an issue with OpenBLAS hiding MKL (eebc17b0).
We have fixed the issues preventing us from compiling with ICC (52648d42).
We have applied the changes from this pull request to get a treatment group (24a2a8c8).
As a control group, we use binaries built with only the two fixes above (eebc17b0 and 52648d42).

Compilers and OpenMP implementations

Treatment groups

	ID	Compiler	OpenMP	MKL
1	clang3_gnu	Clang 3.8.0	Native OMP	mklml_gnu
2	clang3_intel	Clang 3.8.0	Intel OMP	mklml_intel
3	gcc5_gnu	GCC 5.4.0	Native GOMP	mklml_gnu
4	gcc5_intel	GCC 5.4.0	Intel OMP	mklml_intel
5	clang7_gnu	Clang 7.0.1	Native OMP	mklml_gnu
6	clang7_intel	Clang 7.0.1	Intel OMP	mklml_intel
7	gcc8_gnu	GCC 8.1.0	Native GOMP	mklml_gnu
8	gcc8_intel	GCC 8.1.0	Intel OMP	mklml_intel
9	intel19_intel	Intel Compiler 19.0.1	Native Intel OMP	mklml_intel

Control groups

	ID	Compiler	OpenMP	MKL
1	clang3_omp	Clang 3.8.0	Provided OMP	mklml_gnu
2	gcc5_omp	GCC 5.4.0	Provided OMP	mklml_gnu
3	clang7_omp	Clang 7.0.1	Provided OMP	mklml_gnu
4	gcc8_omp	GCC 8.1.0	Provided OMP	mklml_gnu
5	intel19_omp	Intel Compiler 19.0.1	Native Intel OMP	mklml_gnu

Please note that the LLVM OpenMP runtime and Intel OpenMP are most likely just different versions of the Intel OpenMP runtime; therefore, we don't expect any significant differences between them.

Benchmark code

We have followed:

As in both of the mentioned documents, we use image-classification/benchmark_score.py (we will call it the convolutional benchmark). Additionally, we use the faster-rcnn benchmark from the second document.

Contrary to the second source, we have not limited the usage of the sockets.

Environment

	Variable	Value
1	KMP_AFFINITY	granularity=fine,noduplicates,compact,1,0
2	OMP_NUM_THREADS	36
3	GOMP_CPU_AFFINITY	0-71
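OpenMP runtimes read these variables when they initialise, so they have to be in place before the library is loaded. A minimal sketch of doing this from Python follows; the helper name is hypothetical, and the values are copied from the table above. KMP_AFFINITY is the setting of the Intel/LLVM runtime, GOMP_CPU_AFFINITY that of the GCC runtime.

```python
import os

# Values copied from the environment table above.
BENCHMARK_ENV = {
    "KMP_AFFINITY": "granularity=fine,noduplicates,compact,1,0",
    "OMP_NUM_THREADS": "36",
    "GOMP_CPU_AFFINITY": "0-71",
}


def apply_benchmark_env(environ=os.environ):
    """Write the benchmark settings into the given environment mapping."""
    environ.update(BENCHMARK_ENV)
    return environ
```

Call apply_benchmark_env() before the first import of mxnet; changing the variables after the OpenMP runtime has started generally has no effect.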

General score

To calculate the general score, we measure the improvement ratio vs clang3_gnu in terms of throughput. Each test was repeated 5 times.
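The exact aggregation is not spelled out here, so the following sketch rests on an assumption: for every test (for example a model and batch size pair) the throughput is divided by the clang3_gnu throughput of the same test, and the score is the mean of these ratios with the standard error of the mean. The function name and the data layout are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def general_score(throughput, baseline):
    """Mean throughput ratio vs. the baseline and its standard error.

    Both arguments map a test key, e.g. (model, batch_size), to a
    throughput value such as images per second.
    """
    ratios = [throughput[key] / baseline[key] for key in sorted(baseline)]
    score = mean(ratios)
    # Standard error of the mean; undefined for a single sample, so use 0.
    std_err = stdev(ratios) / sqrt(len(ratios)) if len(ratios) > 1 else 0.0
    return score, std_err
```

Under this assumption the baseline scored against itself gives exactly 1 with a standard error of 0, which is consistent with the baseline rows in the result tables.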

Results discussion

The results match, in their order of magnitude, the numbers from the mentioned source.

Obviously, two factors contribute to the performance values:

OpenMP implementation
Quality of generated machine code

As mentioned, the impact of the latter factor is limited by the precompiled BLAS libraries. We expect the OpenMP overhead to be significant for small models and batch sizes.

As model and batch sizes increase, we expect the run time to be dominated by the actual matrix operations.

Convolutional benchmark

AlexNet

Let's take a look at the smaller AlexNet, since it's expected to show the most differences.

As expected, the control group shows almost no difference between setups: again, they all use the same OpenMP and the same precompiled MKL.

We are not able to explain the +20%/-10% swing of both GCC compilers.

We see the same behaviour in the treatment group, no matter which OpenMP is used.

Control group

The treatment group shows no difference other than that "GCC swing". Normalizing the data gives average scores that differ by ~1%, which is close to the standard error.

Treatment group

ResNet152

Now we can observe a beautiful saturation of the throughput. Optimal batch size is between 16 and 32.

With ResNet-152 we no longer see any notable swings. The maximum difference from the baseline for a single batch size is ~7% in both groups, and ~4% when averaged over all batch sizes. For example, we found Clang 7 with native OMP to be 4% faster. Note, however, that the standard error is 2%.

We get very similar data for the other models.

Total scores

The control group shows the following, very close numbers:

	ID	Score	Std. err
1	clang3_omp	1	0
2	clang7_omp	1.01157	0.02027
3	gcc5_omp	1.00581	0.01914
4	gcc8_omp	1.00795	0.0192
5	intel19_omp	1.0093	0.0192

Combining the treatment group with clang7_omp, the best performer of the control group (again, by a margin of only 1%), we get the following data.

	ID	Score	Std. err
1	clang3_gnu	1	0
2	clang3_intel	1.00051	0.01739
3	clang7_gnu	1.014	0.02055
4	clang7_intel	1.01186	0.01899
5	gcc5_gnu	0.98937	0.01913
6	gcc5_intel	1.0083	0.01696
7	gcc8_gnu	0.98195	0.01961
8	gcc8_intel	1.00822	0.01723
9	intel19_intel	1.00486	0.01756
10	clang7_omp	1.01215	0.01777

We can see some patterns:

Newer compilers perform better than older ones.
GOMP is slower than IOMP.

However, the overall differences are close to the standard error and do not even reach 2%.
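To make "close to standard error" concrete, each score from the combined table can be expressed as its distance from the baseline (1.0) in units of its own standard error. This is a plain arithmetic check on the published numbers, not an additional measurement; the helper name is hypothetical.

```python
# (score, standard error) copied from the combined table above;
# the baseline row clang3_gnu (1, 0) is left out.
SCORES = {
    "clang3_intel": (1.00051, 0.01739),
    "clang7_gnu": (1.014, 0.02055),
    "clang7_intel": (1.01186, 0.01899),
    "gcc5_gnu": (0.98937, 0.01913),
    "gcc5_intel": (1.0083, 0.01696),
    "gcc8_gnu": (0.98195, 0.01961),
    "gcc8_intel": (1.00822, 0.01723),
    "intel19_intel": (1.00486, 0.01756),
    "clang7_omp": (1.01215, 0.01777),
}


def deviation_in_std_errs(score, std_err, baseline=1.0):
    """Distance of a score from the baseline, in standard errors."""
    return abs(score - baseline) / std_err


deviations = {name: deviation_in_std_errs(*row) for name, row in SCORES.items()}
largest = max(deviations, key=deviations.get)
```

Every configuration stays within one standard error of the baseline; the largest deviation is gcc8_gnu at roughly 0.92 standard errors.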

faster-rcnn Benchmark

As we can see, GOMP delivers ~3-5% worse performance than OMP.

Conclusion

We interpret the results as a suggestion that the current setup should be simplified. The benchmarks show that we gain at most 5% over the worst-case setup (e.g. older GCC). On the other hand, the MKL maintainers explicitly discourage the current approach of linking several OpenMP runtimes, as it can (and does) lead to hard-to-find issues.

Further tasks and open questions

Can we achieve better performance with GOMP using other environment variables?
Can we get more info about the dominating factors (code quality vs. OpenMP) with a profiler?
Repeat the benchmarking on instance types other than c5.18xlarge
Include Windows compilers

Acronyms

OMP - LLVM OpenMP implementation

IOMP - Intel OpenMP implementation

GOMP - GCC OpenMP implementation

ICC - Intel C++ Compiler

GCC - GNU Compiler Collection
