We have gathered information on how MXNet performance changes when it is linked against different OpenMP implementations and compiled with different compilers.
Primary goal: Provide performance data points on different OpenMP implementations.
Secondary goal: Compare performance when compiled with different compilers.
The difference between compilers is insignificant. Reasonably recent native OpenMP implementations perform equally (<5% difference). Small batch sizes during inference can make a difference for "narrow" networks like AlexNet, but even that effect is minor (~5%).
LLVM OpenMP explicitly used in all builds
At the time of writing, MXNet bundles LLVM OpenMP as a submodule pinned to a revision from 11/2017. That revision is built and linked explicitly, while the OpenMP runtime suggested by the compiler is not removed from the link line. Only when building with MKLML is the Intel version explicitly removed from the linked libraries.
Thus, an application can end up containing multiple OpenMP implementations: the one explicitly built and linked, the one linked implicitly by the compiler, and the one shipped with mklml_intel.
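A quick way to see which runtimes a given build actually pulls in is to inspect the shared-library dependencies of libmxnet.so. Below is a minimal sketch; the library path is an assumption and depends on how MXNet was installed:

```python
import subprocess

# The path to libmxnet.so is an assumption; adjust it to your installation.
libmxnet = "/usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so"

ldd_output = subprocess.check_output(["ldd", libmxnet]).decode()
runtimes = [line.strip() for line in ldd_output.splitlines()
            if any(name in line for name in ("libomp", "libgomp", "libiomp5", "libmklml"))]

# More than one entry here means several OpenMP runtimes can be loaded at once.
print("\n".join(runtimes) if runtimes else "no OpenMP runtime found in ldd output")
```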
As stated here:
Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.
A discussion has been started on the dev list to review a possible solution to the problem.
Currently, we assume these issues might be related:
- Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
- libomp.so dependency (need REAL fix)
- mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
- Performance regression when OMP_NUM_THREADS environment variable is not set
- Poor concat CPU performance on CUDA builds
- Poor performance of the libmxnet if OMP_PLACES environment variable is present
Although this setup is reproducible, we lose the benefits of newer OpenMP versions, which ship with the latest compiler releases.
Make vs CMake
As of now (1/2019) we have two build systems: Make and CMake. Current production binaries are delivered by Make, and its compiler optimization flags are more aggressive. CMake is under development and some of its settings lag behind Make (e.g. SSE2 vs SSE3). On top of that, the current CMake build produces critically slower binaries. See:
- GPU performance of Cmake built mxnet is worse than Make built one
One of the reasons on CPU (i.e. the non-CUDA version) is that OpenBLAS precedes MKLML in the linker commands. See:
- Default cmake build uses openblas instead of MKL
The pip-distributed version contains mklml_intel.so and is built by Make.
Intel Compiler Issues
Currently, there are several problems compiling MXNet with ICC (Intel C++ Compiler). See:
- Intel Compiler fails to build mxnet
We have measured the performance of the code under the following conditions:
The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs. We have not limited the usage of the cores/sockets.
We use the current CMake build, considering most of the deviating flags (such as the SSE level or explicit loop unrolling) to be insignificant for our experiments.
See details in the attached benchmark.sh file.
MXNet source code
- MXNet 35c33832 was branched.
- We have fixed an issue with OpenBLAS hiding MKL (eebc17b0)
- We have fixed the issues preventing us from compiling with ICC (52648d42)
- We have applied the changes from this pull request to get a treatment group (24a2a8c8)
- As a control group we use binaries with just the two previous changes.
Compilers and OpenMP implementations
Intel Compiler 19.0.1
Native Intel OMP
Please note that the LLVM OpenMP runtime and Intel OpenMP are very likely just different versions of the same Intel OpenMP runtime, so we do not expect any significant differences between them.
We have followed:
As in both of the mentioned documents, we use image-classification/benchmark_score.py (we will call it the convolutional benchmark). Additionally, we used the Faster R-CNN benchmark from the second document.
Contrary to the second source, we have not limited socket usage.
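For reference, limiting a run to a single socket could be done along the following lines. This is only a sketch, and the vCPU-to-socket mapping shown is an assumption that should be verified with lscpu on the actual instance:

```python
import os

# Assumed mapping for c5.18xlarge: vCPUs 0-17 and their hyperthread
# siblings 36-53 belong to socket 0 (verify with `lscpu -e` first).
socket0_cpus = set(range(0, 18)) | set(range(36, 54))
os.sched_setaffinity(0, socket0_cpus)          # pin this process to socket 0

# OpenMP runtimes read OMP_NUM_THREADS at initialisation, so set it
# before MXNet (and hence the runtime) is imported.
os.environ["OMP_NUM_THREADS"] = "18"           # physical cores of one socket

import mxnet as mx  # noqa: E402 - imported after the affinity/env setup on purpose
```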
To calculate the overall score, we measure the improvement ratio versus clang3_gnu in terms of throughput. Each test was repeated 5 times.
Our results match the ordering of the numbers from the mentioned source.
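The score computation itself is straightforward. The sketch below uses made-up placeholder throughputs purely to illustrate how the ratio versus clang3_gnu and the standard error are derived:

```python
import statistics

# Illustrative placeholder throughputs (images/sec), 5 repeats per setup.
runs = {
    "clang3_gnu": [105.1, 104.8, 105.6, 104.9, 105.3],   # baseline
    "clang7_omp": [109.9, 110.4, 110.1, 109.7, 110.2],
}

baseline = statistics.mean(runs["clang3_gnu"])

for setup, samples in runs.items():
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # standard error of the mean
    print(f"{setup}: {mean:.1f} img/s, ratio vs baseline {mean / baseline:.3f}, SEM {sem:.2f}")
```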
Obviously, two factors contribute to the performance values:
- OpenMP implementation
- Quality of generated machine code
As mentioned, the impact of the latter factor is limited by the use of precompiled BLAS libraries. We expect the OpenMP overhead to be significant for small models/batch sizes.
With increasing model and batch sizes we expect the runtime to be dominated by the actual matrix operations.
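This expectation can be checked with a crude experiment outside of the full benchmark: time a single matrix multiplication at growing sizes, where small sizes are dominated by threading and dispatch overhead and large sizes by the BLAS kernel itself. A rough sketch (sizes and repeat counts are arbitrary):

```python
import time
import mxnet as mx

def gemm_rate(n, repeats=20):
    """Multiplications per second for an n x n GEMM on CPU."""
    a = mx.nd.random.uniform(shape=(n, n))
    b = mx.nd.random.uniform(shape=(n, n))
    mx.nd.waitall()                 # materialise inputs before timing
    start = time.time()
    for _ in range(repeats):
        mx.nd.dot(a, b)
    mx.nd.waitall()                 # MXNet is asynchronous; wait for all results
    return repeats / (time.time() - start)

# Small sizes expose OpenMP/dispatch overhead, large sizes are BLAS-bound.
for n in (64, 256, 1024, 4096):
    print(f"{n:5d}: {gemm_rate(n):8.2f} mult/s")
```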
Let's take a look at the smaller AlexNet, since it is expected to show the largest differences.
As expected, the control group shows almost no difference between setups – recall that we use the same OpenMP and precompiled MKL.
We are not able to explain the +20%/-10% swing of the two GCC compilers.
We see the same behaviour in the treatment group, no matter which OpenMP is used.
The treatment group shows no difference other than that "GCC swing". Normalizing the data gives average scores within ~1% of each other, which is close to the standard error.
Now we can observe a clear saturation of the throughput; the optimal batch size is between 16 and 32.
With ResNet-152 we see no further interesting swings. The maximal difference to the baseline for a single batch size is ~7% in both groups, and ~4% when averaged over all batch sizes. For example, we found Clang 7 with native OMP to be 4% faster, against a standard error of 2%.
We get very similar data for other models.
The control group shows the following closely clustered numbers:
Combining the treatment group with clang7_omp, the best performer of the control group (again, by a margin of only 1%), we get the following data.
We can see fairly obvious patterns:
- Newer compilers perform better than older ones.
- GOMP is slower than IOMP.
But the overall differences are close to the standard error and do not even reach 2%.
As we can see, GOMP delivers ~3-5% worse performance than OMP.
We interpret the results as a suggestion that the current setup should be simplified. The benchmarking shows at most 5% improvement over the worst-case setup (e.g. older GCC). On the other hand, MKL maintainers explicitly discourage this approach, as it can (and does) lead to hard-to-find issues.
Further tasks and open questions
- Can we achieve better performance with GOMP using other environment variables? (Candidate variables are sketched after this list.)
- Can we get more information about the dominating factors (code quality vs OpenMP) with a profiler?
- Repeat the benchmarking on other instance types than c5.18xlarge
- Include Windows compilers
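For the first question, a reasonable starting point would be the standard OpenMP affinity and wait-policy variables plus the GOMP-specific ones. The values below are only illustrative starting points, not recommendations:

```python
import os

# Set before MXNet (and thus the OpenMP runtime) is loaded.
os.environ.setdefault("OMP_NUM_THREADS", "36")       # physical cores on c5.18xlarge
os.environ.setdefault("OMP_PROC_BIND", "true")       # keep threads on their initial CPUs
os.environ.setdefault("OMP_WAIT_POLICY", "active")   # spin between parallel regions
# GOMP-specific alternative to OMP_PROC_BIND/OMP_PLACES:
# os.environ.setdefault("GOMP_CPU_AFFINITY", "0-35")

import mxnet as mx  # noqa: E402
```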
OMP - LLVM OpenMP implementation
IOMP - Intel OpenMP implementation
GOMP - GCC OpenMP implementation
ICC - Intel C++ Compiler
GCC - GNU C Compiler