This document describes the design of the MKLDNN integration at a high level. It covers the mechanisms and interfaces we implemented and explains why we chose this design. The description is organized in three parts: integration with NDArray and the MXNet executor, access to MKLDNN memory from NDArray, and MKLDNN operators.

Integration with NDArray and the MXNet executor

MKLDNN uses different layouts to accelerate some operations (e.g., convolution and matrix multiplication). In this document, the default layout refers to the layouts used by MXNet (e.g., NC for 2D arrays and NCHW for 4D arrays). MKLDNN layouts may add padding to arrays and may therefore increase their memory size. Previously, we attempted to create a separate storage type (MKLDNNStorage) for MKLDNN memory. That design made it difficult to convert the layouts of weight arrays, which we do to avoid the overhead of layout conversion during inference (the weight arrays are usually given by users and use the default storage). As such, we eventually decided to use the default storage type of NDArray for MKLDNN memory. Now we can change the layout of an NDArray freely. This design hides MKLDNN and its layout conversions completely from users and provides full compatibility with the previous MKLML integration.

Given such a design, we now store MKLDNN memory in Mkl_mem_ in Chunk. If Mkl_mem_ doesn’t exist, the data stored in shandle uses the default layout. Even if Mkl_mem_ exists, the data in the NDArray may still use the default layout. To simplify data access and enable memory reuse, the memory is still managed by MXNet through shandle. In other words, we never allocate memory from MKLDNN, and Mkl_mem_, if it has been created, always references the memory in shandle.
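
The ownership rule above can be illustrated with a minimal C++ sketch. All types here (StorageHandle, MklMemView, Layout) are hypothetical stand-ins, not the real MXNet or MKLDNN classes, and the member names only mirror the ones used in the text. The point is that the view never owns memory and that having a view does not imply a non-default layout.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins; none of these are the real MXNet or MKLDNN classes.
enum class Layout { kDefault, kBlocked };

// Stand-in for the storage handle whose memory MXNet manages.
struct StorageHandle {
  std::vector<float> buffer;
};

// Stand-in for an MKLDNN memory object: a non-owning view plus a layout tag.
struct MklMemView {
  float* data;
  Layout layout;
};

struct Chunk {
  StorageHandle shandle;                // MXNet always owns the bytes
  std::unique_ptr<MklMemView> mkl_mem;  // optional view; null means default layout

  explicit Chunk(std::size_t n) { shandle.buffer.resize(n); }

  // Create the view lazily. It points into shandle and starts in the default layout.
  MklMemView* GetOrCreateView() {
    if (!mkl_mem) {
      mkl_mem.reset(new MklMemView{shandle.buffer.data(), Layout::kDefault});
    }
    // Invariant from the text: the view always references the memory in shandle.
    assert(mkl_mem->data == shandle.buffer.data());
    return mkl_mem.get();
  }

  // The data uses the default layout if no view exists or the view says so.
  bool IsDefaultLayout() const {
    return !mkl_mem || mkl_mem->layout == Layout::kDefault;
  }
};
```

Because the view only borrows the buffer, reusing or slicing the underlying storage stays under MXNet's control in this sketch.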

Such a design, however, increases the complexity of the integration, mainly because an NDArray with MKLDNN memory now needs to support slicing and array reuse (a mechanism used by the MXNet executor to reduce memory consumption).

Access MKLDNN memory from NDArray

MKLDNN is very strict about memory layout. Providing unexpected layouts or memory alignments to an MKLDNN operator may cause significant performance degradation, incorrect outputs, or even segmentation faults. The problem exists even for layouts that are compatible (NCHW for input data arrays and OIHW for weight arrays). As such, we provide four methods in NDArray for accessing data through MKLDNN memory:

In addition to the methods above, we also provide a wrapper function GetWeights() to get MKLDNN memory with the specified layout. The main reason is that MKLDNN uses 5D arrays to store weights when the number of groups in a convolution is larger than one, and an MKLDNN reorder doesn’t work if the numbers of dimensions of the two arrays don’t match. GetWeights() is guaranteed to return an MKLDNN memory object with the required number of dimensions and the required layout.
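
To make the 5D case concrete, the sketch below computes the dimensions a grouped-convolution weight array would need. GroupedWeightDims is a hypothetical helper, not the real GetWeights(), and it touches no data. It assumes the default-layout weight shape is (output channels, input channels per group, kernel height, kernel width) and that the grouped form splits the output-channel axis into (groups, output channels per group).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the real GetWeights(): it only computes the
// dimensions a weight array would need, without touching any data.
// `weight_dims` is the assumed 4D shape in the default layout:
// (output channels, input channels per group, kernel height, kernel width).
std::vector<std::int64_t> GroupedWeightDims(
    const std::vector<std::int64_t>& weight_dims, std::int64_t num_group) {
  assert(weight_dims.size() == 4);
  assert(num_group >= 1 && weight_dims[0] % num_group == 0);
  if (num_group == 1) return weight_dims;  // 4D is enough without groups
  // With groups, split the output-channel axis into (groups, channels per group).
  return {num_group, weight_dims[0] / num_group, weight_dims[1],
          weight_dims[2], weight_dims[3]};
}
```

For example, under these assumptions a weight array of shape (64, 8, 3, 3) with 4 groups maps to (4, 16, 8, 3, 3), which is why a 4D source and a 5D destination cannot be reordered directly.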

The MKLDNN memory objects returned by the functions above are referenced by raw pointers and are valid only inside the operator invocation. They will be invalid in the next invocation. Therefore, these memory objects cannot be stored and reused across mini-batches.

Whenever an NDArray is accessed through MKLDNN memory (e.g., GetMKLDNNData and GetMKLDNNDataReorder), we create MKLDNN memory with the default layouts and attach it to Mkl_mem_ with SetMKLMem, if Mkl_mem_ doesn’t exist. Even though MKLDNN has two basic layouts for 4D arrays, we always use NCHW in SetMKLMem.

Some of the methods in NDArray can convert the data layout of an NDArray and should be avoided whenever possible. For example, Reorder converts the data in an NDArray to a specified layout and Reorder2Default converts the data to the default layout. The data method also triggers Reorder2Default, so it should be avoided as well if possible.

MKLDNN operator

MXNet has now switched to the NNVM interface, which is a stateless interface. As such, all data structures that need to survive across operator invocations must be stored in thread-local data structures.

The layout of weight arrays:
The current implementation uses the default layout for weight arrays and always passes weight gradients in the default layout to the kvstore. During inference, we may change the layout of weight arrays for better performance. The layout conversion only occurs once for each operator.

Temporary memory management:
Some MKLDNN operators need to reorder the data layout of an NDArray, and we use temporary memory provided by MXNet to store the reordered data. MXNet only provides a single piece of temporary memory inside an operator. However, MKLDNN operators may need multiple pieces of memory for reordered data. The number and sizes of the required pieces depend on the operator and its input arrays. To simplify the calculation, we estimate them with TmpMemMgr during computation. Every MKLDNN operator that requires temporary memory needs to call TmpMemMgr::Init() to reset the estimate and resize the temporary memory. Later on, each piece is allocated from the single piece of temporary memory.
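
The estimate-then-resize idea can be sketched as a small bump allocator. TmpPool and its members are hypothetical and do not reproduce the real TmpMemMgr. The sketch assumes that requests are summed during one invocation and that the next Init() grows the single buffer to that sum. What the real code does when the buffer is too small is not covered here; the sketch simply returns nullptr.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; the class and member names are illustrative only and
// do not reproduce the real TmpMemMgr.
class TmpPool {
 public:
  // Called at the start of an operator: grow the single buffer to the
  // estimate gathered during the previous invocation, then reset the state.
  void Init() {
    if (estimated_ > buffer_.size()) buffer_.resize(estimated_);
    offset_ = 0;
    estimated_ = 0;
  }

  // Hand out `size` bytes from the single buffer, or nullptr if the buffer is
  // too small. Every request is counted so the next Init() can resize.
  void* Alloc(std::size_t size) {
    assert(size > 0);
    estimated_ += size;
    if (offset_ + size > buffer_.size()) return nullptr;  // caller must handle this
    void* ptr = buffer_.data() + offset_;
    offset_ += size;
    return ptr;
  }

  std::size_t capacity() const { return buffer_.size(); }

 private:
  std::vector<unsigned char> buffer_;  // the single piece of temporary memory
  std::size_t offset_ = 0;             // bytes handed out in this invocation
  std::size_t estimated_ = 0;          // bytes requested in this invocation
};
```

In this sketch the first invocation of an operator finds the buffer too small, and the second invocation, after Init(), can serve the same requests from one contiguous buffer.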

Fallback:
MKLDNN operators support only a subset of the cases required by MXNet (certain data types, certain input shapes, certain parameters, etc.). Many unsupported cases can only be detected when the operator is invoked. Therefore, we need to provide a fallback mechanism, implemented by FallBackCompute, inside each operator that has MKLDNN support.
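
A minimal sketch of such an invocation-time check follows. The Tensor struct, the support rule, and the function names are all hypothetical; the real FallBackCompute works on NDArray and is not reproduced here. The sketch only shows the dispatch pattern: test the actual inputs, then take either the accelerated path or the default one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real operators work on NDArray, not this struct.
enum class DType { kFloat32, kFloat64, kInt32 };

struct Tensor {
  DType dtype;
  std::vector<int> shape;
};

// Assumed support rule, for illustration only: float32 data with 2D or 4D shape.
bool SupportsAccelerated(const Tensor& in) {
  return in.dtype == DType::kFloat32 &&
         (in.shape.size() == 2 || in.shape.size() == 4);
}

// Both paths return a tag so that the chosen path can be observed.
std::string AcceleratedCompute(const Tensor&) { return "mkldnn"; }
std::string DefaultCompute(const Tensor&) { return "default"; }

// The check runs per invocation because the data type and shape of the
// inputs are only known at that point.
std::string Compute(const Tensor& in) {
  assert(!in.shape.empty());
  if (SupportsAccelerated(in)) return AcceleratedCompute(in);
  return DefaultCompute(in);  // fallback path
}
```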

MKLDNN output:
We create three functions to assist in writing outputs from MKLDNN operators to NDArray: CreateMKLDNNMem(), CreateMKLDNNWeightGrad() and CommitOutput(). For kAddTo and kWriteInplace, we create a new MKLDNN memory, allocated from the temporary memory, to store the outputs of MKLDNN operators. Otherwise, data can be written to the NDArray directly. kWriteInplace requires new memory because MKLDNN operators don’t support the case where inputs and outputs use the same memory. We provide CreateMKLDNNWeightGrad() specifically for weight gradients because the kvstore currently works on the default layout, so weight gradients must use the default layout. CommitOutput() is invoked to copy or add the data back to the output NDArray.
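
The create-then-commit pattern can be sketched as follows. The names CreateOutput and OutputMem are hypothetical, and the CommitOutput shown here is a simplified stand-in with a different signature from the real function. Plain std::vector<float> buffers replace NDArray and MKLDNN memory. The sketch assumes that a direct write is possible only for a plain write request and that the other two requests go through a scratch buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; the request names mirror the ones in the text, but the
// functions are simplified stand-ins for CreateMKLDNNMem() and CommitOutput().
enum class OpReq { kWriteTo, kWriteInplace, kAddTo };

struct OutputMem {
  std::vector<float>* out;     // the final output array
  std::vector<float> scratch;  // temporary buffer, empty for kWriteTo
  OpReq req;

  // Buffer the operator should write its result to.
  std::vector<float>& WriteBuffer() {
    return req == OpReq::kWriteTo ? *out : scratch;
  }
};

// Decide where the operator writes: directly to the output, or to scratch memory.
OutputMem CreateOutput(std::vector<float>* out, OpReq req) {
  assert(out != nullptr);
  OutputMem mem{out, {}, req};
  if (req != OpReq::kWriteTo) mem.scratch.resize(out->size());
  return mem;
}

// Copy (kWriteInplace) or accumulate (kAddTo) the scratch data into the output.
void CommitOutput(OutputMem* mem) {
  if (mem->req == OpReq::kWriteTo) return;  // already written directly
  for (std::size_t i = 0; i < mem->out->size(); ++i) {
    if (mem->req == OpReq::kAddTo) {
      (*mem->out)[i] += mem->scratch[i];
    } else {
      (*mem->out)[i] = mem->scratch[i];
    }
  }
}
```

Keeping the commit step separate means the operator body can write to whatever buffer it is handed without knowing which request is in effect.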

MKLDNN operator caching:
We cache MKLDNN operator primitives because creating them is very expensive. This overhead accounts for a large share of the overall runtime during inference, where batch sizes are usually small. To cache operator primitives, we maintain a per-thread hashtable, where the key is MKLDNNParamOpSign and the value is the operator primitive. MKLDNNParamOpSign contains the information of the input and output arrays as well as the operator parameters. An operator primitive is reused whenever the input and output arrays and the operator parameters match.
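
A per-thread cache of this kind can be sketched with a thread_local hash map. OpSignature, OpSignatureHash, Primitive, and GetPrimitive are hypothetical names; the real MKLDNNParamOpSign and the real primitive types are not reproduced. The signature is assumed to be a flat list of integers describing shapes, data types, and parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical signature type; the real MKLDNNParamOpSign is not reproduced here.
struct OpSignature {
  std::vector<int> values;  // flattened shapes, data types, and operator parameters
  bool operator==(const OpSignature& other) const { return values == other.values; }
};

struct OpSignatureHash {
  std::size_t operator()(const OpSignature& sig) const {
    std::size_t h = 0;
    for (int v : sig.values) h = h * 31 + std::hash<int>()(v);
    return h;
  }
};

// Stand-in for an operator primitive that is expensive to build.
struct Primitive {
  explicit Primitive(const OpSignature& sig) : signature(sig) {}
  OpSignature signature;
};

// Returns the primitive cached for the calling thread, building it on first use.
// `created` counts constructions so that reuse can be observed.
std::shared_ptr<Primitive> GetPrimitive(const OpSignature& sig, int* created) {
  assert(created != nullptr);
  thread_local std::unordered_map<OpSignature, std::shared_ptr<Primitive>,
                                  OpSignatureHash> cache;
  auto it = cache.find(sig);
  if (it == cache.end()) {
    it = cache.emplace(sig, std::make_shared<Primitive>(sig)).first;
    ++*created;
  }
  return it->second;
}
```

Because the map is thread_local, no locking is needed, at the cost of building each primitive once per thread.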

Weight layout conversion:
We store the weight arrays in the default layout during training and in the MKLDNN layout during inference. If the input weight arrays do not use the expected layout, we modify their layout inside the NDArrays, so we don't need to convert them again the next time. This can significantly increase performance during inference. Currently, we call Reorder2Default() and MKLDNNDataReorder() of NDArray to invoke in-place data layout conversion.
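
The convert-once behavior can be sketched as follows. WeightArray, EnsureLayout, and ExpectedLayout are hypothetical; the sketch tracks only a layout tag and a counter, and leaves out the actual data reordering that Reorder2Default() and MKLDNNDataReorder() perform.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real NDArray methods.
enum class Layout { kDefault, kMkldnn };

struct WeightArray {
  Layout layout = Layout::kDefault;
  int conversions = 0;  // counts how often the data was actually reordered
};

// Convert the array in place only if it is not already in the expected layout.
void EnsureLayout(WeightArray* weight, Layout expected) {
  assert(weight != nullptr);
  if (weight->layout == expected) return;  // nothing to do on later calls
  weight->layout = expected;               // real code would reorder the data here
  ++weight->conversions;
}

// Training expects the default layout; inference expects the MKLDNN layout.
Layout ExpectedLayout(bool is_train) {
  return is_train ? Layout::kDefault : Layout::kMkldnn;
}
```

Since the array remembers its layout, repeated inference calls in this sketch pay for the conversion only on the first call.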