This page is a work in progress to document how MXNet operates at runtime.
To get a first look at what happens at runtime, we can start by leveraging the MXNet profiler and looking at a sample model.
This code creates a 3x3 tensor, extracts the diagonal, and then sums along the diagonal (to compute the “trace”). Using the MXNet profiler, we capture internal MXNet behavior, dump it to a string and print it (`dumps()`), and also dump it to a file (`dump()`). Then we can import that file into Chrome and view it graphically.
```python
import mxnet as mx
import numpy as np
from mxnet import profiler

# configure the profiler
profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
# start the profiler collecting data
profiler.set_state('run')

###########################################################
# 1. create our data
data = np.linspace(1,9,9).reshape((3,3))

# 2. create an MXNet ndarray
a = mx.nd.array(data)

# 3. compute on our data and produce results
b = mx.nd.diag(a)
c = mx.nd.sum(b,-1)

# 4. wait for computation to finish
mx.nd.waitall()
###########################################################

# stop the profiler
profiler.set_state('stop')

# dump the profiling data as a string
print(profiler.dumps())
# dump the profiling data as a json file that can be viewed graphically
profiler.dump()
```
When running this code, the `dumps` function returns the profiling data as a string, which we promptly print. This statistical info is shown below.
```
Profile Statistics.
    Note that counter items are counter values and not time units.
Device Storage
=================
Name                            Total Count     Time (ms)   Min Time (ms)   Max Time (ms)   Avg Time (ms)
----                            -----------     ---------   -------------   -------------   -------------
Memory: cpu/0                             3        0.0520          0.0360          0.0520          0.0080

MXNET_C_API
=================
Name                            Total Count     Time (ms)   Min Time (ms)   Max Time (ms)   Avg Time (ms)
----                            -----------     ---------   -------------   -------------   -------------
MXNDArraySyncCopyFromCPU                  1        0.1600          0.1600          0.1600          0.1600
MXNDArrayGetDType                         1        0.0010          0.0010          0.0010          0.0010
MXNet C API Calls                         8        0.0080          0.0010          0.0080          0.0035
MXImperativeInvokeEx                      2        0.2210          0.0720          0.1490          0.1105
MXNDArrayGetShape                         2        0.0020          0.0010          0.0010          0.0010
MXNet C API Concurrency                  16        0.0000          0.0000          0.0010          0.0005
MXNDArrayWaitAll                          1       10.9030         10.9030         10.9030         10.9030
MXNDArrayCreateEx                         1        0.0200          0.0200          0.0200          0.0200

operator
=================
Name                            Total Count     Time (ms)   Min Time (ms)   Max Time (ms)   Avg Time (ms)
----                            -----------     ---------   -------------   -------------   -------------
sum                                       2        0.0530          0.0250          0.0280          0.0265
ResourceParallelRandomSetSeed             2        9.2730          4.6350          4.6380          4.6365
diag                                      2       12.8620          6.4300          6.4320          6.4310
WaitForVar                                2        0.0160          0.0060          0.0100          0.0080
```
The `dump` function writes the same data out in a different format, to a file that can be opened in Chrome's tracing viewer (`chrome://tracing`) and displayed visually. This can be seen in the diagram below.
The profiling data captures information about the interesting functions that executed while your program was running. Here are some explanations of what each one does.
The functions in the C_API are:
Function Name | Description |
---|---|
MXImperativeInvokeEx | invokes an operator to perform the computation |
MXNDArrayCreateEx | creates an ndarray |
MXNDArrayGetDType | returns the data type of the ndarray |
MXNDArrayGetShape | returns the shape of the ndarray (as a tuple where each element is the size of a dimension) |
MXNDArraySyncCopyFromCPU | called when data initially resides outside of an MXNet data structure (i.e. in a numpy.ndarray rather than an MXNet NDArray); the data is copied into the MXNet data structure |
MXNDArrayWaitAll | waits for all asynchronous operations in MXNet to finish. This function is typically only used when benchmarking, to force outstanding work to complete. In a real program there is no explicit waiting; data dependencies are evaluated and computation is executed as needed, in an As Late As Possible (ALAP) fashion |
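To make this concrete, here is a rough sketch (based on the table above, not an exhaustive mapping) of which C API entry points the frontend calls in our example go through:

```python
import mxnet as mx
import numpy as np

data = np.linspace(1, 9, 9).reshape((3, 3))

# Creating an MXNet ndarray from numpy data goes through:
#   MXNDArrayCreateEx                     - allocate the ndarray
#   MXNDArrayGetShape / MXNDArrayGetDType - query attributes while setting it up
#   MXNDArraySyncCopyFromCPU              - copy the numpy values into MXNet's buffer
a = mx.nd.array(data)

# Each imperative operator call goes through MXImperativeInvokeEx, which
# enqueues the work and returns without waiting for the result.
b = mx.nd.diag(a)
c = mx.nd.sum(b, -1)

# MXNDArrayWaitAll blocks until all pending asynchronous work has finished.
mx.nd.waitall()
```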
The function in the Engine API is:
Function Name | Description |
---|---|
WaitForVar | Takes a variable reference as input and waits until that variable has been computed before returning |
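WaitForVar is the engine-level counterpart of waiting on a single array instead of on everything. A minimal sketch, assuming the standard NDArray `wait_to_read()` method (which, as far as I understand, is what issues this per-variable wait):

```python
import mxnet as mx

a = mx.nd.ones((1000, 1000))
b = mx.nd.dot(a, a)   # launched asynchronously; returns immediately

# Block until just `b` has been computed (a wait on b's variable),
# whereas mx.nd.waitall() would wait for every outstanding operation.
b.wait_to_read()
```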
Other API functions:
Function Name | Description |
---|---|
ResourceParallelRandomSetSeed | sets the random number generator seed |
Operators we intended to call in the code:
Operator Name | Description |
---|---|
sum | sum a tensor along a particular axis |
diag | compute the diagonal of the tensor |
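To make the two operators concrete, here is what they produce for our 3x3 input (the expected values for this matrix):

```python
import mxnet as mx
import numpy as np

a = mx.nd.array(np.linspace(1, 9, 9).reshape((3, 3)))
# a = [[1. 2. 3.]
#      [4. 5. 6.]
#      [7. 8. 9.]]

b = mx.nd.diag(a)      # main diagonal: [1. 5. 9.]
c = mx.nd.sum(b, -1)   # sum along the last axis: 15, the trace of the matrix

print(b.asnumpy(), c.asnumpy())
```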
What's happening here...
From the code, we can identify the major events in our test application:
1. Initialize our input data
2. Create a new MXNet ndarray using our existing data values
3. Compute on our data:
   - produce the diagonal of the input data
   - sum along the diagonal to compute the “trace” of the matrix
4. Wait for the computation to finish (only needed when profiling)
Nothing MXNet-related happens before step #2; those calls are all regular NumPy functions. When step #2 happens and we create an MXNet ndarray, quite a few things occur. The screenshot below shows a zoomed-in portion of the timeline.
Here, the four red arrows show the important events in this sequence.
- First, MXNDArrayCreateEx is called to physically allocate space for the data and the other necessary attributes of the ndarray
- Then some support functions (MXNDArrayGetShape, MXNDArrayGetDType) are called while initializing the data structure
- Finally, the data is copied from the non-MXNet ndarray into the newly prepared MXNet ndarray by the MXNDArraySyncCopyFromCPU function
Next, step #3 begins the computation that produces our output data. The screenshot below shows this behavior.
Here you can see the following sequence of events:
- MXImperativeInvokeEx is called for the first time to launch the diag operator from step #3
- Soon after that, the actual diag operator begins executing in another thread
- While that is happening, our main thread moves on and calls MXImperativeInvokeEx again to launch the sum operator. Just like before, this returns without actually executing the operator.
- Lastly, MXNDArrayWaitAll is called as the main thread progresses to step #4 in our app. It waits here until all the computation finishes.
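One simple way to observe this asynchronous behavior outside of the profiler is to time the operator launch separately from the wait; a rough sketch (the matrix size is arbitrary and the exact timings will vary by machine):

```python
import time
import mxnet as mx

a = mx.nd.random.uniform(shape=(2000, 2000))

start = time.time()
b = mx.nd.dot(a, a)          # the operator is only enqueued here
launch_time = time.time() - start

mx.nd.waitall()              # block until the matrix multiply actually finishes
total_time = time.time() - start

# launch_time is typically tiny compared to total_time, showing that the
# main Python thread does not wait for the operator to execute.
print(launch_time, total_time)
```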
Next, let's look at a view of the timeline zoomed in to the actual operator execution.
Here there are 3 main events happening:
- The diag operator is executing first
- Then the ResourceParallelRandomSetSeed runs
- And finally the sum operator executes (for a very short time as shown by the big red arrow)
The diag operator running makes sense (although it seems to take a little longer than we'd like), and at the end the sum operator runs (very quickly!). But the odd part in the middle is this ResourceParallelRandomSetSeed running. It is part of the MXNet resource manager, which handles the temporary space and random number generators needed by operators. The sum operator requests temporary space in order to compute the sum, and therefore launches the resource manager (for the first time) here. As part of its startup sequence, the resource manager initializes the random number generator by setting its seed, so this is one-time initialization overhead. Let's run the app again, executing the computation twice, and look at the second run to remove this initialization from our profiling.
Here's the modified code:
```python
import mxnet as mx
import numpy as np
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
profiler.set_state('run')

################
# first run
sdata = np.linspace(1,9,9).reshape((3,3))
sa = mx.nd.array(sdata)
sb = mx.nd.diag(sa)
sc = mx.nd.sum(sb,-1)
mx.nd.waitall()
################

################
# second run
data = np.linspace(1,9,9).reshape((3,3))
a = mx.nd.array(data)
b = mx.nd.diag(a)
c = mx.nd.sum(b,-1)
mx.nd.waitall()
################

profiler.set_state('stop')
print(profiler.dumps())
profiler.dump()
```
Notice that we renamed the variables and made another copy after the waitall call. This is so that MXNet doesn't have to worry about reusing variables, and so that the second run is cleanly segmented after the first-time initialization.
Here's an overview of the new timeline:
The first red box is the first run, and the second, smaller one is the second run. First off, we can see how much smaller the second run is now, without any of the initialization routines. Here is a zoomed-in view of just the second run.
We still have the same sequence of events at the beginning to initialize the MXNet ndarray (MXNDArrayCreateEx, MXNDArrayGetShape, MXNDArrayGetDType, MXNDArraySyncCopyFromCPU). Then the diag operator runs, followed by the sum operator, and finally the waitall. When you look at this, be careful about the assumptions that you make. In this version of the timeline, the operator appears to execute right after MXImperativeInvokeEx returns, which seems to imply an inherent ordering. But realize that there is no dependency between the diag operator finishing and the next MXImperativeInvokeEx launching the sum operator. In this case it just so happens that the diag operator finishes so quickly that it appears that way. In reality the main thread is launching the operators and not waiting for them to finish. Lastly, keep in mind that in this case, by the time we hit MXNDArrayWaitAll everything is already done and we return immediately; in other circumstances it may sit there waiting for everything to finish (as we saw earlier in the first run).
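As an alternative to renaming variables, the warm-up run could also be kept out of the profile entirely by pausing the profiler around it; a sketch assuming `profiler.pause()` and `profiler.resume()` are available in your MXNet version:

```python
import mxnet as mx
import numpy as np
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
profiler.set_state('run')

################
# warm-up run, excluded from the profile
profiler.pause()
sa = mx.nd.array(np.linspace(1,9,9).reshape((3,3)))
sc = mx.nd.sum(mx.nd.diag(sa), -1)
mx.nd.waitall()
profiler.resume()
################

################
# profiled run
a = mx.nd.array(np.linspace(1,9,9).reshape((3,3)))
c = mx.nd.sum(mx.nd.diag(a), -1)
mx.nd.waitall()
################

profiler.set_state('stop')
print(profiler.dumps())
profiler.dump()
```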
Summary
We ran the MXNet profiler on a simple application and tried to understand the runtime behavior of MXNet. We were able to see some underlying functionality that manages operator execution and data allocation and some unexpected initialization routines.
Known issues:
- The number of times an operator is counted as executed in the profiler is 2x the real value - https://github.com/apache/incubator-mxnet/issues/10520