MXNet’s current threading model works well for most of our use cases, especially when it is accessed from single-threaded Python. However, several users have asked for advice on how best to handle requests in parallel when hosting an MXNet-based inference service (e.g. a neural machine translation service, an image classification service, etc.). Some users are surprised that MXNet isn’t natively thread-safe. This document describes what the threading model looks like for MXNet when running inference. It also proposes a small, non-breaking change to the C API that would make it easier for service owners to run inference requests in parallel.
There are typically two mechanisms for handling parallel requests in a web service: either (1) run a thread pool or (2) use a single thread with asynchronous request processing. MXNet was not designed to be thread-safe, and retroactively imposing thread-safety on a large library is often a challenging task, so supporting parallel requests via method 1 is likely not a promising approach for MXNet in the short term. However, most important actions in MXNet are lazily evaluated in an engine thread, and the MXNet API is already mostly asynchronous, so supporting parallel requests via method 2 seems like a natural fit. There’s only one blocking call that prevents us from writing services that support parallel requests via asynchronous request processing, and this blocking call can be worked around quite easily.
The diagram above shows the current threading model during inference as I understand it. The most important thing to note is that when we run inference or read a result, we must do it on a single main thread (a.k.a. the dispatcher thread). It’s therefore very important that we block this thread as little as possible. This is not a problem when we’re submitting predictions: we can call forward on a symbol as many times as we like, and each forward call quickly kicks off a computation graph that will eventually be executed in the engine.
We can eventually read the results of running forward from our output NDArray by calling asnumpy() (which in turn calls wait_to_read), but at this point we run into a potential problem. If, for example, we submit four requests of various sizes, we are forced to choose one of them to read first, which blocks our main thread for a potentially significant amount of time. We can’t submit any more work until the main thread is unblocked.
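A minimal sketch of this head-of-line blocking in plain Python, using concurrent.futures as a stand-in for MXNet's engine. Here `fake_inference` and the delay values are invented for illustration, and `result()` plays the role of the blocking asnumpy() call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated "engine": each job finishes after a different delay.
# (Stand-in for MXNet engine threads; not a real MXNet API.)
def fake_inference(request_id, delay):
    time.sleep(delay)
    return request_id

executor = ThreadPoolExecutor(max_workers=4)

# Four requests of varying "sizes"; the first submitted is the slowest.
futures = [executor.submit(fake_inference, i, d)
           for i, d in enumerate([0.4, 0.1, 0.1, 0.1])]

# Reading results in submission order blocks the main thread on the
# slow first request, even though the later requests finished long ago.
start = time.time()
in_order = [f.result() for f in futures]  # result() ~ asnumpy()
blocked_for = time.time() - start
```

While the main thread is stuck inside that first `result()` call (roughly 0.4 s here), it cannot submit any new work, which is exactly the problem described above.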
Another consideration is that for many models the order of results is not guaranteed. This means that if we wait for results in the order in which the requests were issued, we will likely see higher average latency than if we could read results in the order they complete. A better solution would be to call asnumpy() only on outputs that we know are finished. There are a few ways we could do this, but one relatively simple way would be to expose a can_read property on NDArrays that lets us know when they’re ready to read. Services could then poll their output NDArrays, and as soon as one is ready, read the result and pass it back to a request thread.
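To illustrate the polling pattern, here is a sketch in plain Python where `Future.done()` plays the role of the proposed can_read property (`fake_inference` and the delays are again illustrative stand-ins, not MXNet APIs):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a forward pass handed to MXNet's engine.
def fake_inference(request_id, delay):
    time.sleep(delay)
    return request_id

executor = ThreadPoolExecutor(max_workers=4)

# Request 0 is the slowest; the rest finish in increasing order.
pending = {i: executor.submit(fake_inference, i, d)
           for i, d in enumerate([0.4, 0.1, 0.2, 0.3])}

completed_order = []
while pending:
    for request_id, fut in list(pending.items()):
        if fut.done():  # done() ~ the proposed can_read property
            # Safe to read: this result() call can no longer block.
            completed_order.append(fut.result())
            del pending[request_id]
    time.sleep(0.01)  # poll interval; the main thread never blocks long
```

Results are consumed in completion order rather than submission order, so the slow request 0 no longer delays delivery of the three fast results.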
Another approach worth considering would be simply to call asnumpy() on request threads as soon as a prediction is started. This would block request threads, but it would never block our main thread. My understanding is that although asnumpy() only reads data, it still mutates state, so calling it directly from request threads would not be thread-safe. (It would be great if a core dev could correct me if this is not the case.)
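For comparison, this is what the blocking-read-per-request-thread pattern would look like, again simulated with stdlib primitives rather than real MXNet calls, and with the caveat above that calling asnumpy() off the main thread may not actually be safe in MXNet today:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a forward pass executed by MXNet's engine.
def fake_inference(x):
    time.sleep(0.05)
    return x * 2

engine = ThreadPoolExecutor(max_workers=2)  # simulated engine
results = {}

def request_thread(request_id, fut):
    # The blocking read (~ asnumpy()) happens on the request thread,
    # so the main thread is never blocked waiting for a result.
    results[request_id] = fut.result()

threads = []
for i in range(4):
    fut = engine.submit(fake_inference, i)  # submitted from the main thread
    t = threading.Thread(target=request_thread, args=(i, fut))
    t.start()
    threads.append(t)

for t in threads:
    t.join()
```

In this sketch the blocking waits are safe because `Future.result()` is thread-safe; the open question above is whether asnumpy() gives the same guarantee.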
MXNet 1 Proposal
Therefore, our proposal is to expose a can_read property on NDArrays that is set to true when the underlying array is ready to be read. As a medium-term solution this would enable parallel request processing for services, example here. It would also not be a large code change, and it would not be a breaking change to any public APIs.
Long Term Proposal
In the long term we propose that we make all inference calls to read-only models thread-safe.
Limits to Parallel Processing
There are some use cases where parallel processing would not be advisable.
- If a user is running on CPU and only has access to a limited number of cores, processing multiple requests in parallel will cause excessive pre-emptive multitasking, which will lower the cache-hit rate and dramatically reduce performance.
- If a user runs on a GPU, they must carefully ensure that they don’t exceed the amount of memory available on the GPU. They won’t have to worry about duplicating weights in GPU RAM, since only a single copy is stored, but running inference several times in parallel still requires working memory, so there’s a limit to how many requests can run concurrently. There may also be latency/throughput tradeoffs to consider, so it’s worth running benchmarks to find the optimal number of parallel requests for a given use case.
- If a user does any kind of online learning, parallel requests will likely not work well, as the weights could be mutated and/or read by several threads at once.