Problem Statement

MXNet leverages multithreading on CPUs and GPUs in several areas: the dependency engine that runs operators in parallel, the logic inside operator implementations, and data loading through iterators. This design helps MXNet achieve great performance but adds usability challenges. Below I demonstrate two scenarios where MXNet doesn't handle exceptions gracefully and crashes the main thread.

Example 1

Code Block
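import mxnet as mx
# The scale (second argument) is -1, which is invalid for a Gaussian
mx.nd.random_normal(0, -1, (2,3))
# Block until all pending asynchronous operations have finished
mx.nd.waitall()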



...

Code Block
terminate called after throwing an instance of 'dmlc::Error'
  what():  [02:32:04] ../src/engine/./threaded_engine.h:359: [02:32:04] ../src/operator/random/./sample_op.h:301: Check failed: param.scale > 0 (-1 vs. 0) scale parameter in gaussian has to be positive
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN4dmlc10StackTraceB5cxx11Ev+0x54) [0x7eff0140bf5b]
[bt] (1) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x2a) [0x7eff0140c242]
[bt] (2) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet2op12SampleMasterIN7mshadow3cpuENS0_13NormalSamplerIS3_EEE2opERKN4nnvm9NodeAttrsERKNS_9OpContextERKNS_9OpReqTypeEPNS_5TBlobE+0x120) [0x7eff01a56c8a]
[bt] (3) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet2op7Sample_IN7mshadow3cpuENS0_13NormalSamplerIS3_EEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISE_EERKSD_INS_9OpReqTypeESaISJ_EESI_+0xa1) [0x7eff01a4e8ca]
[bt] (4) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt17_Function_handlerIFvRKN4nnvm9NodeAttrsERKN5mxnet9OpContextERKSt6vectorINS4_5TBlobESaIS9_EERKS8_INS4_9OpReqTypeESaISE_EESD_EPSJ_E9_M_invokeERKSt9_Any_dataS3_S7_SD_SI_SD_+0x91) [0x7eff01606165]
[bt] (5) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNKSt8functionIFvRKN4nnvm9NodeAttrsERKN5mxnet9OpContextERKSt6vectorINS4_5TBlobESaIS9_EERKS8_INS4_9OpReqTypeESaISE_EESD_EEclES3_S7_SD_SI_SD_+0xa6) [0x7eff03d1732c]
[bt] (6) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextEE_clES1G_+0x1f2) [0x7eff03e691f6]
[bt] (7) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextEEZNS0_10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS0_9OpContextERKSt6vectorINS0_5TBlobESaISD_EERKSC_INS0_9OpReqTypeESaISI_EESH_EEPKNS5_2OpES8_RKNS0_7ContextERKSC_IPNS0_6engine3VarESaISZ_EES13_RKSC_INS0_8ResourceESaIS14_EERKSC_IPNS0_7NDArrayESaIS1A_EES1E_RKSC_IjSaIjEESM_EUlS1_E_E9_M_invokeERKSt9_Any_dataOS1_+0x44) [0x7eff03e712f5]
[bt] (8) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNKSt8functionIFvN5mxnet10RunContextEEEclES1_+0x56) [0x7eff03c731fc]
[bt] (9) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(+0x39b9f33) [0x7eff03c90f33]
 
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN4dmlc10StackTraceB5cxx11Ev+0x54) [0x7eff0140bf5b]
[bt] (1) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x2a) [0x7eff0140c242]
[bt] (2) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x4f6) [0x7eff03c7cb44]
[bt] (3) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9CPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x9d) [0x7eff03c878c9]
[bt] (4) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS0_8OprBlockEbENKUlvE_clEvENKUlSt10shared_ptrINS0_10ThreadPool11SimpleEventEEE_clES8_+0x56) [0x7eff03c85774]
[bt] (5) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x5c) [0x7eff03c8a424]
[bt] (6) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNKSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEEclES5_+0x49) [0x7eff03c900f3]
[bt] (7) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES6_EE9_M_invokeIILm0EEEEvSt12_Index_tupleIIXspT_EEE+0x68) [0x7eff03c90066]
[bt] (8) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES6_EEclEv+0x2c) [0x7eff03c8fefa]
[bt] (9) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x1c) [0x7eff03c8fe4a]

 

Example 2

Code Block
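import mxnet as mx

data_path = 'manual_2.csv'
data_train = None
try:
    # Each CSV row holds 4 values, but data_shape=(4, 10) expects 40 per sample
    data_train = mx.io.CSVIter(data_csv=data_path, data_shape=(4, 10),
                               batch_size=1)
    for batch in iter(data_train):
        print(data_train.getdata().asnumpy())
except mx.base.MXNetError:
    # Not reached: the error is thrown on the prefetcher thread
    print('Exception handled')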

...

Code Block
terminate called after throwing an instance of 'dmlc::Error'
  what():  [02:08:14] ../src/io/iter_csv.cc:125: Check failed: row.length == shape.Size() (4 vs. 40) The data size in CSV do not match size of shape: specified shape=[4,10], the csv row-length=4
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN4dmlc10StackTraceB5cxx11Ev+0x54) [0x7febfb693f5b]
[bt] (1) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x2a) [0x7febfb694242]
[bt] (2) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet2io7CSVIter7AsTBlobERKN4dmlc3RowIjEERKN4nnvm6TShapeE+0x14a) [0x7febfe0d9832]
[bt] (3) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet2io7CSVIter4NextEv+0x25e) [0x7febfe0d9312]
[bt] (4) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZN5mxnet2io11BatchLoader4NextEv+0xa1) [0x7febfe0653f3]
[bt] (5) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZZN5mxnet2io14PrefetcherIter4InitERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_ESaISA_EEENKUlPPNS_9DataBatchEE_clESH_+0x50) [0x7febfe04bf98]
[bt] (6) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt17_Function_handlerIFbPPN5mxnet9DataBatchEEZNS0_2io14PrefetcherIter4InitERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESE_ESaISF_EEEUlS3_E_E9_M_invokeERKSt9_Any_dataOS3_+0x37) [0x7febfe053473]
[bt] (7) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNKSt8functionIFbPPN5mxnet9DataBatchEEEclES3_+0x49) [0x7febfe053797]
[bt] (8) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZZN4dmlc12ThreadedIterIN5mxnet9DataBatchEE4InitESt8functionIFbPPS2_EES4_IFvvEEENKUlvE_clEv+0x311) [0x7febfe0512eb]
[bt] (9) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/libmxnet.so(_ZNSt12_Bind_simpleIFZN4dmlc12ThreadedIterIN5mxnet9DataBatchEE4InitESt8functionIFbPPS3_EES5_IFvvEEEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE+0x28) [0x7febfe05a188]

 

Why is this a problem?

Proper exception handling and propagation in MXNet matters for two types of users. The first is MXNet users who use one of our APIs to build or test a model; the second is MXNet service owners who run MXNet in production for DL-enabled services.

...

See the community requests here: https://github.com/apache/incubator-mxnet/issues/7335

Exception Handling for Iterators

Approach

MXNet uses a general IO processing pipeline based on the ThreadedIter class in dmlc-core.

...

The main thread maintains a queue of exception_ptrs and checks whether the queue is non-empty. If it is, the main thread pulls out the exception_ptr and rethrows the exception.
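The queue-based hand-off described above can be modeled in a few lines. The following is a minimal sketch in Python, not the dmlc-core C++ code: `ThreadedIterModel`, `next_batch`, and `bad_source` are hypothetical names, and a Python exception object stands in for a C++ exception_ptr. The producer thread captures any exception into a queue, and the consuming thread checks that queue and rethrows.

```python
import queue
import threading


class ThreadedIterModel:
    """Toy model of a producer-thread iterator (hypothetical, not dmlc-core's API)."""

    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, source):
        self._source = source          # generator function; may raise
        self._items = queue.Queue()
        self._errors = queue.Queue()   # stands in for the queue of exception_ptrs
        threading.Thread(target=self._produce, daemon=True).start()

    def _produce(self):
        try:
            for item in self._source():
                self._items.put(item)
        except Exception as exc:
            # Capture the exception instead of letting it escape the thread.
            self._errors.put(exc)
        finally:
            self._items.put(self._DONE)

    def next_batch(self):
        """Return the next item, or None at end of stream; rethrow producer errors."""
        item = self._items.get()
        if item is self._DONE:
            self._items.put(self._DONE)  # keep later calls from blocking
        if not self._errors.empty():
            # Rethrow on the consuming (main) thread.
            raise self._errors.get()
        return None if item is self._DONE else item


def bad_source():
    yield [1, 2, 3, 4]
    raise ValueError("row length does not match data_shape")


it = ThreadedIterModel(bad_source)
caught = None
try:
    while it.next_batch() is not None:
        pass
except ValueError as exc:
    caught = str(exc)  # the producer's error surfaces as a normal exception
```

With this shape, a failure on the producer thread reaches the caller as an ordinary exception that a try/except block can handle, which is the behavior Example 2 was missing.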

Proof of Concept

https://github.com/dmlc/dmlc-core/pull/355

 

Open Questions

1. Should we keep a queue of arbitrary size or a queue of size 1? One advantage of an arbitrary-size queue is that it can store exceptions from multiple threads when more than one thread throws. One disadvantage is that it consumes extra memory.
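To make the trade-off concrete, here is a small Python sketch, assuming two worker threads fail one after the other. The `record` helper and the queue names are hypothetical; Python's `queue.Queue` only stands in for whatever container the C++ implementation would use.

```python
import queue


def record(errors, exc):
    """Store exc if the queue has room; return whether it was kept."""
    try:
        errors.put_nowait(exc)
        return True
    except queue.Full:
        return False


single = queue.Queue(maxsize=1)  # keeps only the first exception
unbounded = queue.Queue()        # maxsize=0 means no size limit

failures = [ValueError("thread A failed"), RuntimeError("thread B failed")]
kept_single = [record(single, exc) for exc in failures]
kept_unbounded = [record(unbounded, exc) for exc in failures]
```

In this sketch the size-1 queue silently drops the second failure, while the unbounded queue keeps both at the cost of holding every exception object until it is consumed.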

Exception Handling for Operators

Approach 1

  • Add an exception_ptr member opr_ex to ThreadedOpr and an exception_ptr member var_ex to ThreadedVar.
  • Put a try-catch block in ExecuteOprBlock around the execution of the operator.
  • If an exception is thrown during the execution of the operator, catch it and make the ThreadedOpr's exception_ptr member point to the exception object. We explicitly call the callback in this case.
  • In the callback, set the exception_ptr member of every variable that the current operator will mutate.
  • In the callback, also set the exception_ptr member of the current operator to the one held by one of its dependencies. This propagates an earlier exception_ptr down the dependency chain.
  • Also set global_exc_ptr depending on whether an exception is associated with a read var.
  • In WaitForVar, check whether threaded_var->var_ex is set. If it is, rethrow the exception. Because we are waiting for this var, an exception associated with it means an exception was thrown somewhere on the dependency path leading to the var.
  • In WaitForAll, rethrow the exception if global_exc_ptr is set.
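The steps above can be sketched with a synchronous toy model. This is a hedged illustration in Python, not the engine code: `Var`, `EngineModel`, `push`, and `bad_sample` are hypothetical names, operators run inline instead of on worker threads, and two details the text leaves open are filled in as labeled assumptions (which exception wins when both the operator and a read var have one, and that any failure is recorded globally).

```python
class Var:
    """Models ThreadedVar; var_ex plays the role of its exception_ptr member."""

    def __init__(self, name):
        self.name = name
        self.var_ex = None


class EngineModel:
    """Synchronous toy engine (hypothetical); the real engine uses worker threads."""

    def __init__(self):
        self.global_ex = None  # models global_exc_ptr

    def push(self, fn, read_vars, write_vars):
        opr_ex = None  # models the operator's opr_ex member
        try:
            fn()  # Approach 1 runs the operator even if a dependency failed
        except Exception as exc:
            opr_ex = exc
        self._on_complete(opr_ex, read_vars, write_vars)

    def _on_complete(self, opr_ex, read_vars, write_vars):
        # Assumption: an exception already held by a read var takes precedence,
        # so the original failure is the one reported downstream.
        for var in read_vars:
            if var.var_ex is not None:
                opr_ex = var.var_ex
                break
        if opr_ex is not None:
            for var in write_vars:
                var.var_ex = opr_ex
            self.global_ex = opr_ex  # simplification: record any failure globally

    def wait_for_var(self, var):
        if var.var_ex is not None:
            raise var.var_ex

    def wait_for_all(self):
        if self.global_ex is not None:
            raise self.global_ex


def bad_sample():
    raise ValueError("scale parameter has to be positive")


engine = EngineModel()
a, b, c = Var("a"), Var("b"), Var("c")
ran = []
engine.push(bad_sample, [], [a])                  # fails and marks a
engine.push(lambda: ran.append("op2"), [a], [b])  # still runs, then marks b
engine.push(lambda: ran.append("op3"), [], [c])   # independent, c stays clean
```

Waiting on `b` in this model rethrows the error raised while computing `a`, waiting on `c` succeeds, and `op2` is executed even though its input already failed.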

Proof of Concept for Approach 1

https://github.com/apache/incubator-mxnet/pull/9373

Approach 2

  • Add an exception_ptr member opr_ex to ThreadedOpr and an exception_ptr member var_ex to ThreadedVar.
  • Put a try-catch block in ExecuteOprBlock around the execution of the operator. Do not execute the operator if the threaded_opr already holds an exception.
  • Functions pushed using PushAsync will take three parameters instead of two: on_start, execute, on_complete.
  • The on_start callback propagates the exception_ptr to the operator when its read dependencies have an exception_ptr associated with them.
  • The on_complete callback propagates the exception_ptr to the write_vars when the threaded_opr has an exception associated with it.
  • The logic to rethrow the exception in WaitForVar and WaitForAll should be the same as in Approach 1.
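The three-phase flow listed above can be modeled the same way. This is a minimal Python sketch under stated assumptions, not the engine code: `Var`, `Opr`, and `run` are hypothetical names, the three phases are plain functions called in order on one thread, and a Python exception object stands in for an exception_ptr.

```python
class Var:
    """Models ThreadedVar; var_ex plays the role of its exception_ptr member."""

    def __init__(self, name):
        self.name = name
        self.var_ex = None


class Opr:
    """Models ThreadedOpr; opr_ex plays the role of its exception_ptr member."""

    def __init__(self, fn, read_vars, write_vars):
        self.fn = fn
        self.read_vars = read_vars
        self.write_vars = write_vars
        self.opr_ex = None


def on_start(opr):
    # Inherit an exception from the first read dependency that has one.
    for var in opr.read_vars:
        if var.var_ex is not None:
            opr.opr_ex = var.var_ex
            return


def execute(opr):
    if opr.opr_ex is not None:
        return  # a dependency failed, so the operator body is skipped
    try:
        opr.fn()
    except Exception as exc:
        opr.opr_ex = exc


def on_complete(opr):
    # Pass the operator's exception on to everything it would have written.
    if opr.opr_ex is not None:
        for var in opr.write_vars:
            var.var_ex = opr.opr_ex


def run(opr):
    on_start(opr)
    execute(opr)
    on_complete(opr)


def bad_sample():
    raise ValueError("scale parameter has to be positive")


a, b, c = Var("a"), Var("b"), Var("c")
ran = []
run(Opr(bad_sample, [], [a]))                  # fails and marks a
run(Opr(lambda: ran.append("op2"), [a], [b]))  # skipped, but b is still marked
run(Opr(lambda: ran.append("op3"), [], [c]))   # independent, runs normally
```

Unlike the Approach 1 model, `op2` never executes here: the failure on `a` is detected in `on_start` and only forwarded to `b`.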

Comparison of Approach 1 and Approach 2

Approach 1 | Approach 2
Forces execution of operators even if previous operators failed. This can be a problem if operators that run after a failed operator throw exceptions other than dmlc::Error. | Once an operator fails, none of the operators that depend on it are executed.
Minimal API changes. | The lambda closure expected by PushAsync has a different signature after adding the on_start callback.
When an exception is thrown, there is the overhead of executing subsequent operators. | When an exception is thrown, there is no overhead of executing subsequent operators.
Performance impact should be minimal when no exception is thrown. | Performance impact needs to be investigated because of the additional overhead of the on_start callback even when no exception is thrown.

Recommendation

My recommendation is to take Approach 1, since it introduces minimal API changes and minimal performance impact when no exception is thrown.

...

Since the performance impact of both approaches should be similar, and since Approach 2 has the advantage of not executing subsequent operators and addresses the issue with Approach 1, the recommendation is to proceed with Approach 2.

Open Questions

1. How should the WaitForAll situation be handled?

...