Title

Link to dev List discussion

https://lists.apache.org/thread.html/3a1d3fdd1fd76617792e7b7129a4a5dba34bea8462bff6173b58426a@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

YiZhi Liu

Problem

MXNet Scala uses native memory to manage NDArray, Symbol, Executor and DataIterator objects through the MXNet c_api. The C APIs provide interfaces to create, access and free these objects, and MXNet Scala has corresponding wrappers and APIs that hold pointer references to the native memory.

Current JVM users (Scala/Clojure/Java) of Apache MXNet have to manage MXNet objects manually using the dispose pattern. There are a few usability problems with this approach:

Users have to track MXNet objects manually and remember to call dispose. This is not idiomatic Java and not user-friendly; quoting a user, "this feels like I am writing C++ code which I stopped ages ago".
Memory leaks occur if dispose is not called.
Many objects in MXNet-Scala are managed in native memory, so dispose has to be called on them as well.
Code becomes bloated with dispose() calls.
Memory leaks are hard to debug.

Goals/Usecases

Provide MXNet JVM Users automated memory management which can release native memory when there are no references to JVM objects.
Be able to manage both GPU and CPU Memory automatically.
Performance does not degrade with automated memory management.

User Experience

With this change, MXNet-Scala users will be able to use MXNet Objects in 3 different ways:

1) Use MXNet Objects like regular Java Objects and let the ResourceHandler deal with deallocating off-heap memory. The user needs to select the right deallocation strategy, i.e. run System.gc() periodically, run System.gc() when off-heap memory reaches a certain threshold, or let the JVM decide when to run garbage collection based on pressure on the JVM heap.

This approach may not be sufficient when memory-intensive objects such as NDArrays are allocated greedily and the garbage collection interval has not yet elapsed, or when the call to System.gc() is not honored.

2) Use ResourceScope. ResourceScope collects all MXNet Object Off-heap pointer references within a block scope and releases them at the end of the scope.

This follows the try-with-resources paradigm in Java7+ and is similar to JavaCPP's PointerScope and NDArrayCollector that YiZhi Liu has implemented, but is enhanced to handle MXNet Objects in a generic way.

Since this uses ThreadLocal to manage the scopes, it is not thread-safe in applications such as producer-consumer pipelines, where the producer might leave its scope before the consumer has consumed the object, leaving the object unavailable.

Scala

Scala does not support try-with-resources, so an alternative, the using method, is made available on ResourceScope. At the end of the using method, the NativeResources created within the block scope are disposed. This approach was suggested by Martin Odersky (creator of Scala) in slide 21 of his FOSDEM 2009 talk.

    ResourceScope.using(new ResourceScope()) {
      // Both arrays register with the enclosing scope and are disposed when the block exits.
      val r = NDArray.ones(Shape(2, 2))
      val r1 = NDArray.ones(Shape(50, 50))
    }

Java

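Not yet tested in Java; the snippet below is illustrative only.

try (ResourceScope rs = new ResourceScope()) {
	// Arrays created here are released when the try block exits.
	NDArray r = NDArray.ones(Shape(2,2));
	NDArray r1 = NDArray.ones(Shape(50,50));
}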

3) Call dispose() explicitly at the end of using each MXNet object. This is the current experience in MXNet aside from NDArrayCollector.

val nd: NDArray = NDArray.ones(Shape(2,2))
nd.dispose()

Open Questions

How do we calculate bytesAllocated for MXNet Objects such as Symbol, Executor and DataIterator?
bytesAllocated for an NDArray is calculated as the product of its Shape dimensions multiplied by the size of its dtype, e.g. sizeof(Float32).
Running multiple examples for a long period of time, I did not experience any issues when Objects were freed on a separate thread.
Can there be a situation where a native pointer is still in use but the Scala Object is no longer reachable?
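The NDArray size estimate mentioned in the questions above can be worked through as simple arithmetic. The following is a minimal sketch only; NDArrayBytes and its method are hypothetical names, not part of MXNet, and the element size is passed in by the caller.

```java
// Hypothetical helper, not an MXNet API: estimates the native bytes held by an NDArray.
final class NDArrayBytes {
    // bytes = (product of all shape dimensions) * (bytes per element of the dtype)
    static long bytesAllocated(int[] shape, int bytesPerElement) {
        long elements = 1;
        for (int dim : shape) {
            elements *= dim;
        }
        return elements * bytesPerElement;
    }
}
```

For a Float32 array of Shape(50,50) this gives 50 * 50 * 4 = 10000 bytes. The open question for Symbol, Executor and DataIterator remains, since they have no comparable shape to multiply.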

Proposed Approach

For Automatic Resource Management, a NativeResource interface is used that maintains references to the nativeAddress and the nativeDeAllocator address of the MXNet Object.

An Object of type NativeResourceRef tracks the NativeResource using a PhantomReference.

MXNet Objects extend NativeResource, set nativeAddress, bytesAllocated and nativeDeAllocator during object creation, and call NativeResource.register to register a phantomRef.

When the Garbage Collector runs and finds that a NativeResource Object is no longer reachable, it adds the tracking PhantomReference to the specified ReferenceQueue. A separate cleanup thread blocks on the ReferenceQueue's remove method and releases the native memory upon notification.
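The tracking mechanism described above can be sketched with the JDK's PhantomReference and ReferenceQueue alone. This is a minimal illustration under assumptions, not the prototype's code: the class below borrows the name NativeResourceRef, but its methods (register, release, drainQueue, startCleanupThread) and the use of a Runnable as the deallocator are hypothetical.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a phantom reference that remembers how to free
// the native memory owned by its referent.
final class NativeResourceRef extends PhantomReference<Object> {
    private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // The refs themselves must stay strongly reachable, otherwise they
    // could be collected before the GC gets a chance to enqueue them.
    private static final Set<NativeResourceRef> LIVE = ConcurrentHashMap.newKeySet();

    private final Runnable deAllocator;

    private NativeResourceRef(Object owner, Runnable deAllocator) {
        super(owner, QUEUE);
        this.deAllocator = deAllocator;
    }

    // Called once when a tracked object is created. The deAllocator must not
    // capture the owner, or the owner would never become unreachable.
    static NativeResourceRef register(Object owner, Runnable deAllocator) {
        NativeResourceRef ref = new NativeResourceRef(owner, deAllocator);
        LIVE.add(ref);
        return ref;
    }

    // Frees the native memory at most once, whether triggered by an explicit
    // dispose-style call or by the cleanup path.
    void release() {
        if (LIVE.remove(this)) {
            deAllocator.run();
        }
    }

    // Non-blocking drain of everything the GC has enqueued so far;
    // returns the number of references taken off the queue.
    static int drainQueue() {
        int drained = 0;
        Reference<?> ref;
        while ((ref = QUEUE.poll()) != null) {
            ((NativeResourceRef) ref).release();
            drained++;
        }
        return drained;
    }

    // Daemon thread that blocks on ReferenceQueue.remove() and releases
    // native memory as references arrive.
    static Thread startCleanupThread() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    ((NativeResourceRef) QUEUE.remove()).release();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "native-resource-cleanup");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Keeping release() idempotent is one way to let an explicit dispose() and the GC-driven cleanup coexist without freeing the same pointer twice.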

The problem with this approach is that native Objects are released only after Garbage Collection has determined that the Object is not reachable. For garbage collection to run, the GC subsystem has to feel pressure on the JVM heap. However, in MXNet Scala most MXNet objects are allocated in native memory, so GC does not run as frequently as we would like. There are three options for dealing with this:

1) Call System.gc() periodically.

2) Call System.gc() after a threshold of off-heap bytes is used.

3) Let the garbage collector run on its own schedule, based on the JVM implementation the code runs on.

Note: Though there is no guarantee that invoking System.gc() will force Garbage Collection to run, my experiments on OpenJDK 8 show that it is effective and helps release native memory.

It should also be noted that calling System.gc() is expensive, so calling it repeatedly is not recommended.
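As an illustration of option 2, the sketch below counts off-heap bytes as they are allocated and requests a collection once a threshold is crossed. OffHeapGcTrigger and its methods are hypothetical names chosen for this example; the GC request is injected as a Runnable so the policy can be exercised without actually calling System.gc().

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: requests a GC after a threshold of off-heap bytes.
final class OffHeapGcTrigger {
    private final long thresholdBytes;
    private final Runnable gcRequest;
    private final AtomicLong bytesSinceLastGc = new AtomicLong();

    OffHeapGcTrigger(long thresholdBytes, Runnable gcRequest) {
        this.thresholdBytes = thresholdBytes;
        this.gcRequest = gcRequest;
    }

    // Convenience factory that uses System.gc(), which is only a hint to the JVM.
    static OffHeapGcTrigger usingSystemGc(long thresholdBytes) {
        return new OffHeapGcTrigger(thresholdBytes, System::gc);
    }

    // Record an allocation; returns true if this call requested a GC.
    boolean onAllocate(long bytes) {
        long total = bytesSinceLastGc.addAndGet(bytes);
        // The compare-and-set ensures only one thread resets the counter
        // and issues the request for a given crossing of the threshold.
        if (total >= thresholdBytes && bytesSinceLastGc.compareAndSet(total, 0)) {
            gcRequest.run();
            return true;
        }
        return false;
    }
}
```

A higher threshold means fewer System.gc() calls at the cost of more native memory held between collections; where to set it would have to come from the performance tests.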

Android manages native C++ memory using the PhantomReference approach. The Google I/O '17 video listed in the References details how to use PhantomReference and the issues with Finalizers.

ResourceScope

We can create a ResourceScope class that implements the AutoCloseable interface. This class provides a static method, using, which is similar to try-with-resources and takes the block of code to run as an additional parameter. Any NativeResource created within the block scope registers with the ResourceScope instance. Upon exit of the block scope, the using method calls close on the ResourceScope, which releases all the native resources.

Scala users can call the static using method to execute a block scope; at the end of the scope the stack of Objects is released.
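A minimal sketch of such a scope, written against plain JDK types, is shown here. It is named ScopeSketch to make clear that it is not the proposed ResourceScope implementation; the ThreadLocal stack, the register method and the Supplier-based using signature are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a block scope that releases everything registered in it.
final class ScopeSketch implements AutoCloseable {
    // One stack of open scopes per thread; the innermost scope is on top.
    private static final ThreadLocal<Deque<ScopeSketch>> SCOPES =
            ThreadLocal.withInitial(ArrayDeque::new);

    private final List<AutoCloseable> resources = new ArrayList<>();

    ScopeSketch() {
        SCOPES.get().push(this);
    }

    // Would be called from the constructor of each native resource.
    // Does nothing when no scope is open on the calling thread.
    static void register(AutoCloseable resource) {
        ScopeSketch current = SCOPES.get().peek();
        if (current != null) {
            current.resources.add(resource);
        }
    }

    // Releases resources in reverse creation order, continuing past failures.
    @Override
    public void close() {
        SCOPES.get().remove(this);
        RuntimeException failure = null;
        for (int i = resources.size() - 1; i >= 0; i--) {
            try {
                resources.get(i).close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = new RuntimeException(e);
                }
            }
        }
        resources.clear();
        if (failure != null) {
            throw failure;
        }
    }

    // Runs the body inside the scope and closes the scope afterwards,
    // mirroring try-with-resources for callers that pass a block.
    static <T> T using(ScopeSketch scope, Supplier<T> body) {
        try (ScopeSketch s = scope) {
            return body.get();
        }
    }
}
```

Because registration looks only at the calling thread's stack, a resource created on a different thread than the one that opened the scope is never tracked by it.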

One problem I can see with this approach is misuse of ResourceScope. Instead of using a ResourceScope at a granular scope, the user could wrap a higher-level function and end up holding on to the allocated memory for much longer, or until an OutOfMemory exception is thrown. As an example, consider passing the entire training method, which runs for hundreds of epochs, as the block to execute within a single ResourceScope, instead of using a ResourceScope per epoch or creating separate ResourceScopes for data pre-processing, training and post-processing.

This can be alleviated by tracking WeakReferences to NativeResources and leveraging the GC to find EOLed Objects + PhantomRef approach discussed above. (Needs to be tested)

Another problem is that since this approach uses ThreadLocal to manage scopes, it is not suitable for producer-consumer kind of applications where the producer might not wait until the consumer has used the NativeResource object.
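The ThreadLocal limitation can be seen in isolation with a few lines of JDK code. In this sketch a plain String stands in for the current scope, and ThreadLocalScopeDemo is a hypothetical name; the point is only that a value set on the producer thread is invisible to a worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo: a scope stored in a ThreadLocal is visible only to its own thread.
final class ThreadLocalScopeDemo {
    private static final ThreadLocal<String> CURRENT_SCOPE = new ThreadLocal<>();

    // Returns { scope seen by the calling thread, scope seen by a worker thread }.
    static String[] callerAndWorkerViews() {
        CURRENT_SCOPE.set("producer-scope");
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<String> readScope = CURRENT_SCOPE::get;
            String workerView = worker.submit(readScope).get();
            return new String[] { CURRENT_SCOPE.get(), workerView };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
            CURRENT_SCOPE.remove();
        }
    }
}
```

The worker sees no scope at all, so anything it allocates goes untracked, while objects handed to it by the producer can be released as soon as the producer's scope closes.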

Test Results of Prototype

Long running MNIST example with current code

[Figure: 1.png]

Long running MNIST example, System.gc() called every epoch

[Figure: 2.png]

Long running MNIST example, System.gc() called every 5 seconds

[Figure: 3.png]

Long running MNIST example, System.gc() not explicitly called

[Figure: 4.png]

Long running MNIST example on GPU

[Figure: 5.png]

Running GAN example on GPU, System.gc() called every second

[Figure: 7.png]

Addition of New APIs

None

Backward compatibility

Yes. Existing code will continue to work with the WarnIfNotDisposed interface.

Performance Considerations

Run MNIST training and measure the average time per epoch with the current code, which uses dispose.
Measure the time taken when ResourceScope is used for each epoch.
Measure the time taken when System.gc() is called periodically.
Measure the time taken when System.gc() is called once maxOffHeapBytes is reached.
Run inference using ResourceScope and test performance.

Test Plan

Run from Master branch(without any changes) and see how it performs - use it as a baseline
Run Tests calling System.gc periodically
Run tests on JDK7 & JDK8 environment
Run tests on OSX
Run tests using different memory consuming examples(training using large Images, GAN, RNN)
Run inside a container with limited memory
Run tests for ResourceScope and check for Memory Stability.

Alternative Approaches

Earlier versions of MXNet-Scala used WeakReferences and Finalizers to release NativeResources. However, this caused segfaults because the MXNet backend required all calls to go through the same thread, while the Finalizer runs on its own thread.

Finalizer with Dispose using Dispatcher Pattern.

I tried running a Finalizer that calls dispose through a Dispatcher pattern. As I researched and learnt more about Finalizers, it became clear that this approach is not recommended.

Joshua Bloch (author of Effective Java) describes the perils of using Finalizers in "Item 7: Avoid finalizers", listed in the References.

A few key points related to finalizers:

Finalizers are deprecated as of JDK 9.
Finalizers are not guaranteed to run immediately; they run later on a separate thread.
Finalizers run in arbitrary order. When two objects become unreachable together and the second depends on the first, this can corrupt both the C++ heap and the Java heap (JVM runtime).
Finalizers sometimes extend the lifetime of objects.
The underlying native resource might be disposed at an arbitrary time while it is still used by other objects.

AutoCloseable and using try-with-resources{}

We can provide an interface similar to a Java File for each MXNet Object by implementing the AutoCloseable interface, so that users can apply the familiar try-with-resources approach and treat MXNet Objects like IO resources. However, this would become tedious because users would have to declare every object beforehand.

Object Pooling

We can implement an object pool for native objects such as NDArrays, so that objects return to the pool when they go out of scope. This could be an extension to the proposed approach and is probably useful for use cases such as inference, where the sizes of NDArrays do not change between runs.
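A rough sketch of the pooling idea follows, under stated assumptions: float[] buffers stand in for native NDArrays, buffers are keyed by element count, and BufferPool with its acquire/release methods is a hypothetical name rather than anything in MXNet.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pool: reuses same-sized buffers instead of reallocating them.
final class BufferPool {
    private final Map<Integer, Deque<float[]>> free = new HashMap<>();

    // Hands out a pooled buffer of the requested size, or allocates a new one.
    // A reused buffer still holds its previous contents.
    synchronized float[] acquire(int size) {
        Deque<float[]> bucket = free.get(size);
        float[] buffer = (bucket == null) ? null : bucket.poll();
        return (buffer != null) ? buffer : new float[size];
    }

    // Returns a buffer to the pool so a later acquire of the same size can reuse it.
    synchronized void release(float[] buffer) {
        free.computeIfAbsent(buffer.length, k -> new ArrayDeque<>()).push(buffer);
    }
}
```

This only pays off when the same sizes recur, which matches the inference case where shapes stay fixed between runs.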

Milestones

1. Implement NativeResource, extend NDArray, Symbol, Executor to be compatible with NativeResource.

2. Implement ResourceScope

3. Implement GCStrategy

4. Performance Tests using different GCStrategies.

5. Add Stress Test (long running tests).

References

Item 7: Avoid finalizers: http://www.informit.com/articles/article.aspx?p=1216151&seqNum=7
How to Manage Native C++ Memory in Android (Google I/O '17): https://www.youtube.com/watch?v=7_caITSjk1k
try-with-resources in Java7+: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
Garbage Collection(Chapter 6,7): https://www.amazon.com/Optimizing-Java-techniques-application-performance/dp/1492025798
PhantomReference: https://docs.oracle.com/javase/7/docs/api/java/lang/ref/PhantomReference.html
JavaCPP's PointerScope: https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/PointerScope.java
Yizhi's NDArrayCollector: https://github.com/apache/incubator-mxnet/blob/master/scala-package/core/src/main/scala/org/apache/mxnet/NDArrayCollector.scala

Feedback received so far:

I discussed this with a few colleagues I work with (Frank Liu, Qing, Andrew, Yizhi, Calum and others), and they provided the feedback below.

Aaron Markham: The proposed approach seems to suggest there are issues with the sub-approaches; clearly call out when to use which approach.
Andrew Ayres: What happens when dispose is called within ResourceScope? - This should work and not cause any issues.
Andrew Ayres: How do we use this in Java? - TBD
Frank Liu: We will get a compiler warning in Java, since the ResourceScope declared in the try statement is never used. - This needs to be researched and resolved.
Frank Liu: Using ResourceScope will be a readability issue in deeply nested code, such as a class that creates NDArray objects. Users will be unaware that the objects are tracked in a ResourceScope and deallocated when they go out of scope, and might assume the objects can safely be used in another thread. - I can't think of a way to address this right now; I will continue to explore.
YiZhi Liu: We should keep NDArrayCollector for a while and then deprecate it after a few releases. - Agree.
YiZhi Liu: We should suggest using ResourceScope for training and probably the PhantomRef approach for inference.
Sina Afrooze: Using ResourceScope might be more natural for Deep Learning applications than periodically calling System.gc().
Rakesh: Use a real-life model instead of MNIST for testing.
Calum Leslie: It might be intrusive to call System.gc() on the user's behalf. - Mitigated through user-controllable properties.

Glossary:

StrongReference: An ordinary reference that keeps an object reachable, such as val nd = NDArray.ones(Shape(2,2)); here nd holds a strong reference to the NDArray created.

WeakReference: A weak reference is a reference to an object that does not prevent the garbage collector from collecting it.

PhantomReference: A phantom reference object is enqueued after the collector determines that its referent may otherwise be reclaimed.
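The practical difference between the reference kinds can be checked with the JDK classes directly. The class and method names below are made up for this illustration; the java.lang.ref calls are standard.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Hypothetical demo class contrasting weak and phantom references.
final class ReferenceKinds {
    // While a strong reference exists, a WeakReference still hands back the object.
    static boolean weakReturnsLiveReferent() {
        Object strong = new Object();
        WeakReference<Object> weak = new WeakReference<>(strong);
        return weak.get() == strong;
    }

    // A PhantomReference never exposes its referent: get() always returns null,
    // so it can only be used for notification through its ReferenceQueue.
    static boolean phantomNeverReturnsReferent() {
        Object strong = new Object();
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomReference<Object> phantom = new PhantomReference<>(strong, queue);
        return phantom.get() == null;
    }
}
```

Since a phantom reference cannot resurrect the object, whatever is needed for cleanup (such as the native address) has to be stored alongside the reference rather than read from the referent.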