Tracking memory

Backend memory is tracked via a hierarchy of MemTrackers. This allows enforcement of memory limits for queries, resource pools and processes. It also allows reporting on memory consumption.
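
As a rough illustration of how a tracker hierarchy can enforce limits at several levels at once, the sketch below uses a hypothetical SimpleMemTracker class. It is not Impala's MemTracker; the class name, TryConsume(), Release() and the convention that a negative limit means "no limit" are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical, simplified tracker used only for illustration.
// It is NOT Impala's MemTracker and omits thread safety and reporting.
class SimpleMemTracker {
 public:
  // A negative 'limit' means no limit is enforced at this level (assumption).
  SimpleMemTracker(std::string label, int64_t limit,
                   SimpleMemTracker* parent = nullptr)
      : label_(std::move(label)), limit_(limit), parent_(parent) {}

  // Accounts for 'bytes' at this tracker and every ancestor. Returns false
  // and leaves all counters unchanged if any limit would be exceeded.
  bool TryConsume(int64_t bytes) {
    for (SimpleMemTracker* t = this; t != nullptr; t = t->parent_) {
      if (t->limit_ >= 0 && t->consumption_ + bytes > t->limit_) return false;
    }
    for (SimpleMemTracker* t = this; t != nullptr; t = t->parent_) {
      t->consumption_ += bytes;
    }
    return true;
  }

  // Returns 'bytes' to this tracker and every ancestor.
  void Release(int64_t bytes) {
    for (SimpleMemTracker* t = this; t != nullptr; t = t->parent_) {
      assert(t->consumption_ >= bytes);
      t->consumption_ -= bytes;
    }
  }

  int64_t consumption() const { return consumption_; }
  const std::string& label() const { return label_; }

 private:
  const std::string label_;
  const int64_t limit_;
  SimpleMemTracker* const parent_;
  int64_t consumption_ = 0;
};
```

With a process tracker as the root, a resource pool tracker as its child and a query tracker below that, a single TryConsume() call on the query tracker checks the query, pool and process limits together, and each level's consumption() can be reported separately.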

Not all memory is tracked. In particular, small control data structures (e.g. ExecNodes) are generally not tracked: their contribution to query memory consumption is minimal, and with the current tools for memory management the overhead of tracking them, both in code and in runtime bookkeeping, exceeds the benefit of accounting for that small additional amount of memory.

Memory should be tracked if it has the potential to add up to a significant amount per query (i.e. at least tens of kilobytes).

In some cases we currently do not track memory that should be tracked. E.g. we don't track LLVM data structure memory, in part because it is allocated directly from malloc() in LLVM code instead of from within Impala's code.

Data structures and resources

We should aim to logically separate the release of resources from the teardown of data structures in the code.

Release of resources (e.g. memory, CPU threads) should happen explicitly in a method like Close() or ReleaseResources() so that it's easy to understand where the resources will be released. I.e. we should avoid releasing resources implicitly in destructors, and particularly avoid releasing resources in destructors of shared_ptr-managed objects. It is a good practice in most situations to enforce that resources were released by adding a DCHECK in the destructor.
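
A minimal sketch of this pattern follows. The BufferedScanner class and its members are hypothetical, and the standard assert() stands in for DCHECK so that the example compiles without any logging library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical class that owns a heap buffer; names are illustrative only.
class BufferedScanner {
 public:
  BufferedScanner() = default;
  BufferedScanner(const BufferedScanner&) = delete;
  BufferedScanner& operator=(const BufferedScanner&) = delete;

  ~BufferedScanner() {
    // Stand-in for a DCHECK: the destructor releases nothing itself and
    // only verifies (in debug builds) that Close() was called first.
    assert(closed_ && "Close() must be called before destruction");
  }

  // Acquires the resource. Returns false if the allocation fails.
  bool Open(size_t bytes) {
    buffer_ = static_cast<char*>(std::malloc(bytes));
    return buffer_ != nullptr;
  }

  // Explicitly releases the resource. Safe to call more than once.
  void Close() {
    if (closed_) return;
    std::free(buffer_);
    buffer_ = nullptr;
    closed_ = true;
  }

  bool closed() const { return closed_; }

 private:
  char* buffer_ = nullptr;
  bool closed_ = false;
};
```

Because the release happens only in Close(), a reader can find the release point by searching for that call, and a forgotten Close() shows up as a failed check in debug builds rather than as a silent release at some hard-to-predict destruction time.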

Control data structures (e.g. MemTrackers, FragmentInstanceStates, RuntimeStates) do not need to be torn down at the same time as resources are released. Generally we want to tear down all control structures with the same scope at the same time. E.g. all control data structures that have query scope should be torn down along with the QueryState. This reduces the number of edge cases in our code that come from dealing with partially torn down control structures. This can be implemented using scoped_ptr/unique_ptr or by putting the objects in an ObjectPool.
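
The separation between releasing resources and tearing down control structures could look roughly like the sketch below. QueryScope and FragmentState are hypothetical stand-ins (not Impala's QueryState or FragmentInstanceState), and the live_count pointer exists only so the example can observe when objects are destroyed.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical control structure with query scope.
class FragmentState {
 public:
  explicit FragmentState(int* live_count)
      : live_count_(live_count), buffer_(1024) {
    ++*live_count_;
  }
  ~FragmentState() {
    assert(released_);  // stand-in for a DCHECK
    --*live_count_;
  }

  // Frees the buffer; the object itself stays valid afterwards.
  void ReleaseResources() {
    std::vector<char>().swap(buffer_);
    released_ = true;
  }
  bool released() const { return released_; }

 private:
  int* const live_count_;
  std::vector<char> buffer_;
  bool released_ = false;
};

// Hypothetical owner of everything that has query scope.
class QueryScope {
 public:
  // The returned pointer stays valid until the QueryScope is destroyed.
  FragmentState* AddFragment(int* live_count) {
    fragments_.push_back(
        std::unique_ptr<FragmentState>(new FragmentState(live_count)));
    return fragments_.back().get();
  }

  // Releases resources early; control structures are left in place.
  void ReleaseResources() {
    for (auto& f : fragments_) f->ReleaseResources();
  }

 private:
  std::vector<std::unique_ptr<FragmentState>> fragments_;
};
```

After ReleaseResources() every FragmentState pointer handed out earlier is still safe to dereference, and all of them are destroyed together when the QueryScope goes away, so no caller has to reason about a partially torn down set of objects.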

Smart pointers and ObjectPools

Currently the Impala code uses a mix of utility classes to automatically free data structures.

  • boost::scoped_ptr: used for objects that are exclusively owned by one object and can never be moved. Going forward, std::unique_ptr is preferred for consistency with standard C++.
  • std::unique_ptr: used for objects that are exclusively owned by one object. Supports additional functionality over boost::scoped_ptr: allows moving the pointer and allows use in containers like std::vector.
  • std::shared_ptr: used if ownership is shared. Avoid it where possible - prefer exclusive ownership and explicit lifetimes.
  • ObjectPool: a pool of objects that are exclusively owned by one object. Used as an alternative to unique_ptr/scoped_ptr. Sometimes useful if a variable number of objects are allocated with the same scope. Be careful that the scope of the object pool matches the scope of the object you allocate. E.g. adding objects to the RuntimeState's object pool can lead to the query leaking memory during execution.
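
To make the ObjectPool idea and its scoping pitfall concrete, here is a minimal sketch of a pool. SimpleObjectPool and Tracked are hypothetical names, and this is not Impala's ObjectPool implementation; it only shows the general technique of handing ownership of a variable number of objects to one owner with a single scope.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical minimal pool used only for illustration.
class SimpleObjectPool {
 public:
  SimpleObjectPool() = default;
  SimpleObjectPool(const SimpleObjectPool&) = delete;
  SimpleObjectPool& operator=(const SimpleObjectPool&) = delete;
  ~SimpleObjectPool() { Clear(); }

  // Takes ownership of 'obj' and returns it for convenience.
  template <typename T>
  T* Add(T* obj) {
    deleters_.push_back([obj]() { delete obj; });
    return obj;
  }

  // Destroys all pooled objects in reverse order of insertion.
  void Clear() {
    for (auto it = deleters_.rbegin(); it != deleters_.rend(); ++it) (*it)();
    deleters_.clear();
  }

  size_t size() const { return deleters_.size(); }

 private:
  std::vector<std::function<void()>> deleters_;
};

// Hypothetical object that counts live instances so the effect is visible.
struct Tracked {
  explicit Tracked(int* live_count) : live_count_(live_count) {
    ++*live_count_;
  }
  ~Tracked() { --*live_count_; }
  int* live_count_;
};
```

Nothing added to the pool is freed until Clear() runs or the pool is destroyed. If per-batch or per-row objects are added to a pool that lives as long as the query, the pool keeps growing for the whole execution, which is the leak-like behaviour the ObjectPool bullet warns about; a pool whose lifetime matches the objects avoids it.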

5 Comments

  1. Tim Armstrong - Should we recommend using const std::unique_ptr instead of boost::scoped_ptr to prevent transfer of ownership?

    1. Is that equivalent? Wouldn't that prevent modifying the unique_ptr value in place?

      1. You are right, that will make the pointer immutable altogether, i.e. prevent calling reset().

  2. I feel that in many places in the codebase std::unique_ptr/boost::scoped_ptr is used only to get a "nullable" object. Should we recommend boost::optional in these cases? It saves a malloc.

    1. That's definitely worth considering on a case-by-case basis. One major advantage of using pointers for member variables in a large project is that we only need to forward-declare the class to declare a pointer. If we need to have the full class declaration visible it can result in huge numbers of headers being transitively pulled in, hurting compile times a lot.