Debugging in CloudStack using logs is cumbersome. To ease tracking a command through the logs (and up to the resource layer), the concept of context IDs and sequence numbers was introduced.
The initial idea behind context IDs was to create a unique ID for every thread in the management server and use the NDC (stack) model to push new contexts onto the stack. It was assumed that every touch point - API calls, new threads, async jobs - would correctly push and pop contexts, easing the tracking of an API call.
After discussion, it was realized that this has not fully materialized. Instead of modifying the existing NDC codebase, it was suggested that we use the MDC (hash) model and pass the same logContextID up to the resource layer.
Using MDC has the additional advantage that, being a hash, one can add extra information that eases searching the logs.
Architecture and design description
- Being a key-value pair object, it is easier to manipulate
- More extensible than the NDC stack, allowing more information to be added in future as needed for better analysis (resource ID, management server ID, agent ID, host ID from agents, etc.)
- MDC is designed to be lightweight, hence the context hashmap will only contain String (key, value) pairs
- Note that MDC is not a mechanism for passing method parameters between methods or threads. The implementation should not be used to pass parameters either, and no consistency guarantees are made for data added to the MDC
- The MDC design assumes that inserts and deletes to the MDC are infrequent, and we need to respect this design
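The points above can be sketched with a minimal, stdlib-only LogContext that mirrors log4j's MDC contract: a per-thread map restricted to String key-value pairs. The class and method names below are hypothetical illustrations, not the actual CloudStack implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a LogContext mirroring log4j's MDC contract:
// a per-thread map of String keys to String values, kept deliberately small.
public class LogContext {
    private static final ThreadLocal<Map<String, String>> CTX =
            ThreadLocal.withInitial(HashMap::new);

    public static void put(String key, String value) {
        CTX.get().put(key, value);
    }

    public static String get(String key) {
        return CTX.get().get(key);
    }

    // Snapshot of the current context, e.g. for handing to a child thread.
    public static Map<String, String> copy() {
        return new HashMap<>(CTX.get());
    }

    public static void clear() {
        CTX.get().clear();
    }
}
```

Because the map holds only Strings and is touched rarely (put at API entry, read at log time), it stays lightweight, matching the design assumptions listed above.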
To incorporate the MDC model
- Create a new LogContext and LogContextListener to manage the semantics around MDC
- Modify CloudStack log4j files to provide MDC info in each log message
- Generate new MDC logContext on every API call
- The contextID should be returned in the API response (so that a user can grep the logs for this ID)
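A sketch of the API-entry step: generate a fresh logContextID per call and echo it back to the caller. The hook name and MDC key are assumptions for illustration; in log4j the value would be stored with `MDC.put(...)` and emitted via a `%X{logcontextid}` conversion in the PatternLayout.

```java
import java.util.UUID;

// Hypothetical API-entry hook (not the actual CloudStack code).
public class ApiLogContext {
    public static String onApiCall() {
        String id = UUID.randomUUID().toString();
        // In the real implementation this would be:
        //   MDC.put("logcontextid", id);
        // with a log4j layout pattern containing %X{logcontextid}
        // so every log line carries the ID.
        return id;  // echoed back in the API response for grepping
    }
}
```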
To ensure propagation
- Every new thread inherits the MDC logContextID from its calling parent. If none is available, a new logContext is generated
- All system threads will use a systemLogContext created on start-up
- For async jobs, save the context information when the job info is serialized and restore it when the job returns. Pass the logContextID to the async job
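The parent-to-child propagation rule above can be sketched by snapshotting the parent's context map when a task is submitted and installing it in the child thread, generating a fresh ID when the parent had none. All names here are illustrative, not the CloudStack implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of MDC-style context propagation to child threads.
public class ContextPropagation {
    static final ThreadLocal<Map<String, String>> CTX =
            ThreadLocal.withInitial(HashMap::new);

    // Wrap a task so the child thread sees the parent's context,
    // or a freshly generated one if the parent had none.
    static Runnable wrap(Runnable task) {
        final Map<String, String> parentCtx = new HashMap<>(CTX.get());
        return () -> {
            if (parentCtx.isEmpty()) {
                parentCtx.put("logcontextid", UUID.randomUUID().toString());
            }
            CTX.set(parentCtx);
            try {
                task.run();
            } finally {
                CTX.remove();  // avoid leaking context on pooled threads
            }
        };
    }
}
```

The `finally` cleanup matters for thread pools, where a leaked context would wrongly tag a later, unrelated task's log lines.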
All APIs, sync or async, will return the logid when invoked.
The async_job table will have an additional column to store the MDC hash (as of now only the logid), which will be used to propagate the information when a job is dispatched to the resource layer. Additionally, it will be used to repopulate the MDC when the dispatched job returns, so that the job can be tracked again on the management server side.
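Persisting the MDC hash in a table column implies serializing it to a String and restoring it on return. The simple `key=value;...` encoding below is purely an assumption for illustration (it requires that values contain no `;` or `=`); the actual column format is not specified here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical round-trip of the MDC hash through the async_job column.
public class MdcSerializer {
    // Encode the context map as "k1=v1;k2=v2" for storage in the column.
    static String serialize(Map<String, String> ctx) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : ctx.entrySet()) {
            if (sb.length() > 0) sb.append(';');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Rebuild the context map when the dispatched job returns.
    static Map<String, String> deserialize(String column) {
        Map<String, String> ctx = new HashMap<>();
        if (column == null || column.isEmpty()) return ctx;
        for (String pair : column.split(";")) {
            String[] kv = pair.split("=", 2);
            ctx.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return ctx;
    }
}
```

On dispatch, the serialized hash travels with the job record; on completion, `deserialize` repopulates the MDC so the same logid appears in the management server's log lines again.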