MDC logging for CloudStack

Introduction

Debugging CloudStack through its logs is cumbersome. To make it easier to track a command through the logs (up to the resource layer), the concepts of a contextID and sequence numbers were introduced.

The initial idea behind contextIDs was to create a unique ID for every thread in the management server and to use the NDC (stack) model for pushing new contexts onto the stack. It was assumed that every touch point (API, new threads, async jobs) would correctly push and pop contexts, making an API call easy to track.

After discussion, it became clear that this had not fully materialized. Instead of modifying the existing NDC code base, it was suggested that we use the MDC (hash) model and pass the same logContextID up to the resource layer.

Using MDC has the additional advantage that, being a hash, it can hold extra information that makes searching the logs easier.
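
The difference between the two models can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins written only for illustration; they are not the log4j NDC and MDC classes, and the key names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the NDC (stack) model: every touch point must
// push on entry and pop on exit, in the right order.
final class StackContextSketch {
    private static final ThreadLocal<Deque<String>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void push(String context) { STACK.get().push(context); }
    static String pop() { return STACK.get().poll(); } // null when the stack is empty
    static String peek() { return STACK.get().peek(); }
}

// Hypothetical stand-in for the MDC (hash) model: named String entries that
// can be set, read and removed independently of each other.
final class MapContextSketch {
    private static final ThreadLocal<Map<String, String>> MAP =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { MAP.get().put(key, value); }
    static String get(String key) { return MAP.get().get(key); }
    static void remove(String key) { MAP.get().remove(key); }
}
```

With the map model, an extra entry such as a host ID can be added later without disturbing the entry that holds the logContextID, whereas the stack model depends on every caller pairing its push with a pop.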

References

http://wiki.apache.org/logging-log4j/NDCvsMDC
Using CallContext (CloudStack)
MDC model from https://github.com/cattleio/cattle

Architecture and design description

Why MDC

Being a set of key-value pairs, MDC is easier to manipulate
More extensible than the NDC stack: more information can be added in future as needed for better analysis (resource ID, management server ID, agent ID, host ID from agents, etc.)

Constraints Imposed

MDC is designed to be lightweight, hence the context hashmap will only contain String (key, value) pairs
Note that MDC is not a mechanism for passing method parameters between methods or threads. The implementation should not be used to pass parameters either, and no guarantees are made about consistency if such data is added to the MDC
MDC design assumes inserts / deletes to MDC are not high frequency, and we need to respect this design

To incorporate MDC model

Create new LogContext and LogContextListener to manage semantics around MDC
Modify CloudStack log4j files to provide MDC info in each log message
Generate new MDC logContext on every API call
The contextID should be returned in the API call (so that a user can grep through the logs using this ID)
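
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CloudStack implementation: the class names, the "logcontextid" key and the response shape are hypothetical, and a plain thread-local map stands in for the logging library's MDC. With log4j, the map would be the library's MDC and the layout's `%X{key}` conversion pattern would print the value in each log message; the `format` method below only imitates that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical LogContext: owns the per-thread context map.
final class LogContextSketch {
    static final String LOG_CONTEXT_ID = "logcontextid"; // assumed key name
    private static final ThreadLocal<Map<String, String>> CURRENT =
            ThreadLocal.withInitial(HashMap::new);

    // Start a fresh context for an incoming API call and return its ID.
    static String start() {
        Map<String, String> context = CURRENT.get();
        context.clear();
        String id = UUID.randomUUID().toString();
        context.put(LOG_CONTEXT_ID, id);
        return id;
    }

    static String currentId() { return CURRENT.get().get(LOG_CONTEXT_ID); }

    static void clear() { CURRENT.get().clear(); }

    // Imitates a log layout that prefixes every message with the context ID.
    static String format(String message) {
        return "[" + LOG_CONTEXT_ID + "=" + currentId() + "] " + message;
    }
}

// Hypothetical API response carrying the ID back to the caller.
final class ApiResponseSketch {
    final String logid;
    final String body;

    ApiResponseSketch(String logid, String body) {
        this.logid = logid;
        this.body = body;
    }
}

final class ApiDispatcherSketch {
    static ApiResponseSketch handle(String command) {
        String id = LogContextSketch.start(); // new context on every API call
        try {
            System.out.println(LogContextSketch.format("executing " + command));
            return new ApiResponseSketch(id, "executed " + command);
        } finally {
            LogContextSketch.clear(); // do not leak the context to the next request
        }
    }
}
```

A caller that receives the `logid` field can then search the management server logs for that value to find every message written while the call was processed.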

To ensure propagation

Every new thread incorporates MDC logContextID from its calling parent. If none is available, a new logContext will be generated
All system threads will use a systemLogContext to be created on start-up
For async jobs, save the context information when job info is serialized and retrieve when the job returns. Pass the logContextID to async job
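
A rough sketch of the parent-to-child propagation rule follows. All names are hypothetical and a thread-local map again stands in for the MDC. The sketch covers only the inheritance rule for new threads (copy the parent's context, or generate a new one if none exists); it does not model the systemLogContext or async jobs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class PropagatingContextSketch {
    static final String LOG_CONTEXT_ID = "logcontextid"; // assumed key name
    private static final ThreadLocal<Map<String, String>> CURRENT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CURRENT.get().put(key, value); }
    static String id() { return CURRENT.get().get(LOG_CONTEXT_ID); }
    static void clear() { CURRENT.remove(); }

    // Copy the parent's entries into the current thread; if the parent had no
    // logContextID, generate a new one.
    private static void install(Map<String, String> parent) {
        Map<String, String> context = CURRENT.get();
        context.clear();
        context.putAll(parent);
        if (!context.containsKey(LOG_CONTEXT_ID)) {
            context.put(LOG_CONTEXT_ID, UUID.randomUUID().toString());
        }
    }

    // Must be called on the parent thread: the snapshot is taken here, and the
    // returned Runnable installs it on whichever thread eventually runs it.
    static Runnable wrap(Runnable task) {
        final Map<String, String> parent = new HashMap<>(CURRENT.get());
        return () -> {
            install(parent);
            try {
                task.run();
            } finally {
                CURRENT.remove();
            }
        };
    }

    // Convenience for the example: run the task on a new thread and wait for it.
    static void runAndWait(Runnable task) {
        Thread child = new Thread(wrap(task));
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```

Taking the snapshot at wrap time rather than at run time matters when tasks are handed to a thread pool, because the worker thread that runs the task is usually not a direct child of the submitting thread.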

API changes

All APIs, sync or async, to return logid when invoked

UI changes

None

DB changes

The async_job table gets an additional column to store the MDC hash (as of now only logid), which is used to propagate information when the job is dispatched to the resource layer. The column is also used to repopulate the MDC when the dispatched job returns, so that the job can be tracked again on the management server side
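
One possible way to store the MDC hash in a single text column is sketched below. The class name and the `key=value&key=value` encoding are assumptions made for illustration; the design above does not prescribe a storage format. Keys and values are assumed to be non-null Strings, in line with the constraints listed earlier.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical codec for the extra async_job column.
final class AsyncJobContextCodec {
    private static final String UTF8 = "UTF-8";

    // Serialize the context when the job is persisted. Entries are sorted by
    // key so the stored value is deterministic.
    static String encode(Map<String, String> context) {
        StringBuilder column = new StringBuilder();
        for (Map.Entry<String, String> entry : new TreeMap<>(context).entrySet()) {
            if (column.length() > 0) {
                column.append('&');
            }
            column.append(escape(entry.getKey())).append('=').append(escape(entry.getValue()));
        }
        return column.toString();
    }

    // Rebuild the context when the job is dispatched or when it returns.
    // A null or empty column (for example, a row written before the column
    // existed) yields an empty context.
    static Map<String, String> decode(String column) {
        Map<String, String> context = new HashMap<>();
        if (column == null || column.isEmpty()) {
            return context;
        }
        for (String pair : column.split("&")) {
            String[] keyValue = pair.split("=", 2);
            context.put(unescape(keyValue[0]), keyValue.length > 1 ? unescape(keyValue[1]) : "");
        }
        return context;
    }

    private static String escape(String raw) {
        try {
            return URLEncoder.encode(raw, UTF8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    private static String unescape(String escaped) {
        try {
            return URLDecoder.decode(escaped, UTF8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Escaping the separators keeps the round trip lossless even if a later MDC entry contains `=` or `&`, which leaves room for the additional fields mentioned under "Why MDC".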

Future enhancements

TBD
