Logs are one of the most important parts of any running application. The operations team uses them to make sure the application runs smoothly. Developers use the log for troubleshooting.
Below are rules aimed to improve logging in Ignite 3 through log statement unifying and providing a guidelines on how to use the logging subsystem.
Instantiation of a new Logger is depended upon particular logging backend and potentially may be a fairly expensive operation. This is why a static logger should be preferred over a non-static one for server side usage. On the other hand, client-side logging should not follow this rule as particular backend may be provided on per-client basis, hence all loggers should be instantiated with provided LoggerFactory
.
As more deployments move to public clouds, there is growing demand for a log management and analytics solution to handle any server logs. This is why it is important not only to make a log statement human readable, but also a machine readable. In addition, a unified log format eases the log analyzing and for the engineer too. Here is a log format to use within the Ignite-3:
<Short description of the event> [<key value pairs describing context of the event>]
, for example:
LOG.info("Start new raft node [group={}, peers={}]", groupId, peers) LOG.debug("Replicator added [node={}, term={}, peer={}]", getNodeId(), this.currTerm, peer);
Besides the format, it's important to keep messages short and simple yet sufficient to understand what is happened:
log.info("New leader was elected for the raft group and pending reconfiguration was discovered [group={}, partition={}, table={}, peers={}]", ...); // this one too cumbersome log.info("New leader elected. Going to reconfigure peers [group={}, partition={}, table={}, peers={}]", ...); // short and simple yet provides the same information as the first one
There is support for string substitution within log message with backend-independent anchors '{}'. Message templating should be preferred to string concatenation, because such approach looks a way cleaner while reading the code. Please take a look at example below:
log.debug("Move pages from heap to PageMemory [list=" + name() + ", bucket=" + bucket + ", pages=" + pages + ']'); VS log.debug("Move pages from heap to PageMemory [list={}, bucket={}, pages={}]", name(), bucket, pages);
Please take a look at the example below:
Restore partitions state
This line provides neither information about table nor result of restoration.
It could be improved by providing a context as well as result of the operation:
Partitions state restored [tableName=<tName>, state=<restored state or null if there is no state to restore>]
While the right choice depends on many factors, here are some advices on what each level serves for:
Operation has failed because of reasons not related to the user. Normal cluster functioning is in question (there could be auto recovery though). Attention of the maintenance team is required immediately.
Rule of thumb: ERROR
is when you need a duty engineer to wake up in the middle of the night. Remember that sometimes it’s you. ERROR is always actionable - the user must understand how to react.
Examples:
The cluster is working fine, but we are close to some kind of error. The attention of the maintenance team is required, but not immediately.
Rule of thumb: WARN
is what the team will need to review at the start of the next business day, at a high priority. WARN is always actionable - the user must understand how to react.
Examples:
A level for regular non frequent event. Should be used to log out any change of cluster state. Should be avoided for frequent operations like get or put.
Rule of thumb: INFO
is good for anything that’s infrequent and will give info to support or development teams when troubleshooting. If an event happens no more frequently than every few minutes (such as page memory checkpoint), it probably should be logged at INFO
.
Examples:
Additional information that could help to debug a problem. Could be used to log additional context to the INFO
level or events that won't be needed in a normal situation.
Examples:
INFO
level Rule of thumb: TRACE
is similar to DEBUG
but used for stuff that can be really frequent (like a table insert). Usually you don't want this level until you are 100% sure you need this.