Child pages
  • Kafka Error Handling and Logging
Skip to end of metadata
Go to start of metadata

Background

Today we observe two common scenarios in our logging:

1. In some places, we creates too much not-so-useful logging content which pollute the logging files. For example, below is a snippet of the server log under normal operations (setting log4j level to INFO):

2014/08/24 00:00:12.145 INFO [Processor] [kafka-network-thread-10251-1] [kafka-server] [] Closing socket connection to /172.17.198.21.
2014/08/24 00:00:12.152 INFO [Processor] [kafka-network-thread-10251-6] [kafka-server] [] Closing socket connection to /172.24.25.165.
2014/08/24 00:00:12.191 INFO [Processor] [kafka-network-thread-10251-4] [kafka-server] [] Closing socket connection to /172.17.198.24.
2014/08/24 00:00:12.209 INFO [Processor] [kafka-network-thread-10251-4] [kafka-server] [] Closing socket connection to /172.17.232.135.
2014/08/24 00:00:12.218 INFO [Processor] [kafka-network-thread-10251-6] [kafka-server] [] Closing socket connection to /172.17.198.41.
2014/08/24 00:00:12.235 INFO [Processor] [kafka-network-thread-10251-5] [kafka-server] [] Closing socket connection to /172.17.198.30.
2014/08/24 00:00:12.326 INFO [Processor] [kafka-network-thread-10251-1] [kafka-server] [] Closing socket connection to /172.17.198.27.
2014/08/24 00:00:12.362 INFO [Processor] [kafka-network-thread-10251-2] [kafka-server] [] Closing socket connection to /172.17.198.23.
2014/08/24 00:00:12.455 INFO [Processor] [kafka-network-thread-10251-5] [kafka-server] [] Closing socket connection to /172.17.198.94.
2014/08/24 00:00:12.488 INFO [Processor] [kafka-network-thread-10251-3] [kafka-server] [] Closing socket connection to /172.17.198.20.

 

This is from the socket server, which logs closing a socket each time as INFO. However, since most clients use a temporary socket to refresh their metadata, this result in tons of closing socket INFO entries which are useless most of the time.

 

Another example is that when there is a network connection issue causing clients / follower-replicas not able to connect to servers / leader replicas, an ERROR / WARN log entry will be recorded. But since most of the clients will retry frequently upon failures, and usually log4j will be configured to at least include ERROR / WARN entries, this can cause the clients logs to be swamped with ERROR / WARN along with stack traces:

ERROR kafka.producer.SyncProducer  - Producer connection to localhost:1025 unsuccessful
java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:639)
    at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
    at kafka.producer.SyncProducer.connect(SyncProducer.scala:146)
    at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:161)
    at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)
    at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
    at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
    at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
    at kafka.producer.async.DefaultEventHandler$$anonfun$handle$1.apply$mcV$sp(DefaultEventHandler.scala:69)
    at kafka.utils.Utils$.swallow(Utils.scala:186)
    at kafka.utils.Logging$class.swallowError(Logging.scala:105)
    at kafka.utils.Utils$.swallowError(Utils.scala:45)
    at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:69)
    at kafka.producer.Producer.send(Producer.scala:74)
    at kafka.javaapi.producer.Producer.send(Producer.scala:32)
    ...
WARN kafka.client.ClientUtils$  - Fetching topic metadata with correlation id 0 for topics [Set(test-topic)] from broker [id:0,host:localhost,port:1025] failed
java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:639)
    at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
    at kafka.producer.SyncProducer.connect(SyncProducer.scala:146)
    at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:161)
    at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)
    at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
    at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
    at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
    at kafka.producer.async.DefaultEventHandler$$anonfun$handle$1.apply$mcV$sp(DefaultEventHandler.scala:69)
    at kafka.utils.Utils$.swallow(Utils.scala:186)
    at kafka.utils.Logging$class.swallowError(Logging.scala:105)
    at kafka.utils.Utils$.swallowError(Utils.scala:45)
    at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:69)
    at kafka.producer.Producer.send(Producer.scala:74)
    at kafka.javaapi.producer.Producer.send(Producer.scala:32)
    ...

 

2. On the other hand, some important logging information is missing, which makes debugging / trouble shooting much more difficult than it should be, for example:

[2014-08-11 15:09:56,078] DEBUG [KafkaApi-1] Error while fetching metadata for [item_topic_0,0]. Possible cause: null (kafka.server.KafkaApis)

 

Some related JIRAs have been filed for these issues, for example, KAFKA-1066 and KAFKA-1122.

 

In order to resolve this issue, it is necessary to discuss some logging and exception handling conventions along with the existing code styles described here. This wiki page is created to summarize some implicit conventions we have been following, and to inspire discussions about these conventions. The hope is that moving forward, we will have a better sense about exception handling and logging when writing / reviewing code.

Error Handling

1. When calling some API functions of some Java / Scala libraries or other Kafka modules, we need to make sure each  of the possible throwing checked exceptions are either 1) caught and handled, or 2) bypassed and hence higher level callers needs to handle them.

Also it is recommended to add the throwable exceptions list for all public API functions with @exception.

2. When logging is needed in exception handling, we need to make a careful call about 1) which logging level should we use (a good guidance would be this, for example); and 2) whether to print exception name, exception message or exception trace.

Usually, if it is clear where the exception is thrown (for example, it is a Kafka exception and hence we are clear which inner functions may throw it), it is recommended to just print the exception name and message (i.e. exception.toString);

If the error is logged with WARN or below, usually we can just print the exception name; if the error is logged with ERROR or FATAL, we need to include the stack trace.

3. In addition, we usually include description of the current state of the system, possible causes of the thrown exceptions, etc in the logging entry.

Logging

When to Log

We usually need to add some logging entries that are expected to record under normal operations:

1. At the beginning / end of module startup(), and shutdown().

2. At the beginning / end of some important phases of background / special case logic such as LogManager.cleanupLogs(), Controller.onBrokerFailure(), etc

 

What to Log

In an log entry, we usually needs to include the following information:

  1. Related local / global variable's values within the function / code block.

    For example, the fetch request string for logging "request handling failures", the current replicas LEO values when advancing the partition HW accordingly, etc.

  2. For exception logging (WARN / ERROR), include the possible cause of the exception, and the handling logic that is going to execute (closing the module, killing the thread, etc).

And here are a few general rules we take for logging format:

  1. Each logging entry is usually one-line for the ease of text parsing tools upon trouble shooting, unless it will cause a very long line that may span over multiple screens.

    For example, the topic-change listener firing log, that may contain hundreds of partitions information.

  2. The logging indent need to be concise and self-distinctive (this is usually also related to client id naming).

  3. Use canonical toString functions of the struct whenever possible.

    For example, instead of using "topic [%s, %d]".format(topic, partitionId), use "topic %s".format(TopicAndPartition(topic, partitionId)).

Logging Cleaning Actions

Based on some discussions around these points, we have proposed several actions so far to clean our logging for now, this following list may expand as we discuss along:


1. Compress request logs.

The request handling logs has been a pain for a long time for us. Since it records one entry for each of the request received and handled, and simply due to the large number of clients and some clients may not follow any backoff mechanism in sending metadata refresh requests, etc, its logging files can easily grow to hundreds of GBs in a day. The plan is to change the request logs to default to DEBUG level (summary) with TRACE level (requests detail) printed as binary format at the same time, along with compression / rolling of the binary logs. The DEBUG level information is sufficient under normal operations, and we only need to de-compress and transform the request details logs into text when trouble shooting is needed.

Details can be found in KAFKA-1590.

2. Clean-up unnecessary stack traces.

This one is kind of obvious, we need to go through the Kafka code base and check for all these prints of stack traces, if it is really necessary.

Details can be found in KAFKA-1591.

3. Clean-up unnecessary INFO level log4j entries.

Some of the INFO level logging entries should really be DEBUG or even TRACE, we need to reduce them so that the sever / client logs are clean and concise under normal operations.

Details can be found in KAFKA-1592.

  • No labels

22 Comments

  1. These all make sense and I agree they are important to-dos that should be done.  

    +1 fixing "Closing socket connection"

    +1 more debug logs... I have had to add debug logs on many occasion but never patched because the priority didn't seem to be there for someone to review and commit it.

    +1 no more "possible cause null"

    In addition I would also add:

    1) we need to-do more about how we "bubble up" exceptions in the default event handler to give more information to clients on errors e.g.

    Error rendering macro 'jira'

    Unable to locate Jira server for this macro. It may be due to Application Link configuration.

    2) There are also the cases where INFO should be WARN (at least) that should change too (e.g. https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L291) (other places also just not jumping at me off the top of my head like this one).

    3) more debug statements, I have to go in to ad in debug statements to troubleshoot issues frequently enough.... not sure what the process here should be do I create a JIRA and upload a patch? I could have one of my developers (now that I have developers) do it so I can review and commit it... 

    A lot of these are "simple" changes that can get made but we don't have a great way to collaborate on this as a community. As was stated on the mailing list "we talked about this at LinkedIn" and I think the approach that was taken with the confluence page and discuss on the mailing list was a great way to proceed and look forward to this type of collaboration (and more) moving ahead.

    I don't understand though what we are doing with log cleanup... isn't that all just managed by log4j properties why do we have to-do anything?

    1. Thanks Joe.

       

      1) Yes, and I think one general reason is that previously when we make calls to some other kafka classes we did not carefully check which exceptions can be throwable and simply handle all in "catch (Exception e)" or even "catch (Throwable t)". On the other hand, we also want to make only necessary changes to the old clients (e.g. default event handler in producer) as the new clients are on the way. But if you feel it is really a pain and would like to fix it feel free to open an JIRA for that.

       

      2) Not sure which INFO entry are you referring to?

       

      "Log cleanup" here is a little misleading, what I meant is that there are some cases when the stack trace are unnecessarily printed in logging that pollutes the logs, and in some other cases we are making some entries to INFO but they should really be DEBUG / lower levels. Since usually people use INFO level logging in production, some of those unnecessary INFO logging is also swamping the logs.

       

      I would like to see the outcome of this discussion to be two folds, 1) moving forward when we are writing / reviewing code we would think more about logging and error handling; 2) a list of to-dos that people agree on and distributed the work as JIRAs in the community.

  2. +1 for fixing the issues mentioned in "background", they are indeed painful.

    Is there a JIRA for the issues mentioned in the first example? (closing socket and connection unsuccessful) I'll be more than happy to pick those up.

    +1 for documenting exceptions in public APIs. 

    Regarding logging:

    We have threads that wake up periodically to check something and only take action if there's a need (topic delete, log compaction, etc). We need to decide when and how much they log. If they log only when they take action (current status), users may worry that they are not working at all. Occasionally writing something like "Checking if action X is needed, decided it isn't based on value of Y, going back to sleep" can be  very helpful. 

     

     

    1. Thanks Gwen,

       

      The first example is covered in KAFKA-1592.

       

      Regarding the logging of the log compaction thread, I think to tackle the concern that this background thread is still alive and working, we'd better add a metric of the liveness and working progress of this thread instead of adding to the INFO logging. The reason is that this thread will do the checking periodically (default backoff is 15 secs), and if each time it wakes up some logs needs to be recorded that would be multiple lines per minute.

       

      In general, I would suggest to support checking system state (whether the daemon threads are still alive, whether there is a unique controller working, under replicated count, etc) as metrics, and only record indication of an action / event happening (server starting / shutting down, log compaction triggered, leader re-elected, controller migrated, new requests received, etc) in logs.

  3. Thanks! I see that KAFKA-1592 is in good hands, so hopefully this issue will be resolved soon.

    I agree that metrics are good solution for liveness. However, I prefer logging the "decision" and not just the "event". 

    Things like "log is only 20% dirty, below threshold of 50%". 

    I found it very useful for troubleshooting cleanup jobs that happen in the background. We do something similar in HDFS namenode (logging why we did or didn't truncate the edits file) and it was useful in production more than once to answer the question of "why did the cleanup not happen as we expected". 

     

     

    1. Hi Gwen, I think I agree with you. We just need to be aware that if this entry is made INFO then we would expect multiple entries per second for this background thread under normal operation when INFO is the default level.

      1. Agree.

        I think the general guideline should be something like:

        Reserve INFO messages to events that under normal conditions don't happen more often than every few minutes.

        1. Sorry I meant "per minute" since the default wake up period is 15 secs (tongue)

  4. For point 1: Closing socket connection - XXXXXXX

    Do we need to discard this from logging or changing from INFO level to DEBUG level will work? 

    1. Just change it from INFO to DEBUG should work since in practice people would just set log4j level to INFO.

      1. If we want to keep "Closing socket connection" then how about setting it to TRACE? I can't say it has much usefulness when debugging and will just get in the way there too.

          1. Yes, TRACE is a better option. Will make a change for this.

  5. Another point that Joel Koshy brought up is on Utils.swallow() function we are always printing the stack trace of the swallowed exception, and we have seen some transient exception traces while shutting down the server quite frequently. We could probably only print stack traces only for swallowError and be careful using swallowError function calls.

  6. Another repeating entry in unit test case server log is 'Awaiting socket connections on XXXXX:YYYY' from Kafka server code.  I think this entry should also be logged as TRACE. Please suggest over this??

    Apart from above following below two entries are very prominent - 

    1) Many WARN log entries 'Failed to send producer request with correlation id XX to broker YY with data for partitions' from class 'DefaultEventHandler'

                    java.net.SocketTimeoutException
                    at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
                    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
                    at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)

    2) Many INFO entries followed by WARN entry 'Reconnect due to socket error:' from class 'SimpleConsumer'.

    Do we need handle them?

    1. 0): This line will only be called once in starting the kafka server. For unit tests we are creating one or more brokers for each test, but the log4j level is default to OFF so one should not see those entries unless he overrides to INFO.

      1): This will happen when the broker is not available or when the partition leaders have migrated, which should be infrequent in practice; however, since producer retries sending it could generates many such entries during the window. I would suggest keeping it as WARN but replace the stack trace by t.toString().

      2) +1. This may need to be fixed. Could you give an example of the INFO entries?

      1. Here are the INFO & WARN entries for point 3 - 

        INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
        WARN [ReplicaFetcherThread-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 5; ClientId: ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [new-topic,0] -> PartitionFetchInfo(2,1048576). Possible cause: java.nio.channels.ClosedChannelException (kafka.server.ReplicaFetcherThread)

        INFO log entry is from class SimpleConsumer line number 70.  Detailed log on expansion I got is -

        java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:97)
        at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:70)
        at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:112)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:112)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:112)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:97)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:89)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

        WARN log entry is from class AbstractFetcherThread line number is 101. 

        Please, do let me know what to do in this case?

         

        1. Hmm, both these two places are not printing the stack traces, so I am not sure how the stack trace are created in your example?

          1. Yes, I just changed it to get the full stack trace. My point is below mentioned log entry are very repetitive.  Do we need to handle this?

            INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
            WARN [ReplicaFetcherThread-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 5; ClientId: ReplicaFetcherThread-0-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [new-topic,0] -> PartitionFetchInfo(2,1048576). Possible cause: java.nio.channels.ClosedChannelException (kafka.server.ReplicaFetcherThread)

            1. I think the problem here is that the abstract fetcher thread uses a simple consumer to fetch data, and when the underlying simple consumer hit an exception, it would log it as INFO and retry the connection again before throwing it above to the abstract fetcher thread, which will then catch it as WARN.

               

              The problem is that, 1) for exceptions such as ClosedChannel, the retry would almost always fail again, causing the INFO / WARN pattern, and 2) for replica fetcher, it will not handle the exception but will retry almost immediately until it gets the LeaderISR request from controller, causing the pattern to repeat very frequently.

               

              So I think from the logging point of view, it is actually the right approach now. What we need to fix is "in ReplicaFetcherThread, implement the currently empty handlePartitionsWithErrors() function."

              1. please create a jira ticket related to it. I would like to work on it.

                1. Have created KAFKA-1629. PS, I have also commented on the stack trace JIRA.