Currently, only some of the exceptions that occur during processing are wrapped as a StreamsException before being handed up to the uncaught exception handler. We make no guarantees about which exceptions will or will not be wrapped as a StreamsException, which complicates the logic required in a user's custom exception handler and makes it difficult to enforce any kind of compatibility. I have also found the meaning of "StreamsException" to be rather ambiguous: is it just the basic exception type of Kafka Streams, does it mean the exception came from Kafka Streams rather than user code, or does it indicate an unrecoverable error? This has been a repeated point of confusion for both users and devs: devs are unsure how and when to wrap exceptions, and users are unsure when to unwrap them and how to handle them.
It would be cleaner to ensure that all exceptions thrown to the user/handler are wrapped (exactly once) as a StreamsException, and to standardize its definition and use.
Further, many exceptions can be traced back to a particular task that's experiencing an error specific to that part of the topology, or that partition, etc. It can be helpful to have that information when determining how to handle the exception, as well as for debugging purposes. We propose to add a TaskId field to the StreamsException class to help users identify the source of an exception.
- Guarantee that every exception that is thrown up to the uncaught exception handler, whether that be the new StreamsUncaughtExceptionHandler or the old generic UncaughtExceptionHandler, is wrapped as a StreamsException.
- Standardize the definition of StreamsException: a top-level exception that indicates an error has occurred during Streams internal processing, and wraps that error alongside any available info/context. Note that this does not mean the error came from Streams itself.
- Add a new TaskId field to the StreamsException class, with a getter API to expose it and corresponding constructors. This field will be set for any exception that originates from, or is tied to, a specific task. For example:
- Task timeout (i.e. the task exceeds the configured task.timeout.ms value)
- User processing error or other exception thrown from Task#process
- Exceptions arising from task management, such as suspending, closing, or flushing a task
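
The changes listed above can be illustrated with a minimal, self-contained sketch. It does not use the real Kafka Streams classes: TaskId and StreamsException below are simplified stand-ins, and the Optional-returning taskId() getter, the constructor shapes, and the ExceptionWrapping helper with its wrapOnce and describe methods are assumptions made for illustration only. The proposal itself only states that a TaskId field, a getter, and corresponding constructors will be added.

```java
import java.util.Optional;

// Hypothetical stand-in for Kafka Streams' TaskId so the sketch compiles on its own.
final class TaskId {
    final int subtopology;
    final int partition;

    TaskId(int subtopology, int partition) {
        this.subtopology = subtopology;
        this.partition = partition;
    }

    @Override
    public String toString() {
        return subtopology + "_" + partition;
    }
}

// Stand-in for StreamsException carrying the proposed task field.
class StreamsException extends RuntimeException {
    private final TaskId taskId; // null when the error is not tied to a specific task

    StreamsException(String message, Throwable cause) {
        this(message, cause, null);
    }

    StreamsException(String message, Throwable cause, TaskId taskId) {
        super(message, cause);
        this.taskId = taskId;
    }

    // Assumed getter shape: empty when no task is associated with the error.
    Optional<TaskId> taskId() {
        return Optional.ofNullable(taskId);
    }
}

// Hypothetical helper showing the "wrapped exactly once" guarantee.
final class ExceptionWrapping {
    private ExceptionWrapping() {}

    // An existing StreamsException is passed through unchanged; anything else is wrapped.
    static StreamsException wrapOnce(Throwable error, TaskId taskId) {
        if (error instanceof StreamsException) {
            return (StreamsException) error;
        }
        return new StreamsException("Error encountered during processing", error, taskId);
    }

    // What a user's handler could do with the extra context.
    static String describe(StreamsException e) {
        String source = e.taskId().map(id -> "task " + id).orElse("no specific task");
        return source + ": " + e.getCause();
    }
}
```

With the guarantee in place, a handler would no longer need to check whether the throwable it receives is already a StreamsException: it could always unwrap one level with getCause() and consult the task, if any. Note that this sketch passes an already-wrapped exception through as-is, so a task id known only at the outer layer would not be attached; how that case should be handled is left open here.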