Connect will attempt to retry the failed operation for a configurable total duration, starting with a fixed duration (value of 300ms) and with exponential backoff between each retry.

Configuration Name	Description	Default Value	Domain
errors.retry.timeout	The total duration a failed operation will be retried for.	0	[-1, 0, 1, ... Long.MAX_VALUE], where -1 means infinite duration.
errors.retry.delay.max.ms	The maximum delay between two consecutive retries (in milliseconds). Jitter will be added to the delay once this limit is reached to prevent any thundering herd issues.	60000	[1, ... Long.MAX_VALUE]

Task Tolerance Limits

Tolerate up to a configurable number of errors in a task. A failed operation is declared to be an error only if Connect has exhausted all retry options. If the task fails to successfully perform an operation on a record within tolerance limit, the record is skipped. Once the tolerance limit is reached, the task will fail. Tolerance limits can be configured using the following new properties:

Config Option	Description	Default Value	Domain
errors.tolerance	Fail the task if we exceed specified number of errors in the observed duration.	none	[none, all].

Log Error Context

The error context and processing information can be logged along with the standard application logs using the following configuration properties:

Config Option	Description	Default Value	Domain
errors.log.enable	Log the error context along with the other application logs. This context includes details about the failed operation, and the record which caused the failure.	false	Boolean
errors.log.include.messages	Whether to include the Connect Record in every log. This is useful if users do not want records to be written to log files because they contain sensitive information, or are simply very large. If this property is disabled, Connect will still log some minimal information about the record (for example, the source partition and offset if it is a SourceRecord, and Kafka topic and offset if it is a SinkRecord).	false	Boolean

Dead Letter Queue (for Sink Connectors only)

For sink connectors, we will write the original record (from the Kafka topic the sink connector is consuming from) that failed in the converter or transformation step into a configurable Kafka topic.

Config Option	Description	Default Value	Domain
errors.deadletterqueue.topic.name	The name of the dead letter queue topic. If not set, this feature will be disabled.	""	A valid Kafka topic name
errors.deadletterqueue.topic.replication.factor	Replication factor used to create the dead letter queue topic when it doesn't already exist.	3	[1 ... Short.MAX_VALUE]
errors.deadletterqueue.context.headers.enable	If true, multiple headers will be added to annotate the record with the error context	false	Boolean

If the property errors.deadletterqueue.context.headers.enable is set to true, the following headers will be added to the produced raw message (only if they don't already exist in the message). All values will be serialized as UTF-8 strings.

Header Name	Description
__connect.errors.topic	Name of the topic that contained the message.
__connect.errors.partition	The numeric ID of the partition in the original topic that contained the message (encoded as a UTF-8 string).
__connect.errors.offset	The numeric value of the message offset in the original topic (encoded as a UTF-8 string).
__connect.errors.connector.name	The name of the connector which encountered the error.
__connect.errors.task.id	The numeric ID of the task that encountered the error (encoded as a UTF-8 string).
__connect.errors.stage	The name of the stage where the error occurred.
__connect.errors.class.name	The fully qualified name of the class that caused the error.
__connect.errors.exception.class.name	The fully qualified classname of the exception that was thrown during the execution.
__connect.errors.exception.message	The message in the exception.
__connect.errors.exception.stacktrace	The stacktrace of the exception.

Metrics

The following new metrics will monitor the number of failures, and the behavior of the response handler. Specifically, the following set of counters:

...

MBean name: kafka.connect:type=task-error-metrics,connector=([-.\w]+),task=([-.\w]+)

Metric/Attribute Name	Description	Implemented
total-record-failures	Total number of failures seen by this task.	2.0.0
total-record-errors	Total number of errors seen by this task.	2.0.0
total-records-skipped	Total number of records skipped by this task.	2.0.0
total-retries	Total number of retries made by this task.	2.0.0
total-errors-logged	The number of messages that was logged into either the dead letter queue or with Log4j.	2.0.0
deadletterqueue-produce-requests	Number of produce requests to the dead letter queue.	2.0.0
deadletterqueue-produce-failures	Number of records which failed to produce correctly to the dead letter queue.	2.0.0
last-error-timestamp	The timestamp when the last error occurred in this task.	2.0.0

Proposed Changes

A connector consists of multiple stages. For source connectors, Connect retrieves the records from the connector, applies zero or more transformations, uses the converters to serialize each record’s key, value, and headers, and finally writes each record to Kafka. For sink connectors, Connect reads the topic(s), uses the converters to deserialize each record’s key, value, and headers, and for each record applies zero or more transformations and delivers the records to the sink connector. In this proposal, we will specifically deal with the following failure scenarios which can occur during these stages:

Operation	Will Retry?	Tolerated Exceptions
Transformation	only on org.apache.kafka.connect.errors.RetriableException	java.lang.Exception
Key, Value and Header Converter	only on org.apache.kafka.connect.errors.RetriableException	java.lang.Exception
Kafka Produce and Consume	only on org.apache.kafka.common.errors.RetriableException	only on org.apache.kafka.connect.errors.RetriableException, fail task otherwise.
put() in SinkTask and poll() in SourceTask	only on org.apache.kafka.connect.errors.RetriableException	only on org.apache.kafka.connect.errors.RetriableException, fail task otherwise.

There are two behavioral changes introduced by this KIP. First, a failure in any stage will be reattempted, thereby “blocking” the connector. This helps in situations where time is needed to manually update an external system, such as manually correcting a schema in the Schema Registry. More complex causes, such as requiring code changes or corruptions in the data that can’t be fixed externally, will require the worker to be stopped, data to be fixed and then the Connector to be restarted. In the case of data corruption, the topic might need to be cleaned up too. If the retry limit for a failure is reached, then the tolerance limit is used to determine if this record should be skipped, or if the task is to be killed. The second behavioral change is introduced in how we report these failures. Currently, only the exception which kills the task is written with the application logs. With the additions presented in this KIP, we are logging details about the failed operation along with the bad record. We are also introducing an option to write bad records into a dead letter queue for Sink Connectors. This would write the original key, value and headers of failed records into a configured Kafka topic.

...

Write records that fail in the put() step of a sink connector to the dead letter queue: this is beyond the scope of this since sink connectors can chose to batch records in a put() method, it is not clear what errors are caused by what records (they might be because of records that were immediately written to put(), or by some previous records that were processed later). Also, there might be connection issues that are not handled by the connector, and simply bubbled up as IOException (for example). Effectively, errors sent back to the framework from the put() method currently do not have sufficient context to determine the problematic records (if any). Addressing these issues would need a separate KIP.

Space shortcuts

Child pages

Versions Compared

Old Version 67

New Version Current

Key

Task Tolerance Limits

Log Error Context

Dead Letter Queue (for Sink Connectors only)

Metrics

Proposed Changes

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 67

New Version Current

Key

Task Tolerance Limits

Log Error Context

Dead Letter Queue (for Sink Connectors only)

Metrics

Proposed Changes