Status
Current state: Under Discussion
Discussion thread: here
JIRA: KAFKA-18455
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Currently, when a client tries to connect to a broker and is throttled, or has to wait for an available connection slot because the maximum connection limit has been reached, the broker emits no logs or metrics for these scenarios. The client sees only a connection timeout exception, which gives too little information for troubleshooting. Adding connection-related metrics would improve observability and help users diagnose connection issues.
Public Interfaces
| MetricName | Type | Group | Tag | Description | JMX Bean |
|---|---|---|---|---|---|
| waiting-connection | Gauge | Acceptor | listener:<listener_name> | Number of waiting connections for the specific listener | kafka.network:type=Acceptor,name=waiting-connection,listener={listener_name} |
| connection-latency | Histogram | Acceptor | listener:<listener_name> | Connection wait time for the specific listener | kafka.network:type=Acceptor,name=connection-latency,listener={listener_name} |
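As an illustration of how a monitoring tool might address these beans, the following sketch builds the JMX object names from the pattern in the table. The class `AcceptorMetricNames` and the listener name `PLAINTEXT` are assumptions made for this example and are not part of the proposal; only the standard `javax.management.ObjectName` API is used.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Sketch only: a hypothetical helper that assembles the object names listed in the table.
final class AcceptorMetricNames {
    private AcceptorMetricNames() {}

    // metricName is "waiting-connection" or "connection-latency";
    // listenerName is the broker listener, e.g. "PLAINTEXT" (assumed here).
    static ObjectName forListener(String metricName, String listenerName) {
        try {
            return new ObjectName(
                "kafka.network:type=Acceptor,name=" + metricName + ",listener=" + listenerName);
        } catch (MalformedObjectNameException e) {
            // Names containing characters such as ',' or '=' would need ObjectName.quote().
            throw new IllegalArgumentException("Invalid metric or listener name", e);
        }
    }
}
```

A JMX client could pass the resulting `ObjectName` to `MBeanServerConnection.getAttribute` once the metrics exist; the attribute names exposed would depend on the final implementation.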
Proposed Changes
We propose adding the metrics described in the Public Interfaces section to help users diagnose connection quota issues.
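To make the intended bookkeeping concrete, here is a minimal sketch of how the two values could be tracked per listener. It is not the SocketServer implementation: the class `ConnectionWaitTracker` and its methods are hypothetical, it uses only JDK classes, and it keeps just a count, sum, and maximum where a real histogram would record a full distribution. Timestamps are passed in explicitly so the logic is easy to test.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: hypothetical per-listener tracker, not Kafka's actual code.
final class ConnectionWaitTracker {
    // Would back a gauge such as waiting-connection.
    private final AtomicInteger waiting = new AtomicInteger();
    // Simplified stand-ins for a connection-latency histogram.
    private final AtomicLong completedWaits = new AtomicLong();
    private final AtomicLong totalWaitNanos = new AtomicLong();
    private final AtomicLong maxWaitNanos = new AtomicLong();

    // Call when a connection starts waiting (throttled or no free slot).
    long beginWait(long nowNanos) {
        waiting.incrementAndGet();
        return nowNanos;
    }

    // Call when the connection is finally accepted; records the wait time.
    void endWait(long startNanos, long nowNanos) {
        long elapsed = Math.max(0L, nowNanos - startNanos);
        waiting.decrementAndGet();
        completedWaits.incrementAndGet();
        totalWaitNanos.addAndGet(elapsed);
        maxWaitNanos.accumulateAndGet(elapsed, Math::max);
    }

    int waitingConnections() {
        return waiting.get();
    }

    long maxWaitNanos() {
        return maxWaitNanos.get();
    }

    double meanWaitNanos() {
        long count = completedWaits.get();
        return count == 0 ? 0.0 : (double) totalWaitNanos.get() / count;
    }
}
```

In a broker, `System.nanoTime()` would be the natural source for both timestamps, and the values would be published through the broker's existing metrics registry rather than read directly.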
Compatibility, Deprecation, and Migration Plan
N/A.
These are new metrics, so there are no compatibility concerns.
Test Plan
The new metrics will be covered by unit and integration tests to verify their correctness.
Rejected Alternatives
Adding logs to the SocketServer
This alternative was rejected because Kafka is a high-throughput system handling numerous concurrent connections.
Adding logs for connection throttling and limit exceeded scenarios would likely result in log flooding, potentially causing:
- I/O overhead
- Storage space issues
- Difficulty identifying critical issues among the massive volume of connection logs
Using metrics instead of logs provides a more suitable solution for monitoring connection states without the overhead of extensive logging.