Comment: Alter proposal for distributing session keys, and a few additional tweaks

...

It should be noted that the goal here is not to completely secure any Kafka Connect cluster, but rather to patch an existing security hole for clusters that are already intended to be secure. Examples of steps that should be taken to secure a Kafka Connect cluster include securing the public REST API (which can be done using a Connect REST extension), securing the worker group (which can be done with ACLs on the Kafka broker), and securing the internal topics used by Connect to store configurations, statuses, and offsets for connectors (which can also be done with ACLs on the Kafka broker). If any of these steps is not taken, the cluster is insecure anyway; therefore, it is acceptable for a fix for the problem posed by the internal REST endpoint used by Connect to rely on these precautions being in place.

However, it is not a goal of this KIP to limit the rollout of the new features based on whether the Connect cluster is already secure. With the pluggable nature of Kafka and Kafka Connect authorization, it would be difficult to know if all of the important Connect resources (worker group, internal topics, and Kafka Connect REST API, to name a few) are actually secured. Additionally, the presence of other attack vectors shouldn't be justification for opening up a new one; leaving this endpoint unprotected seems particularly dangerous in the event that someone accidentally misconfigures their Connect cluster and the endpoint ends up being exposed even though the user believes it to be protected.

Public Interfaces

There will be five new configurations added for distributed workers:

...

A new Connect subprotocol, sessioned, will be implemented that will be identical to the cooperative incremental protocol but with a higher protocol version number (2, instead of the current version for cooperative incremental rebalancing, which is 1). One downside of this approach is that the use of cooperative incremental assignments will be required in order to enable this new security behavior; however, given the lack of any serious complaints about the new rebalancing protocol thus far, this seems preferable to trying to enable this behavior across both assignment styles.

If the connect.protocol property is set to sessioned, the worker will advertise this new sessioned protocol to the Kafka group coordinator as a supported (and, currently, most preferable) protocol. If that protocol is then agreed on by the cluster during group coordination, a session key will be randomly generated and distributed to the cluster via the config topic. This key will be used by followers to sign requests to the internal endpoint, and verified by the leader to ensure that the request came from a current group member. It is imperative that inter-worker communication have some kind of transport layer security; otherwise, this session key will be leaked during rebalance to anyone who can eavesdrop on request traffic.
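The sign-and-verify exchange described above can be sketched with the JDK's standard MAC API. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of Connect, HmacSHA256 is assumed as the algorithm, and the request body is a made-up placeholder.

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;

// Hypothetical sketch; none of these names come from the Connect code base.
public class InternalRequestSignatureSketch {

    // Randomly generate a session key suitable for the given MAC algorithm.
    public static SecretKey generateSessionKey(String algorithm) throws GeneralSecurityException {
        return KeyGenerator.getInstance(algorithm).generateKey();
    }

    // Follower side: compute a signature over the raw request body.
    public static byte[] sign(SecretKey key, String algorithm, byte[] requestBody)
            throws GeneralSecurityException {
        Mac mac = Mac.getInstance(algorithm);
        mac.init(key);
        return mac.doFinal(requestBody);
    }

    // Leader side: recompute the signature and compare it with the one received.
    // MessageDigest.isEqual avoids leaking timing information during the comparison.
    public static boolean verify(SecretKey key, String algorithm, byte[] requestBody, byte[] signature)
            throws GeneralSecurityException {
        return MessageDigest.isEqual(sign(key, algorithm, requestBody), signature);
    }

    public static void main(String[] args) throws Exception {
        String algorithm = "HmacSHA256"; // assumed algorithm for this sketch
        SecretKey sessionKey = generateSessionKey(algorithm);
        byte[] body = "[{\"task.config\":\"placeholder\"}]".getBytes(StandardCharsets.UTF_8);

        byte[] signature = sign(sessionKey, algorithm, body);
        System.out.println("valid signature accepted: " + verify(sessionKey, algorithm, body, signature));

        // A request signed with a different key must be rejected.
        SecretKey otherKey = generateSessionKey(algorithm);
        System.out.println("foreign key accepted: " + verify(otherKey, algorithm, body, signature));
    }
}
```

How the signature and algorithm name would be carried on the actual request (for example, in headers) is not shown here.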

Periodically (with frequency dictated by the internal.request.key.rotation.interval.ms property), the leader will compute a new session key and distribute it to the cluster.

The default algorithm used to sign requests will be HmacSHA256; this algorithm is guaranteed to be supported on all implementations of the Java Platform (source). However, users will be able to configure their cluster to use other algorithms with the internal.request.signature.algorithm property if, for example, the default is not suitable for compliance with an existing security standard.
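As a rough illustration, the worker properties named so far could be assembled as follows. The property names and the sessioned and HmacSHA256 values come from the text above; the one-hour rotation interval is an assumed example value rather than a stated default, and the class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch of a distributed worker's settings for this feature.
public class SessionedWorkerConfigSketch {

    public static Properties sessionedWorkerProps() {
        Properties props = new Properties();
        // Advertise the new subprotocol to the group coordinator.
        props.setProperty("connect.protocol", "sessioned");
        // Rotate the session key once per hour (assumed example value).
        props.setProperty("internal.request.key.rotation.interval.ms",
                String.valueOf(60 * 60 * 1000L));
        // Algorithm used to sign internal requests; HmacSHA256 is the proposed default.
        props.setProperty("internal.request.signature.algorithm", "HmacSHA256");
        return props;
    }

    public static void main(String[] args) {
        sessionedWorkerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a real deployment these would normally live in the worker's properties file alongside the other distributed worker settings.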

...

The leader will only accept requests signed with the most current key. This should not cause any major problems; if a follower attempts to make a request with an expired key (which should be quite rare and only occur if the request is made by a follower that is not fully caught up to the end of the config topic), the initial request will fail, but will be subsequently retried after a backoff period. This backoff period should leave sufficient room for the rebalance to complete. One potential downside is that, should this occur, an error-level log message of "Failed to reconfigure connector's tasks, retrying after backoff: " followed by a stack trace will be generated. This can be mitigated by altering the log message or the generated exception to include a note that this may not be an issue if key rotation is enabled, and/or by logging an info-level message after successfully completing task reconfiguration, potentially including a note that any earlier error messages related to task reconfiguration may be safely disregarded.
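The reject-then-retry behavior for an expired key can be simulated in a few lines. This is a toy model under stated assumptions, not Connect's implementation: the class, the method names, the key material, and the backoff value are all hypothetical, and the follower's view of the key is modeled as a supplier that returns a stale key once before returning the current one.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.function.Supplier;

// Hypothetical simulation of a follower retrying after its key has expired.
public class ExpiredKeyRetrySketch {
    static final String ALGORITHM = "HmacSHA256";

    public static byte[] sign(byte[] key, byte[] body) throws GeneralSecurityException {
        Mac mac = Mac.getInstance(ALGORITHM);
        mac.init(new SecretKeySpec(key, ALGORITHM));
        return mac.doFinal(body);
    }

    // Leader side: only a signature made with the most current key is accepted.
    public static boolean leaderAccepts(byte[] currentKey, byte[] body, byte[] signature)
            throws GeneralSecurityException {
        return MessageDigest.isEqual(sign(currentKey, body), signature);
    }

    // Follower side: sign with whatever key is known right now; on rejection,
    // wait for the backoff and try again. Returns the number of attempts used,
    // or -1 if every attempt was rejected.
    public static int sendWithRetry(byte[] leaderKey, Supplier<byte[]> followerKey, byte[] body,
                                    int maxAttempts, long backoffMs)
            throws GeneralSecurityException, InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            byte[] signature = sign(followerKey.get(), body);
            if (leaderAccepts(leaderKey, body, signature)) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoffMs);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        byte[] oldKey = "old-session-key".getBytes(StandardCharsets.UTF_8);
        byte[] newKey = "new-session-key".getBytes(StandardCharsets.UTF_8);
        byte[] body = "task-configs".getBytes(StandardCharsets.UTF_8);

        // The follower sees the stale key on its first read and the new key afterwards.
        int[] reads = {0};
        Supplier<byte[]> followerKey = () -> reads[0]++ == 0 ? oldKey : newKey;

        int attempts = sendWithRetry(newKey, followerKey, body, 3, 10L);
        System.out.println("accepted after " + attempts + " attempt(s)"); // prints "accepted after 2 attempt(s)"
    }
}
```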

Compatibility, Deprecation, and Migration Plan

Backwards compatibility

All of the proposed configurations here have default values, making them backwards compatible.

Reverting an upgrade

The group coordination protocol will be used to ensure that all workers in a cluster support verification of internal requests before this behavior is enabled; therefore, a rolling upgrade of the cluster will be possible. In line with the regression plan for KIP-415: Incremental Cooperative Rebalancing in Kafka Connect, if it is desirable to disable this behavior for some reason, the connect.protocol configuration can be set to compatible or default for one (or more) workers, and it will automatically be disabled.

Migrating to a new request signature algorithm

If a new signature algorithm should be used, a rolling upgrade will be possible with the following steps (assuming a new algorithm of HmacSHA489):

...

Rejected because: Achieving consensus in a Connect cluster about whether to begin engaging in this new topic-based protocol would require either reworking the Connect group coordination protocol or installing several new configurations and a multi-stage rolling upgrade in order to enable it. Requiring new configurations and a multi-stage rolling upgrade for the default use case of a simple version bump for a cluster would be a much worse user experience, and if the group coordination protocol is going to be reworked, we might as well just use the group coordination protocol to distribute keys instead. Additionally, the added complexity of switching from a synchronous to an asynchronous means of communication for relaying task configurations to the leader would complicate the implementation enough that reworking the group coordination protocol might even be a simpler approach with smaller changes required.

Distribute session key during rebalance

Summary: Instead of distributing a session key via the config topic, include the session key as part of the worker assignment handed out during rebalance. Periodically force a rebalance in order to rotate session keys.

Rejected because: The implementation complexity of adding a session key to the rebalance protocol would be quite high, and the additional API would complicate the code base significantly. Additionally, there are few, if any, advantages compared to distributing the keys via the config topic.