Versions Compared

Comment: Tweak configuration changes

...

There will be five new configurations added for distributed workers:

  • internal.request.key.generation.algorithm
    • Purpose: the algorithm used to generate session keys
    • Type: string
    • Default: "HmacSHA256"
  • internal.request.key.size
    • Purpose: the size of generated session keys, in bits
    • Type: int
    • Default: 256
  • internal.request.key.rotation.interval.ms
    • Purpose: how often to force a rotation of the internal key used for request validation, or 0 if forced rotation should never occur
    • Type: long
    • Default: 3600000 (one hour)
  • internal.request.signature.algorithm
    • Purpose: the algorithm to use to sign internal requests when sent from a follower worker to the leader
    • Type: string
    • Default: "HmacSHA256"
  • internal.request.verification.algorithms
    • Purpose: a list of supported algorithms for verifying internal requests that are received by the leader from a follower
    • Type: list
    • Default: "HmacSHA256"

The default value for the connect.protocol configuration will also be altered from compatible to sessioned, so that the new request signing behavior proposed by this KIP will be enabled by default (once all workers in a cluster support it).
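For illustration, a distributed worker configuration that spells out the proposed defaults explicitly might look like the fragment below. This is only a sketch: the property names and values are taken from the list above and the body of this KIP, and none of them would need to be set if the defaults are acceptable.

```properties
# Sketch only: these are the defaults proposed in this KIP, written out explicitly.
connect.protocol=sessioned
internal.request.key.generation.algorithm=HmacSHA256
internal.request.key.size=256
# One hour; 0 means forced rotation never occurs.
internal.request.key.rotation.interval.ms=3600000
internal.request.signature.algorithm=HmacSHA256
internal.request.verification.algorithms=HmacSHA256
```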

Additionally, although not part of the public API, the POST /connectors/<name>/tasks endpoint will be effectively disabled for public use. This endpoint should never be called by users, but since until now there hasn't been anything to prevent them from doing so, it should still be noted that anything that relies on that endpoint will no longer work after these changes are made. The expected impact of this is low, however; the Connect framework (and the connectors it runs) handle the generation and storage of task configurations and there's no discernible reason for using that endpoint directly instead of going through the public Connect REST API.

Proposed Changes

A new Connect subprotocol, sessioned, will be implemented that will be identical to the cooperative incremental protocol but with the addition of a session-key field to the assignment schema, which will then be retained by follower workers for use in request signing and by the leader for use in request verification. One downside of this approach is that the use of cooperative incremental assignments will be required in order to enable this new security behavior; however, given the lack of any serious complaints about the new rebalancing protocol thus far, this seems preferable to trying to enable this behavior across both assignment styles. In addition, periodically forcing a rebalance in order to rotate keys would incur a heavy performance penalty on a cluster using eager assignment; this approach isn't really practical in that case.

If the connect.protocol property is set to sessioned, the worker will advertise this new sessioned protocol to the Kafka group coordinator as a supported (and, currently, most preferable) protocol. If that protocol is then agreed on by the cluster during group coordination, a session key will be randomly generated during each rebalance and distributed by the leader to each follower node. This key will be used by followers to sign requests to the internal endpoint, and verified by the leader to ensure that the request came from a current group member. It is imperative that inter-worker communication have some kind of transport layer security; otherwise, this session key will be leaked during rebalance to anyone who can eavesdrop on request traffic.

Periodically (with frequency dictated by the internal.request.key.rotation.interval.ms property), the leader will force a rebalance by requesting to rejoin the group and, in the process, compute a new session key and distribute it to each follower worker. The performance impact of these rebalances should be negligible given that all Connect clusters with this new feature will already support incremental cooperative rebalancing. Every time a rebalance occurs, the next scheduled rebalance for key rotation will be reset; that is, if the rotation interval is one hour, and a rebalance occurs thirty minutes after the most recent key rotation, the next key rotation will be rescheduled for one hour after the rebalance, as opposed to remaining at one hour after the most recent rotation.
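The rescheduling rule described above can be sketched as follows. This is a minimal illustration and not the proposed implementation; the class KeyRotationSketch and its methods are hypothetical names introduced only for this example, and timestamps are passed in explicitly so the behavior is easy to follow.

```java
// Hypothetical sketch of the key-rotation schedule; not part of the Connect codebase.
final class KeyRotationSketch {
    private final long rotationIntervalMs; // 0 means forced rotation never occurs
    private long nextRotationAtMs;

    KeyRotationSketch(long rotationIntervalMs, long nowMs) {
        if (rotationIntervalMs < 0) {
            throw new IllegalArgumentException("rotation interval must be >= 0");
        }
        this.rotationIntervalMs = rotationIntervalMs;
        onRebalance(nowMs);
    }

    // Every rebalance distributes a fresh key, so the next forced rotation
    // is pushed out to one full interval after the rebalance.
    void onRebalance(long nowMs) {
        nextRotationAtMs = rotationIntervalMs > 0
                ? nowMs + rotationIntervalMs
                : Long.MAX_VALUE;
    }

    // True once the leader should force a rebalance to rotate the key.
    boolean rotationDue(long nowMs) {
        return nowMs >= nextRotationAtMs;
    }

    long nextRotationAtMs() {
        return nextRotationAtMs;
    }
}
```

With a one-hour interval, a key rotation at time 0, and an unrelated rebalance at 30 minutes, this sketch schedules the next forced rotation for 90 minutes rather than 60, matching the example in the text.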

The default algorithm used to sign requests will be HmacSHA256; this algorithm is guaranteed to be supported on all implementations of the Java Platform (source). However, users will be able to configure their cluster to use other algorithms with the internal.request.signature.algorithm property if, for example, the default is not suitable for compliance with an existing security standard.

Similarly, the default algorithm used to generate request keys will also be HmacSHA256; again, this algorithm is guaranteed to be supported on all implementations of the Java Platform (source). And again, users will be able to configure their cluster to use other algorithms or keys of a different size with the internal.request.key.generation.algorithm and internal.request.key.size properties, respectively.
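As a rough sketch of what key generation with these two settings could look like, the standard javax.crypto.KeyGenerator API accepts both an algorithm name and a key size in bits. The class and method names below are hypothetical and exist only for illustration; how the leader would actually wire this in is not specified here.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical helper; the names are illustrative and not part of this KIP.
final class SessionKeySketch {
    // algorithm   -> value of internal.request.key.generation.algorithm
    // keySizeBits -> value of internal.request.key.size
    static SecretKey generateSessionKey(String algorithm, int keySizeBits) {
        try {
            KeyGenerator generator = KeyGenerator.getInstance(algorithm);
            generator.init(keySizeBits);
            return generator.generateKey();
        } catch (NoSuchAlgorithmException e) {
            // A misconfigured algorithm name would surface here.
            throw new IllegalArgumentException(
                    "Unsupported key generation algorithm: " + algorithm, e);
        }
    }
}
```

Calling generateSessionKey("HmacSHA256", 256) corresponds to the proposed defaults and yields a 32-byte key.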

Each signed request will include two headers:

  • X-Connect-Authorization: the signature of the request body (base 64 encoded)
  • X-Connect-Request-Signature-Algorithm: the signature algorithm used to sign the request

When a request is received by the leader, the request signature algorithm described by the X-Connect-Request-Signature-Algorithm header will be used to sign the request body and the resulting signature will be checked against the contents of the X-Connect-Authorization header. If the contents do not match, or the request signature algorithm is not in the list of permitted algorithms controlled by the internal.request.verification.algorithms property, the request will be rejected.
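The signing and verification steps can be sketched with the standard javax.crypto.Mac API, as below. This is an illustration under assumptions rather than the proposed implementation: the class and method names are hypothetical, the two string parameters of verify stand in for the X-Connect-Request-Signature-Algorithm and X-Connect-Authorization header values, and both the constant-time comparison and the choice to check the permitted list before computing the signature are choices made for this sketch, not requirements stated in this KIP.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of request signing (follower) and verification (leader).
final class RequestSignatureSketch {

    private static byte[] computeSignature(byte[] sessionKey, String algorithm, byte[] body) {
        try {
            Mac mac = Mac.getInstance(algorithm);
            mac.init(new SecretKeySpec(sessionKey, algorithm));
            return mac.doFinal(body);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to sign with " + algorithm, e);
        }
    }

    // Follower side: the returned value would be sent as X-Connect-Authorization.
    static String sign(byte[] sessionKey, String algorithm, byte[] body) {
        return Base64.getEncoder().encodeToString(computeSignature(sessionKey, algorithm, body));
    }

    // Leader side: reject if the algorithm is not permitted or the signature differs.
    static boolean verify(byte[] sessionKey, List<String> permittedAlgorithms,
                          String algorithmHeader, String authorizationHeader, byte[] body) {
        if (algorithmHeader == null || authorizationHeader == null
                || !permittedAlgorithms.contains(algorithmHeader)) {
            return false;
        }
        byte[] provided;
        try {
            provided = Base64.getDecoder().decode(authorizationHeader);
        } catch (IllegalArgumentException e) {
            return false; // header was not valid base 64
        }
        byte[] expected = computeSignature(sessionKey, algorithmHeader, body);
        // Assumption for this sketch: compare in constant time.
        return MessageDigest.isEqual(expected, provided);
    }
}
```

In this sketch a follower would call sign with the current session key and the configured signature algorithm, and the leader would call verify with the list taken from internal.request.verification.algorithms.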

The leader will only accept requests signed with the most current key. This should not cause any major problems; if a follower attempts to make a request with an expired key (which should be quite rare and only occur if the request is made during an in-progress rebalance), the initial request will fail, but will be subsequently retried after a backoff period. This backoff period should leave sufficient room for the rebalance to complete. One potential downside is that, should this occur, an error-level log message of "Failed to reconfigure connector's tasks, retrying after backoff: " followed by a stack trace will be generated. This can be mitigated by altering the log message or the generated exception to include a note that this may not be an issue if key rotation is enabled, and/or logging an info-level log message after successfully completing task reconfiguration that potentially includes a note that any above error messages related to task reconfiguration may be safely disregarded.

...