Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Update lag calculation during rebalance to use offsets managed by stores

...

Code Block
languagejava
firstline106
titleorg.apache.kafka.streams.processor.StateStore
linenumberstrue
    
    /**
     * Determines if this StateStore manages its own offsets.
     * <p>
     * If this method returns {@code true}, then offsets provided to {@link #commit(Map)} will be retrievable using
     * {@link #committedOffset(TopicPartition)}, even if the store is {@link #close() closed} and later re-opened.
     * <p>
     * If this method returns {@code false}, offsets provided to {@link #commit(Map)} will be ignored, and {@link
     * #committedOffset(TopicPartition)} will be expected to always return {@code null}.
     * <p>
     * This method is provided to enable custom StateStores to opt-in to managing their own offsets. This is highly
     * recommended, if possible, to ensure that custom StateStores provide the consistency guarantees that Kafka Streams
     * expects when operating under the {@code exactly-once} {@code processing.mode}.
     * 
     * @deprecated New implementations should always return {@code true} and manage their own offsets. In the future,
     *             this method will be removed and it will be assumed to always return {@code true}.
     * @return Whether this StateStore manages its own offsets.
     */
    @Deprecated
    default boolean managesOffsets() {
        return false;
    }

    /**
     * Flush any cached data
     * 
     * @deprecated Use {@link org.apache.kafka.streams.processor.api.ProcessingContext#commit() ProcessorContext#commit()}
     *             instead.
     */
    @Deprecated
    default void flush() {
        // no-op
    }

    /**
     * Commit all written records to this StateStore.
     * <p>
     * This method <b>CANNOT<b> be called by users from {@link org.apache.kafka.streams.processor.api.Processor
     * processors}. Doing so will throw an {@link java.lang.UnsupportedOperationException}.
     * <p>
     * Instead, users should call {@link org.apache.kafka.streams.processor.api.ProcessingContext#commit() 
     * ProcessorContext#commit()} to request a Task commit.
     * <p>
     * If {@link #managesOffsets()} returns {@code true}, the given {@code changelogOffsets} will be guaranteed to be 
     * persisted to disk along with the written records.
     * <p>
     * {@code changelogOffsets} will usually contain a single partition, in the case of a regular StateStore. However,
     * they may contain multiple partitions in the case of a Global StateStore with multiple partitions. All provided
     * partitions <em>MUST</em> be persisted to disk.
     * <p>
     * Implementations <em>SHOULD</em> ensure that {@code changelogOffsets} are committed to disk atomically with the
     * records they represent, if possible.
     * 
     * @param changelogOffsets The changelog offset(s) corresponding to the most recently written records.
     */
    default void commit(final Map<TopicPartition, Long> changelogOffsets) {
        flush();
    }

    /**
     * Returns the most recently {@link #commit(Map) committed} offset for the given {@link TopicPartition}.
     * <p>
     * If {@link #managesOffsets()} and {@link #persistent()} both return {@code true}, this method will return the
     * offset that corresponds to the changelog record most recently written to this store, for the given {@code
     * partition}.
     * 
     * @param partition The partition to get the committed offset for.
     * @return The last {@link #commit(Map) committed} offset for the {@code partition}; or {@code null} if no offset
     *         has been committed for the partition, or if either {@link #persistent()} or {@link #managesOffsets()}
     *         return {@code false}.
     */
    default Long committedOffset(final TopicPartition partition) {
        return null;
    }

...

When determining the partition assignment, StreamsPartitionsAssignor considers the checkpoint offsets on-disk for all StateStores on each instance, even for Tasks they have not (yet) been assigned. This is done via TaskManager#getTaskOffsetSums(), which, currently, directly reads from the per-Task .checkpoint file.

With our new APISince stores will now be managing their offsets internally, we will need to continue to ensure that this file contains the offsets of stores that are closed, or have not yet been initialized. We will do this under EOS by updating the .checkpoint file whenever a store is close()d. Note: stores that do not manage their own offsets will continue to also update this file on-commit (provided at least 10,000 records have been written).

Compatibility, Deprecation, and Migration Plan

Kafka Streams will automatically migrate offsets found in an existing .checkpoint file, and/or an existing .position file, to store those offsets directly in the StateStore, if managesOffsets returns true. Users of the in-built store types will not need to make any changes. See Upgrading.

flush deprecation

read offsets from the stores themselves instead of the .checkpoint file.

  • On start-up, we will construct and initialize state stores for every corresponding Task store directory we find on-disk, and call committedOffset to determine their stored offset(s). We will then cache these offsets in-memory and close() these stores. This cache will be shared among all StreamThreads, and will therefore be thread-safe.
  • On rebalance, we will consult this cache to compute our Task offset lags.
  • After state restore completes for an assigned Task, we will update the offset cache to Task.LATEST_OFFSET, to indicate that the Task now has the currently latest offset.
  • When closing a StateStore, we will update the offset cache with the current changelog offset for the store.
    • This will ensure that when a Task is reassigned to another instance, the Task lag for the local state matches what's on-disk, instead of using the sentinel value Task.LATEST_OFFSET, which is only valid for Tasks currently assigned to a thread on the local instance.

We will conduct performance testing for this strategy, and if we find that initializing and closing many on-disk stores on startup is prohibitively expensive, we will parallelize this process by forking a new thread for each Task directory.

Compatibility, Deprecation, and Migration Plan

Kafka Streams will automatically migrate offsets found in an existing .checkpoint file, and/or an existing .position file, to store those offsets directly in the StateStore, if managesOffsets returns true. Users of the in-built store types will not need to make any changes. See Upgrading.

flush deprecation

All internal usage of the StateStore#flush method has been removed/replaced. Therefore, the main concern is with end-users calling StateStore#flush directly from custom Processors. Obviously, users cannot safely call StateStore#commit, because they do not know the changelog offsets that correspond to the locally written All internal usage of the StateStore#flush method has been removed/replaced. Therefore, the main concern is with end-users calling StateStore#flush directly from custom Processors. Obviously, users cannot safely call StateStore#commit, because they do not know the changelog offsets that correspond to the locally written state. Forcibly flushing/fsyncing recently written records to disk may also violate any transactional guarantees that StateStores are providing. Therefore, the default implementation of flush has been updated to a no-op. Users are now advised (via JavaDoc) that should instead request an early commit via. ProcessingContext#commit().

...

An alternative approach was suggested that we force-flush, not on every commit, but only on some commits, like we do under ALOS. This would be configured by a new configuration, for example statestore.flush.interval.ms  and/or statestore.flush.interval.records. However, any default that we chose for these configurations would be arbitrary, and could result, under EOS, in more flushing than we do today. For some users, the defaults would be fine, and would likely have no impact. But for others, the defaults could be ill-suited to their workload, and could cause a performance regression on-upgrade.

We instead chose to take the opportunity to solve this with a more comprehensive set of changes to the StateStore API , that should have additional benefits.

StreamsPartitionsAssignor to query stores for offsets

Instead of having TaskManager#getOffsetSums()  read from the .checkpoint  file directly, we originally intended to have it call StateStore#committedOffset() on each store, and make the StateStore responsible for tracking the offset for stores even when not initialized.

However, this is not possible, because a StateStore  does not know its TaskId , and hence cannot determine its on-disk StateDirectory  until after StateStore#init() has been called. We could have added a StateStoreContext  argument to committedOffset() , but we decided against it, because doing so would make the API less clear, and correctly implementing this method would be more difficult for StateStore implementation maintainers.

configurations would be arbitrary, and could result, under EOS, in more flushing than we do today. For some users, the defaults would be fine, and would likely have no impact. But for others, the defaults could be ill-suited to their workload, and could cause a performance regression on-upgrade.

We instead chose to take the opportunity to solve this with a more comprehensive set of changes to the StateStore API , that should have additional benefitsInstead, we revert to the existing behaviour, and have the Streams engine continue to maintain the offsets for closed stores in the .checkpoint file.