...

Current state: Under Discussion
Discussion thread: here
JIRA: KAFKA-3705

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Note: This KIP contains two major sections - a previous attempt at resolving this problem was worked on by Jan. The current proposal is at the top, with Jan's portion preserved at the end of the document.

Motivation

Close the gap between the semantics of KTables in streams and tables in relational databases. It is common practice to capture changes as they are made to tables in an RDBMS into Kafka topics (JDBC-connect, Debezium, Maxwell). These entities typically have multiple one-to-many relationships. RDBMSs usually offer good support for resolving these relationships with a join. Streams falls short here, and the workaround (group by - join - lateral view) is not well supported either, nor is it in line with the idea of record-based processing.

...

A custom partitioner is used to ensure that the CombinedKey this-table data is correctly copartitioned with the other-table data. This is a simple operation: it extracts the foreign key and applies the partitioner logic. It is important that the same partitioner that is used to partition the other table is also used to partition the rekeyed this-table data.
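As an illustration, here is a minimal sketch of such a partitioner, assuming a hypothetical CombinedKey<KF, KP> wrapper with a getForeignKey() accessor and the StreamPartitioner signature of recent Kafka versions. It mirrors the murmur2-based logic of the default producer partitioner so that the co-partitioning requirement holds; it is a sketch, not the implementation.

Code Block: CombinedKeyPartitioner.java (illustrative sketch, Java)
import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.common.utils.Utils;
import org.apache.kafka.streams.processor.StreamPartitioner;

// Sketch only: CombinedKey<KF, KP> is the hypothetical wrapper described in this KIP.
public class CombinedKeyPartitioner<KF, KP, V> implements StreamPartitioner<CombinedKey<KF, KP>, V> {

    private final Serializer<KF> foreignKeySerializer;

    public CombinedKeyPartitioner(Serializer<KF> foreignKeySerializer) {
        this.foreignKeySerializer = foreignKeySerializer;
    }

    @Override
    public Integer partition(String topic, CombinedKey<KF, KP> key, V value, int numPartitions) {
        // Partition on the foreign key alone, using the same murmur2-based logic as the
        // default producer partitioner, so the rekeyed record lands in the same partition
        // as the other-table record with that key.
        byte[] keyBytes = foreignKeySerializer.serialize(topic, key.getForeignKey());
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}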

This/Event Processor Behaviour

  • CombinedKey data goes to a repartition topic copartitioned with the Other/Entity data.
  • The primary key in CombinedKey is preserved for downstream usage.

(Gliffy diagram: LeftProcessingWithCombinedKey)

...

Other/Entity PrefixScan Processor Behaviour

  • Requires the specific serialized format detailed above (a range-scan sketch follows below)
  • Requires a RocksDB instance storing all of the This/Event data

(Gliffy diagram: RangeScanCombinedKeyUsage)
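To make the range-scan mechanics concrete, here is a hedged sketch (the helper name and surrounding plumbing are illustrative, not from the implementation) of how a processor could iterate all entries sharing a foreign-key prefix. It works only because the serialized format above puts the foreign key first, so matching entries are contiguous in RocksDB.

Code Block: PrefixScan.java (illustrative sketch, Java)
import java.util.Arrays;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public final class PrefixScan {

    // Iterates over every entry whose serialized key starts with the given foreign-key prefix.
    static void forEachWithPrefix(KeyValueStore<Bytes, byte[]> store, byte[] prefix) {
        // Upper bound: incrementing the last byte of the prefix gives the smallest byte
        // string greater than all keys sharing the prefix (a simplification that ignores
        // trailing 0xFF bytes, which would need carrying).
        byte[] upper = Arrays.copyOf(prefix, prefix.length);
        upper[upper.length - 1]++;
        try (KeyValueIterator<Bytes, byte[]> iter =
                 store.range(Bytes.wrap(prefix), Bytes.wrap(upper))) {
            while (iter.hasNext()) {
                KeyValue<Bytes, byte[]> entry = iter.next();
                // ... deserialize the entry, perform the join, and forward the result
            }
        }
    }
}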

Problem: Out-of-order processing of Rekeyed data

...

(Gliffy diagram: OutOfOrder Problem)

This race condition is especially visible when multiple threads are used to process the Kafka topic partitions. A given thread on a given node may process its records much sooner or much later than the other threads, due to load, network, polling cycles, and a variety of other causes; there is no guarantee on the order in which the messages arrive. All things being equal, the "null" message is just as likely to arrive as the final result as the correctly updated message. The issue is only further compounded if the foreign key is changed several times in short succession, across multiple partitions.

...

  • There is no impact to existing users.


Rejected Alternatives:

Problem: Out-of-order processing of Rekeyed data

...

Solution A - Hold Ordering Metadata in Record Headers and Highwater Mark Table

...


(Gliffy diagram: OutOfOrderResolution-RecordHeaders, version 5)

Since the final out-of-order data is sourced from a topic, the only way to ensure that downstream KTables can query their parent's ValueGetter is to materialize the final state store. There is no way to get specific values directly from a topic source; a Materialized store is required to provide statefulness to data stored in a topic (see KTableSource). In this case, a user-provided Materialized store would be mandatory. The workflow would look like this:

...

GroupBy + Reduce / Aggregate

The following design does not take into account the order in which events may arrive from the various threads. The easiest way to trigger the issue is to rapidly change the foreign key of a record, which has different stream threads operating asynchronously on the records and returning them in a non-deterministic order. GroupBy and reduce simply process the records in the order they arrive. If your foreign key changes rarely or not at all, this may not be a problem, but since we want a comprehensive solution for all data patterns, this alternative is considered rejected.
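For concreteness, here is a minimal sketch of this rejected pattern against the existing public API, assuming hypothetical topics "artists" and "songs" with String serdes and the foreign key embedded in the song value. It is an illustration of the pattern, not code from the KIP.

Code Block: RejectedGroupByJoin.java (illustrative sketch, Java)
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class RejectedGroupByJoin {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // "this" table: artistId -> artist payload
        KTable<String, String> artists =
            builder.table("artists", Consumed.with(Serdes.String(), Serdes.String()));
        // "other" table: songId -> "artistId|payload" (foreign key embedded in the value)
        KTable<String, String> songs =
            builder.table("songs", Consumed.with(Serdes.String(), Serdes.String()));

        // Re-group the many-side by its foreign key, collecting all songs per artist.
        // The adder and subtractor see records in arrival order, which is exactly where
        // the out-of-order artifacts described above creep in.
        KTable<String, String> songsByArtist = songs
            .groupBy((songId, song) -> KeyValue.pair(song.split("\\|")[0], song),
                     Grouped.with(Serdes.String(), Serdes.String()))
            .aggregate(
                () -> "",
                (artistId, song, agg) -> agg + song + ";",            // adder
                (artistId, song, agg) -> agg.replace(song + ";", ""), // subtractor
                Materialized.with(Serdes.String(), Serdes.String()));

        // A plain table-table join on the artist key then yields the one-to-many result.
        KTable<String, String> joined =
            artists.join(songsByArtist, (artist, songList) -> artist + " -> " + songList);
    }
}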

(Gliffy diagram: RejectedGroupByReduce)

...


Rejected because it requires an additional materialized table, and does not provide any significant benefit.

Solution B - User-Managed GroupBy (Jan's)

A KTable<CombinedKey<A,B>, JoinedResult> is not a good return type. It breaks the KTable invariant that a table is partitioned by its key, which this table would not be, and the CombinedKey is not particularly useful, as it is a mere Kafka artifact.

User-Managed GroupBy

With a follow-up group-by, we can remove the repartitioning artifact by grouping into a map. Out-of-order events can be held in the map and dealt with however one likes: either wait for some final state and propagate no "intermediate" changes that show artifacts, or propagate directly. Eventual correctness is guaranteed either way. A further huge advantage is that the group-by can be by any key, resulting in a table keyed by that key.

(Gliffy diagram: OutOfOrderResolution-RecordHeaders)





Jan Filipiak's Original Proposal

...

(From here to end of document)

Public Interfaces

Less intrusive

We would introduce a new method into KTable and KTableImpl:

Code Block: KTable.java (Java)
    /**
     * Joins one record of this KTable to n records of the other KTable;
     * an update in this KTable will update all n matching records, an update
     * in the other table will update only the one matching record.
     *
     * @param other the table containing n records for each K of this table
     * @param keyExtractor a {@code ValueMapper} returning the key of this table from the other's value
     * @param joinPrefixFaker a {@code ValueMapper} returning an output key that, when serialized, only produces the
     *                        prefix of the output key, which is the same as serializing K
     * @param leftKeyExtractor a {@code ValueMapper} extracting the key of this table from the resulting key
     * @param joiner a {@code ValueJoiner} combining both tables' values into the result value
     * @param <KO> the resulting table's key type
     * @param <VO> the resulting table's value type
     * @return a KTable keyed by KO, containing the join result
     */
    <KO, VO, K1, V1> KTable<KO, VO> oneToManyJoin(KTable<K1, V1> other,
            ValueMapper<V1, K> keyExtractor,
            ValueMapper<K, KO> joinPrefixFaker,
            ValueMapper<KO, K> leftKeyExtractor,
            ValueJoiner<V, V1, VO> joiner,
            Serde<K1> keyOtherSerde, Serde<V1> valueOtherSerde,
            Serde<KO> joinKeySerde, Serde<VO> joinValueSerde);
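A hypothetical usage sketch of this proposed signature follows (the method does not exist yet; the topics, keys, and separator are illustrative). String keys are used, with the output key "<thisKey>#<otherKey>" so that serializing the this-key prefix yields a byte-prefix of the serialized output key:

Code Block: usage sketch (illustrative, Java)
// Fragment only: albums and tracks are assumed to be existing KTables.
KTable<String, String> albums = null;  // this table, keyed by albumId
KTable<String, String> tracks = null;  // other table, value is "albumId|payload"

KTable<String, String> enriched = albums.oneToManyJoin(
    tracks,
    track -> track.split("\\|")[0],          // keyExtractor: other's value -> this key
    albumId -> albumId + "#",                // joinPrefixFaker: this key -> output-key prefix
    outputKey -> outputKey.split("#")[0],    // leftKeyExtractor: output key -> this key
    (album, track) -> album + "|" + track,   // joiner
    Serdes.String(), Serdes.String(),        // other table's key/value serdes
    Serdes.String(), Serdes.String());       // output key/value serdes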

More intrusive

We would introduce a new complex type: CombinedKey

...

Code Block: KTable.java (Java)
    /**
     * Joins one record of this KTable to n records of the other KTable;
     * an update in this KTable will update all n matching records, an update
     * in the other table will update only the one matching record.
     *
     * @param other the table containing n records for each K of this table
     * @param keyExtractor a {@code ValueMapper} returning the key of this table from the other's value
     * @param joiner a {@code ValueJoiner} combining both tables' values into the result value
     * @param <VO> the resulting table's value type
     * @return a KTable keyed by {@code CombinedKey<K, K1>}, containing the join result
     */
    <VO, K1, V1> KTable<CombinedKey<K, K1>, VO> oneToManyJoin(KTable<K1, V1> other,
            ValueMapper<V1, K> keyExtractor,
            ValueJoiner<V, V1, VO> joiner,
            Serde<K1> keyOtherSerde, Serde<V1> valueOtherSerde,
            Serde<VO> joinValueSerde);

Tradeoffs

The more intrusive version makes it clearer to the user that the resulting KTable is keyed not only by the other table's key but also by this table's key, so they will be less surprised to find the same key from the other KTable twice in a theoretical later aggregation. On the other hand, the less intrusive method does not need to introduce this wrapper class, but lets the user handle the need of having both tables' keys present in the output key themselves. This might lead to a deeper understanding for the user, and serdes might be able to pack the data more densely. An additional benefit is that the user can stick with their default serde or their standard way of serializing when sinking the data into another topic, for example using to(), whereas the CombinedKey would require an additional mapping compared to the less intrusive method.

Back and forth mapper

This is a proposal to get rid of the CombinedKey type in the return type. We would internally use a CombinedKey and a CombinedKey serde, and apply the mappers only at the processing boundaries (ValueGetterSupplier, context.forward). The data will still be serialized for repartitioning in a way that is specific to Kafka and might prevent users from using their default tooling with these topics.

Code Block: KTable.java (Java)
    /**
     * Joins one record of this KTable to n records of the other KTable;
     * an update in this KTable will update all n matching records, an update
     * in the other table will update only the one matching record.
     *
     * @param other the table containing n records for each K of this table
     * @param keyExtractor a {@code ValueMapper} returning the key of this table from the other's value
     * @param outputKeyCombiner a {@code ValueMapper} allowing the CombinedKey to be wrapped in a custom object
     * @param outputKeySplitter a {@code ValueMapper} allowing the custom object to be unwrapped again
     * @param joiner a {@code ValueJoiner} combining both tables' values into the result value
     * @param <VO> the resulting table's value type
     * @return a KTable keyed by KO, containing the join result
     */
    <KO, VO, K1, V1> KTable<KO, VO> oneToManyJoin(KTable<K1, V1> other,
            ValueMapper<V1, K> keyExtractor,
            ValueMapper<CombinedKey<K1, K>, KO> outputKeyCombiner,
            ValueMapper<KO, CombinedKey<K1, K>> outputKeySplitter,
            ValueJoiner<V, V1, VO> joiner,
            Serde<K1> keyOtherSerde,
            Serde<V1> valueOtherSerde,
            Serde<VO> joinValueSerde);

Custom Serde

Introducing an additional new serde. This approach is the counterpart to having a back-and-forth mapper: it keeps any Kafka-specific serialization mechanism off the wire. How to serialize is completely up to the user.

...

Code Block: KTable.java (Java)
    /**
     * Joins one record of this KTable to n records of the other KTable;
     * an update in this KTable will update all n matching records, an update
     * in the other table will update only the one matching record.
     *
     * @param other the table containing n records for each K of this table
     * @param keyExtractor a {@code ValueMapper} returning the key of this table from the other's value
     * @param joiner a {@code ValueJoiner} combining both tables' values into the result value
     * @param combinedKeySerde a serde for the CombinedKey, also able to serialize the key prefix
     * @param <VO> the resulting table's value type
     * @return a KTable keyed by {@code CombinedKey<K, K1>}, containing the join result
     */
    <VO, K1, V1> KTable<CombinedKey<K, K1>, VO> oneToManyJoin(KTable<K1, V1> other,
            ValueMapper<V1, K> keyExtractor,
            ValueJoiner<V, V1, VO> joiner,
            Serde<K1> keyOtherSerde, Serde<V1> valueOtherSerde,
            Serde<VO> joinValueSerde,
            CombinedKeySerde<K, K1> combinedKeySerde);

Streams

We will implement a default CombinedKeySerde that uses a regular length encoding for both fields, so calls to the "intrusive approach" would construct a default CombinedKeySerde and invoke the serde overload. This works with all serde frameworks if the user is not interested in how the data is serialized in the topics.
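A minimal sketch of what such a length-encoded serializer could look like, assuming the hypothetical CombinedKey<KF, KP> whose primary-key part is null when only the prefix is serialized. The prefix form shares the same leading bytes as the full form, which is the invariant the range scan relies on:

Code Block: LengthEncodedCombinedKeySerializer.java (illustrative sketch, Java)
import java.nio.ByteBuffer;
import org.apache.kafka.common.serialization.Serializer;

// Sketch only: CombinedKey<KF, KP> is the hypothetical wrapper from this KIP.
public class LengthEncodedCombinedKeySerializer<KF, KP> implements Serializer<CombinedKey<KF, KP>> {

    private final Serializer<KF> foreignSerializer;
    private final Serializer<KP> primarySerializer;

    public LengthEncodedCombinedKeySerializer(Serializer<KF> foreignSerializer,
                                              Serializer<KP> primarySerializer) {
        this.foreignSerializer = foreignSerializer;
        this.primarySerializer = primarySerializer;
    }

    @Override
    public byte[] serialize(String topic, CombinedKey<KF, KP> key) {
        byte[] fk = foreignSerializer.serialize(topic, key.getForeignKey());
        if (key.getPrimaryKey() == null) {
            // Prefix form: [len(fk)][fk], a true byte-prefix of the full form below.
            return ByteBuffer.allocate(4 + fk.length).putInt(fk.length).put(fk).array();
        }
        byte[] pk = primarySerializer.serialize(topic, key.getPrimaryKey());
        // Full form: [len(fk)][fk][len(pk)][pk].
        return ByteBuffer.allocate(4 + fk.length + 4 + pk.length)
                .putInt(fk.length).put(fk)
                .putInt(pk.length).put(pk)
                .array();
    }
}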

Protobuf / Avro / thrift / Hadoop-Writeable / Custom

Users of these frameworks should have a very easy time implementing a CombinedKeySerde. Essentially, they define an object that wraps K and K1 as usual, keeping K1 as an optional field. The serializer returned from getPartialKeySerializer() would do the following:

...

This should be straightforward, and users might implement a CombinedKeySerde that is specific to their framework, reusing the logic without implementing a new serde for each key pair.

JSON

Implementing a CombinedKeySerde with JSON depends on the specific framework. A full key would look like this: { "a": { "key": "a1" }, "b": { "key": "b5" } }. To generate a true prefix, one would have to generate { "a": { "key": "a1", which is not valid JSON. This invalid JSON will not leave the JVM, but it might be more or less tricky to implement a serializer generating it. Maybe we could provide users with a utility method to make sure their serde satisfies our invariants.
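Such a utility could be as simple as the following hedged sketch, assuming the CombinedKeySerde interface with getPartialKeySerializer() described above and a CombinedKey constructor taking a nullable primary-key part (both hypothetical at this point):

Code Block: prefix invariant check (illustrative sketch, Java)
// Checks that the partial serializer's output is a byte-prefix of the full key's
// serialization, which the range scan depends on.
static <KF, KP> boolean satisfiesPrefixInvariant(CombinedKeySerde<KF, KP> serde,
                                                 String topic, KF foreignKey, KP primaryKey) {
    byte[] full = serde.serializer()
                       .serialize(topic, new CombinedKey<>(foreignKey, primaryKey));
    byte[] prefix = serde.getPartialKeySerializer()
                         .serialize(topic, new CombinedKey<>(foreignKey, null));
    if (prefix.length > full.length) {
        return false;
    }
    for (int i = 0; i < prefix.length; i++) {
        if (prefix[i] != full[i]) {
            return false;
        }
    }
    return true;
}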


Proposed Changes

Goal

Given the two relations A and B, where there is one A for each B and there may be many B's for each A: A is represented by the KTable on which the method described above is invoked, while B is represented by that method's first argument. We want to implement a set of processors that allows a user to create a new KTable where A and B are joined based on the reference to A in B. A and B are each represented as a KTable, B being partitioned by B's key and A being partitioned by A's key.

Algorithm


(ASCII diagram: ascii_raw_KIP-213)


  • Enable sendOldValues() on the sources with "*"
  • Register a child of B (a sketch of this step follows the list)
    • extract A's key and B's key as the key, and B as the value
      • forward(key, null) for the old value
      • forward(key, b) for the new value
      • skip the old value if A's key didn't change (it ends up in the same partition)
  • Register a sink for the internal repartition topic (number of partitions equal to A's; if A is internal, prefer B over A for deciding the number of partitions)
    • in the sink, use only A's key to determine the partition
  • Register a source for the intermediate topic
    • co-partition with A's sources
    • materialize
    • the serde for RocksDB needs to serialize A before B; ideally the same serde is also used for the topic
  • Register a processor after the above source
    • on event, extract A's key from the key
    • look up A by its key
    • perform the join (as usual)
  • Register a processor after A's processor
    • on event, use A's key to perform a range scan on B's materialization
    • for every row retrieved, perform the join as usual
  • Register a merger
    • forward join results
    • on lookup, use the full key to look up B, extract A's key from the key and look up A, then perform the join
  • The merger is wrapped into a KTable and returned to the user.
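A hedged sketch of the "child of B" re-keying step above (not the actual implementation; CombinedKey is the hypothetical wrapper, Change is the internal old/new wrapper, and keyExtractor and context are assumed fields of the enclosing processor):

Code Block: re-keying step (illustrative sketch, Java)
// On each change of B, re-key to (aKey, bKey) and emit old and new values so the
// downstream join can retract before it asserts.
private void processBChange(KB bKey, Change<VB> change) {
    KA oldA = change.oldValue == null ? null : keyExtractor.apply(change.oldValue);
    KA newA = change.newValue == null ? null : keyExtractor.apply(change.newValue);

    if (oldA != null && !oldA.equals(newA)) {
        // A's key changed or B was deleted: retract from the old partition.
        context.forward(new CombinedKey<>(oldA, bKey), null);
    }
    if (newA != null) {
        // Assert the new value; if A's key is unchanged, this alone suffices,
        // because the record lands in the same partition as the retraction would.
        context.forward(new CombinedKey<>(newA, bKey), change.newValue);
    }
}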

Step by Step

Each step below lists the topology input, the materialized state after the update, any intermediate (repartition) records produced, and the outputs of the range and lookup processors.

Step 1. Input A: key A0, value [A0 ...]
  • State A materialized: A0 → [A0 ...]
  • Output of A's source (input to the range processor): Change<null, [A0 ...]>
  • Range processor: invoked, but nothing found; nothing forwarded.

Step 2. Input A: key A1, value [A1 ...]
  • State A materialized: A0 → [A0 ...], A1 → [A1 ...]
  • Output of A's source: Change<null, [A1 ...]>
  • Range processor: invoked, but nothing found; nothing forwarded.

Step 3. Input B: key B0, value [A2,B0 ...]
  • State A materialized: A0 → [A0 ...], A1 → [A1 ...]
  • State B materialized: B0 → [A2,B0 ...]
  • Intermediate record produced: partition key A2, key A2B0, value [A2,B0 ...]
  • State B (other task): A2B0 → [A2,B0 ...]
  • Lookup processor: invoked, but nothing found; nothing forwarded.

Step 4. Input B: key B1, value [A2,B1 ...]
  • State A materialized: A0 → [A0 ...], A1 → [A1 ...]
  • State B materialized: B0 → [A2,B0 ...], B1 → [A2,B1 ...]
  • Intermediate record produced: partition key A2, key A2B1, value [A2,B1 ...]
  • State B (other task): A2B0 → [A2,B0 ...], A2B1 → [A2,B1 ...]
  • Lookup processor: invoked, but nothing found; nothing forwarded.

Step 5. Input A: key A2, value [A2 ...]
  • State A materialized: A0 → [A0 ...], A1 → [A1 ...], A2 → [A2 ...]
  • State B materialized: B0 → [A2,B0 ...], B1 → [A2,B1 ...]
  • State B (other task): A2B0 → [A2,B0 ...], A2B1 → [A2,B1 ...]
  • Output of A's source: Change<null, [A2 ...]>
  • Range processor output:
    • key A2B0, value Change<null, join([A2 ...], [A2,B0 ...])>
    • key A2B1, value Change<null, join([A2 ...], [A2,B1 ...])>

Step 6. Input B: key B1, value null (delete)
  • State B materialized: B0 → [A2,B0 ...] (B1 removed)
  • Intermediate record produced: partition key A2, key A2B1, value null
  • State B (other task): A2B0 → [A2,B0 ...] (A2B1 removed)
  • Lookup processor output: key A2B1, value Change<join([A2 ...], [A2,B1 ...]), null>

Step 7. Input B: key B3, value [A0,B3 ...]
  • State B materialized: B0 → [A2,B0 ...], B3 → [A0,B3 ...]
  • Intermediate record produced: partition key A0, key A0B3, value [A0,B3 ...]
  • State B (other task): A2B0 → [A2,B0 ...], A0B3 → [A0,B3 ...]
  • Lookup processor output: key A0B3, value Change<null, join([A0 ...], [A0,B3 ...])>

Step 8. Input A: key A2, value null (delete)
  • State A materialized: A0 → [A0 ...], A1 → [A1 ...] (A2 removed)
  • State B materialized: B0 → [A2,B0 ...], B3 → [A0,B3 ...]
  • State B (other task): A2B0 → [A2,B0 ...], A0B3 → [A0,B3 ...]
  • Output of A's source: Change<[A2 ...], null>
  • Range processor output: key A2B0, value Change<join([A2 ...], [A2,B0 ...]), null>

 

Range lookup

It is pretty straightforward to completely flush all changes that happened before the range lookup into RocksDB and let it handle the range scan. Merging RocksDB's result iterator with the current in-heap caches might not be in scope of this initial KIP. Currently we at trivago cannot identify the RocksDB flushes as a performance problem; usually the amount of emitted records is the harder problem to deal with in the first place.

Missing reference to A

B records with a 'null' A-key value would be silently dropped.

Compatibility, Deprecation, and Migration Plan

  • There is no impact to existing users.

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

...