This document is intended to summarize various use cases for punctuate.
Terminology
Before going into the details let us first clarify the terms and their meanings for the purpose of this discussion.
Term | Description |
---|---|
Stream Time | KafkaStreams has defined an interface called TimestampExtractor which can be implemented to extract the Event Time from the event contents, the ingestion time, the processing/wall-clock time, or any other logical time. Stream Time for us would be the value returned by the TimestampExtractor implementation in use. |
Punctuate Time | This is the reference time used to trigger the Punctuate calls. It could be Stream Time [current semantics], System Time, or Stream Time with an expiry timer based on System Time (hybrid or complex punctuate). |
Punctuation Timestamp | The reference timestamp passed to punctuate method being called. It could be Stream Time of the event that triggered the punctuate [current semantics], System Time in case of System Time based punctuate, or a PunctuationTime object specifying the details in case of a hybrid/complex punctuate. |
Output Record Time | This is the output record timestamp for the events generated in the punctuate method. Currently, it is the internal time of the Stream Task which is the Stream Time. |
Punctuate Use Cases
Use Case 1
Quoting the words from Jay Kreps' mail correspondence.
"You aggregate click and impression data for a reddit like site. Every ten minutes you want to output a ranked list of the top 10 articles ranked by clicks/impressions for each geographical area. I want to be able run this in steady state as well as rerun to regenerate results (or catch up if it crashes)."
Behaviour | Requirement |
---|---|
Deterministic output (reprocessing or delayed processing possible) | |
Output in the absence of events |
Use Case 2
In an Event Count Audit system, we want to count the events produced every minute from the producers, recount the events at the first cluster, and at the subsequent mirror clusters at a per minute per producer granularity. We need to account for some late events and also the out of order event times in the same topic.
Behaviour | Requirement |
---|---|
Deterministic output (reprocessing or delayed processing possible) | |
Output in the absence of events |
Evaluation Matrix
In the currently listed use cases the Stream time is the Event Time.
Behaviour | Stream Time based Punctuate | System Time based Punctuate | Stream Time based Punctuate with System Time based secondary expiry |
---|---|---|---|
Deterministic output (reprocessing or delayed processing possible) | |||
Output in the absence of events | |||
Join of same time events across different topics | |||
Correct punctuation in the presence of long queueing | |||
Alternate Representation
Design Semantics | Use Case 1 | Use Case 2 | |
---|---|---|---|
Stream Time based Punctuate | Pros | Deterministic output (reprocessing or delayed processing doesn't change the output) | Deterministic output (reprocessing or delayed processing doesn't change the output) |
Cons | Low event geographic areas might not generate a list every 10 minutes. | Event count output wont happen until an event with event time crossing the punctuate interval arrives. | |
System Time based Punctuate | Pros | Rank list output every 10 minutes, irrespective of event flow. | Count flushes in the absence of events |
Cons | Output during a reprocessing won't be same. | Late event will cause another aggregate event during steady state operation. | |
Stream Time based Punctuate with System Time based secondary expiry | Pros | Deterministic output (reprocessing or delayed processing doesn't change the output) Low event geographic areas will generate a list as per the expiry timer. | Deterministic output (reprocessing or delayed processing doesn't change the output) |
Cons | ? | In case of similar scenarios where processing time is more leading to queueing and late processing, the system timer may expire before the last even within the window arrives. | |