...
How it works:
Each upstream task sends control messages to every partition of the downstream intermediate topic. Control messages are serialized and sent in the same stream as user messages.
The downstream Samza processor consumes the intermediate streams and deserializes both user messages and control messages in SystemConsumers.
Control messages are reconciled based on the count of upstream producers (tasks). See below for details on how each type of control message is reconciled.
Intermediate Stream Message Format:
The format of the intermediate stream message:
```
IntermediateMessage => [MessageType MessageData]
  MessageType => byte
  MessageData => byte[]

  MessageType => [0(UserMessage), 1(Watermark), 2(EndOfStream)]
  MessageData => [UserMessage/ControlMessage]
  ControlMessage => [EndOfStreamMessage/WatermarkMessage]
    Version => int
    TaskName => String
    TaskCount => int
    Other Message Data (based on different types of control message)
```
For user messages, we use the user-provided serde (by default, the system serde). For control messages, we use the JSON serde, since it is built into Samza and easy to parse.
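As a rough illustration of the framing described above, the sketch below prepends the one-byte MessageType to an already-serialized payload and reads it back on the consumer side. The class `IntermediateMessageCodec` and its method names are hypothetical and not part of Samza; the payload is assumed to have been produced by the user serde or the JSON serde beforehand.

```java
import java.util.Arrays;

// Hypothetical helper (not a Samza API): frames a payload as [MessageType][MessageData].
final class IntermediateMessageCodec {
    // Type tags as listed in the intermediate message format.
    static final byte USER_MESSAGE = 0;
    static final byte WATERMARK = 1;
    static final byte END_OF_STREAM = 2;

    private IntermediateMessageCodec() { }

    /** Prepends the one-byte type tag to the already-serialized payload. */
    static byte[] encode(byte type, byte[] data) {
        byte[] out = new byte[data.length + 1];
        out[0] = type;
        System.arraycopy(data, 0, out, 1, data.length);
        return out;
    }

    /** Reads the type tag so the caller can choose the user serde or the JSON serde. */
    static byte typeOf(byte[] message) {
        if (message.length == 0) {
            throw new IllegalArgumentException("empty intermediate message");
        }
        byte type = message[0];
        if (type < USER_MESSAGE || type > END_OF_STREAM) {
            throw new IllegalArgumentException("unknown message type: " + type);
        }
        return type;
    }

    /** Returns the payload bytes that follow the type tag. */
    static byte[] dataOf(byte[] message) {
        if (message.length == 0) {
            throw new IllegalArgumentException("empty intermediate message");
        }
        return Arrays.copyOfRange(message, 1, message.length);
    }
}
```

A consumer would call `typeOf` first and then hand the result of `dataOf` to the matching serde.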
Reconciliation
The reconciliation of control messages happens inside TaskInstance after the message is delivered to it from the chooser. For the scope of this proposal, we support two kinds of control messages: end-of-stream and watermark.
- End-of-stream Message: This message indicates that the upstream task has finished producing to this stream.
- Watermark Message: This message contains a timestamp that marks how far the upstream task has processed so far.
The reconciliation process works as follows:
- The downstream TaskInstance receives the control message and updates its internal bookkeeping. For end-of-stream, it keeps the set of upstream tasks that have ended for the intermediate stream. For watermark, it keeps a mapping from each upstream task to its latest timestamp.
- Once the task count in the bookkeeping matches the total count, the TaskInstance emits a single IncomingMessageEnvelope containing the intermediate stream and partition, and the message itself. The timestamp in the watermark message will be:
  InputWatermark = min { OutputWatermark(task) for each task in upstream tasks }
- After reconciliation, the control message envelope is sent to the task for processing.
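The watermark rule above can be sketched as follows. This is a minimal illustration for a single intermediate stream, not the TaskInstance implementation: the class `WatermarkReconciler` and its method are hypothetical, and the choice to emit only when the minimum advances past the previously emitted value is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch (not a Samza API) of watermark reconciliation for one stream.
final class WatermarkReconciler {
    // Latest watermark timestamp seen from each upstream task.
    private final Map<String, Long> latestByTask = new HashMap<>();
    private final int totalTasks;
    // Assumption: remember the last emitted value so the same watermark is not emitted twice.
    private long lastEmitted = Long.MIN_VALUE;

    WatermarkReconciler(int totalTasks) {
        this.totalTasks = totalTasks;
    }

    /**
     * Records a watermark from one upstream task. Returns the reconciled input
     * watermark once every upstream task has reported and the minimum has
     * advanced; otherwise returns an empty result.
     */
    OptionalLong onWatermark(String taskName, long timestamp) {
        // Keep the larger value so a replayed, older watermark cannot move a task backwards.
        latestByTask.merge(taskName, timestamp, Math::max);
        if (latestByTask.size() < totalTasks) {
            return OptionalLong.empty();
        }
        // InputWatermark = min over the output watermarks of all upstream tasks.
        long min = Collections.min(latestByTask.values());
        if (min <= lastEmitted) {
            return OptionalLong.empty();
        }
        lastEmitted = min;
        return OptionalLong.of(min);
    }
}
```

With two upstream tasks reporting 100 and 50, the reconciled input watermark is 50; it only moves forward once the slower task reports a later timestamp.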
The TaskInstance uses the following maps for bookkeeping received end-of-stream and watermark messages:
```
EndOfStream Bookkeeping: Map( streamId -> { Set<TaskName>, totalTasks } )
Watermark Bookkeeping: Map( streamId -> { Map<TaskName, Timestamp>, totalTasks, timestampOfLastEmission } )
```
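The end-of-stream bookkeeping could look roughly like the sketch below. The class `EndOfStreamBookkeeping` and its method are hypothetical names, and the total task count is assumed to come from the TaskCount field of the control message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not a Samza API) of the end-of-stream bookkeeping map.
final class EndOfStreamBookkeeping {
    // streamId -> upstream tasks that have already sent end-of-stream.
    private final Map<String, Set<String>> endedTasks = new HashMap<>();
    // streamId -> total number of upstream tasks (assumed to come from TaskCount).
    private final Map<String, Integer> totalTasks = new HashMap<>();

    /**
     * Records an end-of-stream message. Returns true exactly once per stream:
     * when the last missing upstream task reports end-of-stream.
     */
    boolean onEndOfStream(String streamId, String taskName, int taskCount) {
        totalTasks.put(streamId, taskCount);
        Set<String> tasks = endedTasks.computeIfAbsent(streamId, k -> new HashSet<>());
        // A duplicate message from the same task does not change the set.
        boolean added = tasks.add(taskName);
        return added && tasks.size() == taskCount;
    }

    /** True if every upstream task of the stream has reported end-of-stream. */
    boolean isEnded(String streamId) {
        Integer total = totalTasks.get(streamId);
        Set<String> tasks = endedTasks.get(streamId);
        return total != null && tasks != null && tasks.size() == total;
    }
}
```

Using a set keyed by task name makes the bookkeeping tolerant of duplicate end-of-stream messages, which can occur when an upstream task restarts and resends.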
Checkpoint control messages
To handle failures, we need to persist the bookkeeping state so that it can be restored during recovery. This can be done by checkpointing the bookkeeping state along with the input message offsets.
The checkpoint for EndOfStream:
```
EndOfStreamCheckpoint =>
  streamId => String
  totalTasks => int
  tasks => Set<String>
```
The checkpoint for Watermark:
```
WatermarkCheckpoint =>
  streamId => String
  totalTasks => int
  tasksToEventTime => Map<String, Long>
```
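To make the recovery path concrete, the following sketch shows one way the watermark bookkeeping might be snapshotted into a checkpoint and rebuilt from it. Only the three checkpoint fields come from the layout above; the class `WatermarkState` and the `snapshot`/`restore` methods are hypothetical, and how the checkpoint is actually written alongside the input offsets is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical value class mirroring the WatermarkCheckpoint layout.
final class WatermarkCheckpoint {
    final String streamId;
    final int totalTasks;
    final Map<String, Long> tasksToEventTime;

    WatermarkCheckpoint(String streamId, int totalTasks, Map<String, Long> tasksToEventTime) {
        this.streamId = streamId;
        this.totalTasks = totalTasks;
        // Defensive copy so later bookkeeping updates do not alter the snapshot.
        this.tasksToEventTime = Collections.unmodifiableMap(new HashMap<>(tasksToEventTime));
    }
}

// Hypothetical in-memory bookkeeping for one intermediate stream.
final class WatermarkState {
    private final String streamId;
    private final int totalTasks;
    private final Map<String, Long> latestByTask = new HashMap<>();

    WatermarkState(String streamId, int totalTasks) {
        this.streamId = streamId;
        this.totalTasks = totalTasks;
    }

    void update(String taskName, long eventTime) {
        latestByTask.merge(taskName, eventTime, Math::max);
    }

    /** Captures the current bookkeeping so it can be stored with the input offsets. */
    WatermarkCheckpoint snapshot() {
        return new WatermarkCheckpoint(streamId, totalTasks, latestByTask);
    }

    /** Rebuilds the bookkeeping from a previously stored checkpoint. */
    static WatermarkState restore(WatermarkCheckpoint checkpoint) {
        WatermarkState state = new WatermarkState(checkpoint.streamId, checkpoint.totalTasks);
        state.latestByTask.putAll(checkpoint.tasksToEventTime);
        return state;
    }

    /** Read-only view of the per-task timestamps. */
    Map<String, Long> view() {
        return Collections.unmodifiableMap(latestByTask);
    }
}
```

After a restart, watermarks received after the checkpointed offsets are replayed and applied on top of the restored state, so the bookkeeping converges to where it was before the failure.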
Details
ControlMessage
We will support two types of ControlMessage: EndOfStreamMessage and WatermarkMessage.
```java
// Base class for all control messages; carries the fields shared by every type.
public abstract class ControlMessage {
  private final String taskName;  // upstream task that produced the message
  private final int taskCount;    // total number of upstream tasks
  private int version = 1;

  public ControlMessage(String taskName, int taskCount) {
    this.taskName = taskName;
    this.taskCount = taskCount;
  }

  public String getTaskName() {
    return taskName;
  }

  public int getTaskCount() {
    return taskCount;
  }

  public void setVersion(int version) {
    this.version = version;
  }

  public int getVersion() {
    return version;
  }
}

// Signals that the upstream task has finished producing to the stream.
public class EndOfStreamMessage extends ControlMessage {
  private final String streamId;

  // Public so that producers and the serde can construct instances.
  public EndOfStreamMessage(String streamId, String taskName, int taskCount) {
    super(taskName, taskCount);
    this.streamId = streamId;
  }

  public String getStreamId() {
    return streamId;
  }
}

// Carries the timestamp up to which the upstream task has processed.
public class WatermarkMessage extends ControlMessage {
  private final long timestamp;

  // Public so that producers and the serde can construct instances.
  public WatermarkMessage(long timestamp, String taskName, int taskCount) {
    super(taskName, taskCount);
    this.timestamp = timestamp;
  }

  public long getTimestamp() {
    return timestamp;
  }
}
```
Rejected Alternative:
Out-of-band control stream
In this approach, the ApplicationRunner creates a separate control stream for propagating control messages. The control stream is a one-partition broadcast stream that is consumed by every container in the application. The ApplicationRunner manages the lifecycle of the control stream: it creates the stream on the first run and purges it at the start of each subsequent run (the same way output streams are handled when consuming from Hadoop).
When an input stream is consumed to the end, Samza sends an end-of-stream (EOS) message to the control stream. The message includes the input topic and partition.
Once EOS messages have been received from all partitions of an input, we know that the input has reached end-of-stream. The ControlStreamConsumer then inspects the stream graph to find any intermediate stream whose input streams have all reached end-of-stream, and marks each such intermediate stream as pending end-of-stream. After that, whenever a partition of a marked intermediate stream reaches its highest offset (the high watermark in Kafka), we can emit an end-of-stream message for that partition. At that point the partition is guaranteed to have reached end of stream.
Comparisons of the two approaches:
Approach | Pros | Cons |
---|---|---|
Out-of-band | - Intermediate streams stay clean and contain only user data, which is convenient if the user wants to consume them elsewhere. - Recovery from failure is simple: just read the control stream from the beginning. - Fewer messages. One control message is needed per input stream partition, so with n partitions the total is n messages. | - Each out-of-band control message must be correlated with the source stream, which is complex to track and requires synchronization between the input streams and the control stream. - A separate stream must be maintained for control messages. |
In-band | - No coordination is needed between control messages and input messages. When a control message is received, it marks that all messages sent before it have been consumed completely. This is critical for supporting general event-time watermarks. | - Failure handling is more complicated. The consumer of control messages needs to checkpoint the control messages it has received so that it can resume after recovering from a failure. - More control messages are required. Each producer task (n tasks) must write a control message to every partition of the intermediate stream (m partitions), so the total is n*m messages. |
Based on the pros and cons above, we propose to use the in-band approach to support control messages.
...