...
For instance, to perform the operations illustrated in Fig. 1 on a stream of messages, a user can write the Samza app in listing 1 using Samza's High-Level API:
Fig. 1 — A logical workflow of stream processing operations
...
Listing 1

```java
public class StreamStreamJoinApp implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream s1 = graph
        .getInputStream("S1")
        .filter(/* Omitted for brevity */);
    MessageStream s2 = graph
        .getInputStream("S2");
    OutputStream s3 = graph.getOutputStream("S3");
    s1.join(s2, /* Omitted for brevity */)
        .sendTo(s3);
  }
}
```

Fig. 2 — An illustration of the OperatorSpec graph of objects generated by Samza for the application in listing 1. OperatorSpecs associated with input/output streams are highlighted in yellow.
...
The Execution Planner is the core Samza module responsible for verifying that all streams participating in any given Join operation agree in partition count. To achieve this, it traverses the graph of `OperatorSpec`s produced by the Samza High-Level API and checks every such set of joined streams for compliance with this requirement.
Fig. 3 — Two example cases of Stream-Stream Joins. After considering the partition counts of the joined input streams, Samza’s Execution Planner accepts the one on the left but rejects the one on the right.
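
The agreement check described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `JoinPartitionCheck` and `partitionCountsAgree` are hypothetical names, streams are identified by plain strings, and the partition counts are assumed to be known up front. It is not the Execution Planner's actual code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper illustrating the check; not a Samza class. */
final class JoinPartitionCheck {

    private JoinPartitionCheck() { }

    /**
     * Returns true if all streams participating in one Join operation
     * have the same partition count.
     *
     * @param joinedStreams   names of the streams feeding a single Join
     * @param partitionCounts known partition count of each stream, keyed by name
     */
    static boolean partitionCountsAgree(List<String> joinedStreams,
                                        Map<String, Integer> partitionCounts) {
        Set<Integer> distinctCounts = new HashSet<>();
        for (String stream : joinedStreams) {
            Integer count = partitionCounts.get(stream);
            if (count == null) {
                throw new IllegalArgumentException(
                    "Unknown partition count for stream " + stream);
            }
            distinctCounts.add(count);
        }
        // Zero or one distinct value means the joined streams agree.
        return distinctCounts.size() <= 1;
    }
}
```

With illustrative counts of 16 for both streams the check passes. If one stream had 16 partitions and the other 8, it would fail, and a planner built this way would reject the application.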
...
1. Any intermediate stream joined with an input stream is assigned the same partition count as that input stream.
2. Any intermediate stream not covered by the first rule is assigned the partition count specified by the Samza config property `job.intermediate.stream.partitions`.
3. If no value is specified for `job.intermediate.stream.partitions`, the Execution Planner falls back to the maximum partition count among all input and output streams, capped at a hard-coded maximum of 256.
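
The precedence among these rules can be made concrete with a small sketch. The class `IntermediatePartitionRules` and the method `resolve` are hypothetical names chosen for illustration. Only the order of the rules and the cap of 256 come from the text above, and the sketch assumes the joined input stream's partition count (if any) has already been determined.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.OptionalInt;

/** Hypothetical illustration of the rule precedence; not a Samza class. */
final class IntermediatePartitionRules {

    /** Hard-coded cap applied only to the fallback rule. */
    static final int MAX_INFERRED_PARTITIONS = 256;

    private IntermediatePartitionRules() { }

    /**
     * Resolves the partition count of one intermediate stream.
     *
     * @param joinedInputCount     partition count of the input stream it is joined with, if any
     * @param configuredCount      value of job.intermediate.stream.partitions, if set
     * @param inputAndOutputCounts partition counts of all input and output streams (non-empty)
     */
    static int resolve(OptionalInt joinedInputCount,
                       OptionalInt configuredCount,
                       Collection<Integer> inputAndOutputCounts) {
        // Rule 1: match the input stream this intermediate stream is joined with.
        if (joinedInputCount.isPresent()) {
            return joinedInputCount.getAsInt();
        }
        // Rule 2: otherwise use the configured value.
        if (configuredCount.isPresent()) {
            return configuredCount.getAsInt();
        }
        // Rule 3: otherwise use the largest input/output partition count, capped at 256.
        int max = Collections.max(inputAndOutputCounts);
        return Math.min(max, MAX_INFERRED_PARTITIONS);
    }
}
```

For example, an intermediate stream joined with a 16-partition input stream resolves to 16 regardless of the configured value, while an unjoined stream with no configured value and input/output counts of 16 and 512 resolves to 256.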
Fig. 4 — The OperatorSpec graph of an example high-level Samza application that employs the Partition-By operation. The Execution Planner assigns the partition count 16 to intermediate stream S2′, the same value as input stream S1, since the two are joined together.
It is important to realize there are situations where it is not possible to enforce agreement between an intermediate stream and the input streams it is joined with, a scenario that would cause the Execution Planner to signal an error and reject the whole application. Fig. 5 illustrates one such case.
Fig. 5 — The |
...
Listing 2

```java
public class StreamTableJoinApp implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream s1 = graph.getInputStream("S1");
    MessageStream s1Prime = s1.partitionBy(/* Omitted for brevity */);

    // Assume local table
    Table t = graph.getTable(/* Omitted for brevity */);
    s1Prime.sendTo(t);

    MessageStream s2 = graph.getInputStream("S2");
    OutputStream s3 = graph.getOutputStream("S3");
    s2.join(t, /* Omitted for brevity */)
        .sendTo(s3);
  }
}
```

Fig. 6 — The logical data flow in the example Samza application in listing 2. Stream S1 is partitioned and then sent to table T, which is then joined with stream S2.
...
Listing 3

```java
public class StreamTableJoinApp implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    Table<KV<Integer, String>> t = graph
        .getTable(/* Omitted for brevity */);

    MessageStream s1 = graph
        .getInputStream("S1")
        .filter(/* Omitted for brevity */);
    s1.sendTo(t);

    MessageStream s2 = graph.getInputStream("S2");
    OutputStream s3 = graph.getOutputStream("S3");
    s2.join(t, /* Omitted for brevity */)
        .sendTo(s3);
  }
}
```

Fig. 7 — The OperatorSpec graph generated by Samza for the application in listing 3. As usual, OperatorSpecs associated with input/output streams are highlighted in yellow.
...
To extend Samza’s ExecutionPlanner to support Tables, we need to address the disconnect between a `SendToTableOperatorSpec` and all relevant `StreamTableJoinOperatorSpec`s. One possibility that does not require changing Samza’s High-Level APIs is to modify the `OperatorSpec` graph traversal such that virtual connections are assumed between every `SendToTableOperatorSpec` and all the `StreamTableJoinOperatorSpec`s that reference the same table (`TableSpec`) in the entire `OperatorSpec` graph.
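
A minimal sketch of such a traversal follows. It uses a simplified, hypothetical node type (`SpecNode`) in place of Samza's real `OperatorSpec` classes and identifies tables by a plain string id, so it should be read as an illustration of the idea rather than the proposed implementation. When computing the successors of a send-to-table node, the sketch adds every stream-table-join node that references the same table, in addition to the node's real outgoing edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for an OperatorSpec; not Samza's class. */
final class SpecNode {
    enum Kind { INPUT, SEND_TO_TABLE, STREAM_TABLE_JOIN, OTHER }

    final String id;
    final Kind kind;
    final String tableId; // null unless the node references a table
    final List<SpecNode> next = new ArrayList<>(); // real outgoing edges

    SpecNode(String id, Kind kind, String tableId) {
        this.id = id;
        this.kind = kind;
        this.tableId = tableId;
    }
}

/** Hypothetical traversal that treats table writes and table joins as connected. */
final class VirtualEdgeTraversal {

    private VirtualEdgeTraversal() { }

    /** Real successors plus virtual send-to-table -> stream-table-join edges. */
    static List<SpecNode> successors(SpecNode node, List<SpecNode> allNodes) {
        List<SpecNode> result = new ArrayList<>(node.next);
        if (node.kind == SpecNode.Kind.SEND_TO_TABLE && node.tableId != null) {
            for (SpecNode candidate : allNodes) {
                if (candidate.kind == SpecNode.Kind.STREAM_TABLE_JOIN
                        && node.tableId.equals(candidate.tableId)) {
                    result.add(candidate); // virtual connection via the shared table
                }
            }
        }
        return result;
    }

    /** Ids of all nodes reachable from start, in breadth-first visiting order. */
    static Set<String> reachableFrom(SpecNode start, List<SpecNode> allNodes) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<SpecNode> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            SpecNode current = queue.poll();
            if (visited.add(current.id)) {
                queue.addAll(successors(current, allNodes));
            }
        }
        return visited;
    }
}
```

Applied to a graph shaped like the application in listing 3, a traversal starting at input stream S1 would reach the join with S2 through table T, which a traversal over real edges alone would miss. Scanning all nodes for each send-to-table node is quadratic in the worst case; an index from table id to join nodes would avoid that, but it is omitted here for brevity.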
Fig. 8 — A graph representing the |
...
1. `OperatorSpec` graph analysis
2. Conversion to `StreamEdge`s
3. Join input validation
Details of each one of these steps are outlined below.
...