Stream Processing Frameworks
MillWheel from Google
Slides: https://docs.google.com/presentation/d/11YqFU760Gk1qcKr3K3rCG4dXHP5X2zmTS45vQY2lDN4/present?slide=id.g4672b9e62_010 (slide 53 on the trigger API)
1. Use event time as the timestamp to determine window boundaries. Keep windows open for delayed / out-of-order event messages until some long time has elapsed.
a. Use "low watermark" to bound the event timestamp that future records may come with.
b. Low watermark values are injected by external systems to the very first computational stage, following stages infer LM from the up-streaming stage's LM.
2. Keys are specified and extracted on-the-fly by the consumers, so different consumers can consume the same stream by form different key-value pairs for processing, this saves the re-partition stage.
3. Persistent storage is per-key, where you can store per-key aggregates / updated states, or buffered individual key-value pairs for certain key. This storage is backed up by BigTable / Spanner.
4. Use sequencers to avoid zombie writes, and checkpointing delivery / updates before the real action to guarantee atomic state update.
5. Each stage checks for duplicate consumed messages with unique record IDs and ACK only after it has been consumed AND processed to guarantee exactly-once.
Heron from Twitter
1. Heron is Storm 2.0, just like YARN as Hadoop 2.0, in the sense that 1) it separates the topology management from the processing layer, 2) use process-level isolation and remove shared threads.
2. Each machine as only one "Stream Manager" which is responsible for receiving / sending data, and all process instances of that machine communicated with that SM for data in/out. SM has its own buffer to sending / receiving and when the buffer is full it will back-pressure on up-stream to let it slow-down the traffic.
3. Each instance has two thread, one for communicating with SM (including ser-de, etc), and one for processing.
1. For deployment, it seems only support runnable mode, in which the Flink framework runtime needs to be started first, then deploy the Flink jobs on it either through cmdline or a web UI (http://ci.apache.org/projects/flink/flink-docs-release-0.9/quickstart/setup_quickstart.html).
2. Support some high-level operator programming interface in Java / Scala / Python (beta), like map / reduce, filter, union, fold and some windowing operations (http://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/streaming_guide.html#transformations).
One important JavaDoc class is DataStream (https://ci.apache.org/projects/flink/flink-docs-release-0.9/api/java/)
3. Use some "stream connectors" to leverage third-party messaging queues like Kafka / RabbitMQ (http://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/streaming_guide.html#stream-connectors) besides programmable source / sink and normal formats like socket / files.
4. For compiling, it follows the standard optimization process of high-level language presentation -> logical operator DAG for optimization -> physical plan for runtime (it is not clear what optimizations are applied for physical plan generation since it is not mentioned on the wiki).
5. Deployment supports YARN / GCloud / standalone. The Flink framework runtime contains a JobManager (JB) that works as a task scheduler to accept submitted jobs in the form of the physical plan JobGraph, and run them as parallel tasks, each managed by a TaskManager (TM). Note each task contains a complete parallel pipeline of the task instead of a single stage of the task (https://ci.apache.org/projects/flink/flink-docs-release-0.9/internals/job_scheduling.html).
Apache Spark Streaming
- DStream, the abstraction of a data stream in Spark Streaming, is a sequence of RDDs. This is called a "micro-batch" architecture. New batches are created at regular time intervals (determined by a batch interval parameter).
- Recovery from a fault uses recomputation of RDDs.
- API supports a comprehensive set of collection functions (map, flatMap, ...) and also windowed operations by a sliding window.
- Stateless transformations include map, filter, reduceByKey, joins etc. They are RDD transformations applied to a batch.
- Stateful transformations include sliding window based transformation and state tracking across time (updateStateByKey).
- Sliding window has two parameters, window duration and sliding duration. Both of them are multiple of the batch interval.
- The window duration defines the size of the window.
- The sliding duration defines how frequently the computation happens.
- Checkpointing saves the application state to a reliable storage system, such as HDFS, S3.
- Spark Streaming guarantees exactly-once semantics.
- Spark has a local mode for quick experiment (Spark shell)
- Spark has own cluster management and also works with other cluster managers (YARN, Mesos).
- Spark doesn't seem to have an embedded mode.
- A user submits an application (a jar file or a python script) using a provided script (spark-submit) to Spark.
Photon from Google
1. Photon is a special Stream-Stream join system that transform the computation into a Stream-Table where it picks the higher-volume stream and stores it as a table, and then joins with the lower-volume stream.
2. Supports cross-DC joins by using Paxos to detect duplicates, etc.
Apache Apex (incubating)
- Pipeline processing architecture, can be used for real-time and batch processing in unified architecture.
- Architected for scalability, low-latency processing, high availability, operability.
- Stateful fault tolerance (checkpoints operator state without user having to write code for it).
- Runs natively on YARN and HDFS, local mode for development.
- Rich library of pre-built operators (Malhar) with many adapters for message buses, databases, file systems etc.
- Supports Kafka as source and sink (at any point in the topology), connector with offset management for exactly once semantics / idempotency.
Resource Manager Frameworks
Mesos / Marathon
YARN / Slider