Research and development roadmap for Flink Streaming
Fault tolerance guarantees for streaming
Fault-tolerant stateful streaming has seen a lot of development over the years. Fault tolerance mechanisms can nevertheless be very expensive in stream processing applications, in terms of both storage and latency. Part of our research on Flink is to minimize these costs by taking advantage of semantic properties such as window pre-aggregates, as well as by investigating different state management approaches. The ultimate goal of fault tolerance in Flink is to offer exactly-once processing semantics while sacrificing as few resources as possible.
Currently, there is an ongoing effort for streaming fault tolerance in Flink that targets at-least-once semantics, using in-memory replication within the stream sources combined with lightweight tuple lineage tracking. An exactly-once approach can then build on this mechanism.
Overall, the following concrete tasks should be implemented:
At-least-once semantics using source replication and tuple lineage tracking
Proper declaration and handling of explicit state for stateful operators
Efficient state checkpointing implementation
Upstream backup across network channels
Exactly-once semantics scheme implementation using the techniques above
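To make the first task more concrete, the following is a minimal Python sketch of at-least-once delivery through source-side replication and lineage tracking. It is not Flink code: `ReplayableSource`, `LineageTracker` and `run_with_failure` are hypothetical names invented for this illustration, and the per-record pending counter is just one possible way to track lineage. The source keeps every emitted record in memory until the tracker reports that all tuples derived from it have been processed, and replays whatever is still unacknowledged after a failure.

```python
from collections import OrderedDict


class ReplayableSource:
    """Hypothetical source that keeps emitted records in memory until acked."""

    def __init__(self, records):
        self._pending = iter(records)
        self._in_flight = OrderedDict()  # record id -> record, kept until acked
        self._next_id = 0

    def emit(self):
        """Emit the next new record, or None when the input is exhausted."""
        try:
            record = next(self._pending)
        except StopIteration:
            return None
        record_id = self._next_id
        self._next_id += 1
        self._in_flight[record_id] = record
        return record_id, record

    def ack(self, record_id):
        """Drop the in-memory copy once downstream processing is confirmed."""
        self._in_flight.pop(record_id, None)

    def replay(self):
        """Return every record that was emitted but never acked."""
        return list(self._in_flight.items())


class LineageTracker:
    """Counts, per source record, the derived tuples still being processed."""

    def __init__(self, source):
        self._source = source
        self._open = {}

    def register(self, root_id, fan_out=1):
        self._open[root_id] = self._open.get(root_id, 0) + fan_out

    def done(self, root_id):
        self._open[root_id] -= 1
        if self._open[root_id] == 0:
            del self._open[root_id]
            self._source.ack(root_id)  # whole lineage finished


def run_with_failure(records, fail_on):
    """Process all records, simulating one lost ack for `fail_on`."""
    source = ReplayableSource(records)
    tracker = LineageTracker(source)
    outputs = []
    failed = False
    while True:
        item = source.emit()
        if item is None:
            break
        record_id, record = item
        tracker.register(record_id)
        outputs.append(record.upper())  # the operator's visible effect
        if record == fail_on and not failed:
            failed = True  # simulated crash: the ack never arrives
            continue
        tracker.done(record_id)
    # Recovery: unacked records are replayed, so duplicates are possible.
    for record_id, record in source.replay():
        outputs.append(record.upper())
        tracker.done(record_id)
    return outputs
```

Running `run_with_failure(["a", "b", "c"], fail_on="b")` produces every record at least once, with `"B"` appearing twice. Removing that duplicate is what the explicit state handling and checkpointing tasks in the list are meant to enable.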
A common pipeline architecture for batch and streaming
Flink currently offers two separate APIs, one for batch and one for streaming, which support operations on DataSet and DataStream collections respectively. While there are some major differences in the way the two types of jobs are handled, such as the additional optimisation phase for batch jobs, both run on the same flow-based runtime. The operations are also almost identical, which allows user-defined functions to be reused across both APIs.
Modelling such a unified architecture and finding the basic principles behind it would serve as a great topic for theoretical research in distributed data management. One result of this study and implementation would be a set of binary operators that can be applied to sets and streams. Furthermore, the ability to combine DataSet and DataStream operations within the same application would benefit many types of machine learning or data mining tasks that need to handle both offline and online data. Some examples of such applications are user profiling, collaborative filtering and anomaly detection. While both incremental and offline algorithm variations exist for such tasks, flexible API support for combining them offers further potential to achieve both high accuracy and low latency. This multi-paradigm support can also lay the foundations for a new class of machine learning algorithms as well as for streaming graph computations.
Overall, the following concrete tasks should be implemented:
Identification and implementation of a “lambda architecture” kernel of binary operators between DataSets and DataStreams.
Create a single context and environment for such executions
Unify fault tolerance mechanisms efficiently
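As an illustration of what one such binary operator could look like, here is a minimal Python sketch of a join between a finite data set and a possibly unbounded stream. It is not a Flink API: the function name `join_stream_with_dataset` and its parameters are assumptions made for this example. The finite side is materialised into a hash index once, and the stream side is probed lazily, one element at a time.

```python
from collections import defaultdict


def join_stream_with_dataset(stream, dataset, stream_key, dataset_key):
    """Hypothetical binary operator: join each stream event with matching rows.

    `dataset` must be finite; `stream` may be any iterator, including an
    unbounded one, because results are yielded as events arrive.
    """
    # Build phase (batch side): index the finite data set by key.
    index = defaultdict(list)
    for row in dataset:
        index[dataset_key(row)].append(row)

    # Probe phase (streaming side): emit one result per matching row.
    for event in stream:
        for row in index.get(stream_key(event), []):
            yield event, row


# Usage: enrich a click stream with an offline user-profile data set.
profiles = [("u1", "premium"), ("u2", "free")]
clicks = iter([("u1", "/home"), ("u3", "/about"), ("u2", "/cart")])

enriched = list(
    join_stream_with_dataset(
        clicks,
        profiles,
        stream_key=lambda click: click[0],
        dataset_key=lambda profile: profile[0],
    )
)
```

Because the operator is a generator, it never needs the stream to end. Events without a matching row, such as the click from `u3`, are simply dropped in this inner-join variant.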
Currently we only provide a Java API for Flink streaming. In the near future we would like to make the same functionality accessible through a Scala (and later Python) API, as is already the case for batch processing. Implementing the Scala API should be fairly straightforward because most of the features can be reused from the batch API.
Going one step beyond the general streaming APIs, we should provide integration with other frameworks that use stream processing engines. We have already started working on integration with SAMOA, a distributed machine learning framework. Our notes on the current status of the integration can be viewed here.
Some of the tasks that could be identified are the following:
Implement intuitive Scala/Python APIs for Flink-Streaming
Integrate Flink with SAMOA
Offer Exploratory Data Analysis Tool Compatibility on Flink, which includes:
Interactive Shell Support
Notebook - Zeppelin Integration
SQL - Streaming SQL integration
Offer Complex Event Processing semantics support (e.g. cross-stream/dataset temporal pattern matching)
Multiparadigm machine learning algorithms
Several machine learning libraries now build on data-intensive systems. Some examples are MLlib and Mahout on Apache Spark for offline learning, and SAMOA on Apache Storm and S4 for incremental learning. While the streaming paradigm is not suitable for all machine learning algorithms, parts of model training can benefit from it while still allowing offline analysis. For example, in the item correlation analysis used in collaborative filtering applications, item co-occurrences could be streamed in real time to update an underlying model, while batch jobs could be run on demand on top of this model to extract, e.g., the top-K relevant items per item based on pairwise log-likelihood. In anomaly detection applications where it is not known in advance what an anomaly is, it is beneficial to build a clustering model offline that identifies, for example, temporal anomalies over an extended period. Once knowledge has been extracted by offline processing, it can be used incrementally, e.g. in the form of evaluation rules that classify incoming data in real time.
The overall vision of this project is to create a library of novel machine learning algorithms that make use of both the batch and streaming paradigms in order to offer both high accuracy and low latency. In addition to the research benefits of this approach, such algorithms could serve as the backbone of real-world data-intensive pipelines, minimising processing costs while offering fresh predictive analytics.
Even though this is a general research effort, some concrete tasks can be:
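The collaborative filtering example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `CooccurrenceModel` is a hypothetical class, and the batch ranking uses raw co-occurrence counts as a stand-in for the pairwise log-likelihood score mentioned above. The `update` method plays the role of the streaming path, and `top_k` plays the role of the on-demand batch job over the same model.

```python
from collections import Counter, defaultdict
from itertools import combinations


class CooccurrenceModel:
    """Hypothetical model shared by a streaming updater and a batch query."""

    def __init__(self):
        # item -> Counter of items seen together with it
        self._counts = defaultdict(Counter)

    def update(self, basket):
        """Streaming path: fold one incoming basket into the model."""
        for a, b in combinations(sorted(set(basket)), 2):
            self._counts[a][b] += 1
            self._counts[b][a] += 1

    def top_k(self, k):
        """Batch path: rank neighbours of every item by co-occurrence count.

        Ties are broken alphabetically so the result is deterministic.
        Raw counts are used here instead of a log-likelihood score.
        """
        return {
            item: [
                other
                for other, _ in sorted(
                    neighbours.items(), key=lambda kv: (-kv[1], kv[0])
                )[:k]
            ]
            for item, neighbours in self._counts.items()
        }


# Usage: baskets arrive one by one; the ranking is computed on demand.
model = CooccurrenceModel()
for basket in (["a", "b", "c"], ["a", "b"], ["b", "c"]):
    model.update(basket)

recommendations = model.top_k(1)
```

The design point is that both paths operate on one shared model: updates stay cheap and incremental, while the more expensive ranking runs only when a fresh result is requested.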
Design abstractions that generalise a hybrid stream and dataset architecture for machine learning
Create a framework or extend an existing one for general algorithm/pipeline design
One idea is to extend SAMOA, which involves implementing that paradigm on all underlying systems
Alternatively create a new machine learning library
Implement a collection of representative machine learning algorithms and sample pipelines
Build or port a graph processing library on the same framework