Current state: ACCEPTED
Apache Flink’s documentation has grown a lot over the years. Some features are better documented than others; some content is outdated or become irrelevant over time. With this FLIP we would like to coordinate the efforts to restructure, improve and update our documentation.
Top-Level Structure with Clear Separation of Concerns
With this FLIP we would like to clean up the top-level structure of Flink’s documentation. The top-level sections should all have dedicated goals and clear separation of concerns (see below)
Expand and Restructure Existing Concepts Section
The Apache Flink documentation already contains a Concepts section, but it is a ) incomplete and b) lacks an overall structure & reading flow and c) describes Flink as the community presented it 2-3 years ago. Also, “concepts”-content is also spread over the development & operations documentation without references to the “concepts” section.
Align Application Development with current State and Flink Roadmap
The structure of the Application Development section of the documentation does not align well with the overall direction of the project. Broadly speaking, while we are moving into the direction of having two unified APIs for batch and stream processing (DataStream API & TableAPI SQL), the documentation is centered around DataStream and Dataset API.
Improve documentation on Deployment & Operations
The operations and monitoring are arguably the weakest sections of the Apache Flink documentation. While it is comprehensive and often goes into a lot of detail, it lacks an overall structure and does not address common overarching concerns of operations teams in an efficient way.
Improve getting-started experience
The current documentation has a few getting started guides but those only cover some of the APIs and are also not consistent. It would be good to have guides that address different aspects like APIs (job development / SQL queries), operations (job submission, management via UI/REST). Some of these could be realized with Docker compose environments.
Use of a common terminology
Some terms in the documentation are used to describe the same things or sometimes a term is also used to describe two different things. Examples are operator, task, instance, partition, record, event, function, transformation, etc. It would be great to have a reference pages that defines the meaning of these words and aim to use them in a consistent way.
Add a documentation style guide to contribution guide
Similar to the Checkstyle used to keep the Flink’s source code consistent, we should have a common style for all pages of the documentation. Doing so will allow full community involvement in growing the documentation while maintaining a consistent visual structure. Examples include how and when to use note and attention tags, where the table of contents should appear on the page, etc.
We propose a complete overhaul of the structure of our documentation; expanding, improving, and bringing up to date all sections.
Current (moved to)
Getting Started is targeted to new users, who want to get up and running quickly. It will contain quickstarts and example walkthroughs for the two main APIs: DataStream & TableAPI/SQL. Interactive Docker compose playgrounds would be great to have as well: 1) SQL CLI setup (like the SQL training, 2) Flink+ Kafka + a job jar to submit, savepoint, restart, etc via Web UI / REST..
The Concepts section introduces the user to Flink’s fundamental concepts for batch as well as stream processing.
The APIs section explains in detail how to write Flink applications with the DataStream API and Table API / SQL and for the time being also DataSet API. The APIs section references Concepts whenever the conceptual background is needed.
The Libraries section documents the usage of our libraries on top of Flink’s high-level APIs, namely for Complex Event Processing, Graph Processing, and Machine Learning.
The new top-level Connectors section makes it easy for new users to evaluate how Flink applications integrate with their existing infrastructure quickly. This section discusses existing connectors as well as a guide to custom sources & sinks.
The Deployment Section covers all topics related installation & setup of a Flink cluster or application including metrics, logging and security concerns.
In contrast the Deployment section, the Operations sections will focus on “Day 2 operations”, namely monitoring, upgrades, troubleshooting & tuning Flink applications & clusters.
Getting Started Section
Getting Started is targeted to new users, who want to get up and running quickly. We propose the following structure:
- Flink Overview (~ two pages)
- Project Setup
- Example Walkthrough - Table API / SQL
- Example Walkthrough - DataStream API
- Docker Playgrounds
- Flink Cluster Playground
- Flink Interactive SQL Playground
The Example Walkthroughs will be structured in multiple steps, similar to how it is done in Apache Beam: Simple Wordcount, Windowed Wordcount. The Example Walkthroughs should also contain deployment to a minimal local Flink cluster.
- Writing a program: Tell the story of how to evolve batch into streaming:
- Start with simple word count (batch)
- Windowed word count (batch, group by word and tie window)
- Continuous windowed word count (different source), fire by watermark or periodic proc.-timers. (two ways of producing output)
The playgrounds are Docker compose environments that come with Flink and Kafka. The SQL CLI playground can be a stripped down version of the SQL training. The Flink Ops playground could come with one (or more) reasonably complex streaming applications that can be submitted for execution. We could give some basic REST commands (taking savepoint, killing job, requesting metrics, …) and let users play with the WebUI. I think this could be a nice starting experience.
One of the primary goals of this FLIP is to create a better, more prominent Concepts section. The purpose of this section is to introduce Flink users to the fundamental concepts of stream & batch processing with Apache Flink. Each subsection should cover both: stream and batch processing.
We propose the following structure for this section:
- Stream Processing
- A Unified System for Batch & Stream Processing
- DAGs, Parallel DAGs, Partitioning (Random, Keyed, Broadcasted), Functions (wrapped in operators)
- Stateful Stream Processing & Persistent State
- What is State?
- State in Stream & Batch Processing
- State Types (Keyed State, Broadcast State)
- State Persistence
- Asynchronous Barrier Snapshots
- State Backends (Working Data Structure vs Checkpoint Storage; not describing the existing backends though)
- Delivery Guarantees & “Exactly Once Semantics”
- End-to-end exactly once semantics
- Time & The Latency-Completeness Trade-Off
- Latency & Completeness
- Latency vs. Completeness in Batch & Stream Processing
- Notions of Time
- Event Time
- Different Sources of Time (Event, Ingestion, Storage)
- Processing Time
- Making Progress: Watermarks, Processing Time, Lateness
- Propagation of watermarks
- Flink Architecture
- Flink Applications and Flink Sessions
- Anatomy of a Flink Cluster
- Client, Flink Master (Dispatcher, Resourcemanager, Jobmanagers), Task Managers
The Table API has the additional concept of “Dynamic Tables”. Since this “Concept” is only relevant to that API, we propose to introduce it in APIs -> Table API / SQL.
- Project Build Setup is removed from APIs and moved Getting Started -> Project Setup.
- Libraries is moved top-level.
- Basic API Concepts & Data Types & Serialization are mostly moved to DataStream API.
- Best Practices can be removed completely
This corresponds to the current Libraries section under Application Development with the following subsections:
- Complex Event Processing (moved as is. https://issues.apache.org/jira/browse/FLINK-9773 for restructuring)
- Graph Processing
- Machine Learning (document new library after replacement)
The Connector section will document existing connectors of Apache Flink as well as instructions and best practices to build your own connectors. The section should cover Table API/SQL as well as DataStream API, and for the time being DataSet API.
The content currently mainly resides in
- Application Development -> Streaming -> Connectors
- Application Development -> Batch -> Connectors
- Application Development -> Table API /SQL -> Connect to External Systems
and only needs to be moved and restructured.
- Stream Sources, Stream Sinks, Table Sources, Table Sinks (Relationship between Stream Sources/Sink and Table Sources Sinks)
- Existing Connectors
- Apache Kafka
- Elastic Search
- Existing Encodings
- (Stream Sources) (To be added after Source Interface Rework)
- How to write a Stream Source
- Stream Sink
- How to write a Stream Sink
- Table Sources and Sinks
- How to build Table Sources/Sinks on top of stream sources and sinks
The Testing section will document best practices for testing across api’s and the testing pyramid.
- Unit Testing
- Stateless Functions
- Stateful Functions
- How to use the StreamOperatorTestHarness
- Table API UDF’s
- Integration Testing
- Running embedded Flink clusters
- Injecting Faults
- Running the local cluster with savepoints to test recovery
The Deployment section is restructured and follows a top-down approach starting from a reference architecture, which explains the different components and their roles in a production setup indicating “optional” components. The different components are then discussed in the following subsections including different resource managers, state backends & Flink Master failover. This subsection concludes by three reference setups for typical Fink installations.
The rest of this section is straightforward covering how to setup Metrics & Logging (not the metrics themselves), how to secure a Flink environment and a reference of all Flink configurations.
- Apache Flink Reference Architecture
- Components: Resource Manager, Flink, Checkpoint Storage, High Availability Services Provider (aka Zookeeper)
- Resource Manager
- Flink Application on K8s
- Flink Cluster on K8s
- Flink Application on Yarn
- Flink Cluster on Yarn
- Flink Application on Mesos
- Flink Cluster on Mesos
- (Flink Application on Standalone)
- Flink Cluster on Standalone
- Flink Master Failover (formerly High Availability)
- Reference Architectures
- A Reference Architecture on YARN
- HDFS, Zookeeper, YARN, Keberos(?)
- A Reference Architecture for Hosted Kubernetes
- Hosted Kubernetes, Object Storage, Zookeeper/Zetcd?
- A Reference Architecture for Standalone on VMs
- Standalone, Standby Flink Masters, Standby Taskmanagers, NFS, Zookeeper, init.d for process supervising
- Metrics & Logging
- Metrics Reporter
- Logging Configuration
- Apache Flink Security
- Flink REST API
- Intra-Cluster Communication
- Configuration Reference Guide
In contrast to the Deployment section the Operations sections focuses on “Day 2 operations”, namely monitoring, upgrades, troubleshooting & tuning Flink applications & clusters. This distinction is not always clear cut though.
- Apache Flink Interfaces
- Apache Flink WebUI
- REST API
- SQL Client
- Scala Shell
- Upgrading Apache Flink Applications & Flink Sessions
- Application State Compatibility
- Metrics Reference
- Slots, Threading Model
- Slot Sharing
- Task Chaining
- Object Reuse
- Debugging & Troubleshooting
- “Common Issues”
- Issue A
- Issue B
New or Changed Public Interfaces
Migration Plan and Compatibility
At this point we only have a very rough migration plan. The implementation of this FLIP will be realized in three phases.
The first phase we simply implement a few prerequisites for such a large documentation change.
- Set up a redirect for 404s to be redirected to the Apache Flink homepage for existing links to Flink’s “stable” or “master” documentation.
- Create style guide for documentation (see Goals)
- Create 1st version of Concepts->Glossary
The second phase consists already of some significant rework and restructuring, but consists of independent projects and does not require major changes the existing documentation.
- Update Getting Started section & replace current Tutorials & Examples sections.
- Update Concepts section and move “Concepts” content from Application Development to Concepts
- Remove Flink Development & Internals from documentation
- Flink Development can partially be dropped or migrated into the contribution guide on the Flink project homepage
- Internals can be removed. Most content is outdated and can partially be reused for the updated Concepts section.
Afterwards, we tackle the other two larger parts of the documentation rework. These depend on the new Concepts section, but realistically the Concept sections will evolve significantly in the course of the whole implementation phase of this FLIP.
- Restructure Application Development and split up into APIs, Libraries and Connectors sections
- Rework and expand Deployment & Operations and Metrics & Monitoring section
All non-translation documentation tickets:
project = FLINK AND component = Documentation and status = Open and component != chinese-translation
Already finished translation efforts:
project = FLINK AND component = Documentation and resolutionDate !=null and component = chinese-translation
Tickets independent, but related to this FLIP/documentation, which came up in the discussion:
- https://issues.apache.org/jira/browse/FLINK-9773 (Break Down CEP Documentation)
- https://issues.apache.org/jira/browse/FLINK-12537 (Improve The Documentation Build Time)
- Create ticket to describe possible multi-data center setups
- Create ticket to improve reinterpretAsKeyedStream documentation
Existing tickets subsumed by this FLIP
- https://issues.apache.org/jira/browse/FLINK-8444 (Rework Dependency Setup Docs)
- https://issues.apache.org/jira/browse/FLINK-1294 (Performance Tuning Documentation)
- https://issues.apache.org/jira/browse/FLINK-4463 (FLIP-3: Restructure Documentation)
Alignment with Chinese Translation FLIP (FLIP-35)
We will coordinate with the community on the concurrent implementation of FLIP-35. In FLIP-35 we will first focus on those parts of the documentation, which won’t fundamentally change as part of this FLIP. The least changes will probably be in the following areas:
- Libraries -> Gelly
- Libraries -> CEP
- Application Development -> Streaming -> Connectors
The rest of Application Development -> Streaming will probably be restructured quite a bit, but the fundamental documentation of Flink’s API will of course stay around and can be translated as well.
In addition any documentation added in the course of the FLIP can of course be translated as well. The first parts will be Getting Started, Concepts and the Flink Documentation Guideline.
This is just an additional list of smaller & larger open questions.
- Where to put “Best Practices”?
In the corresponding sections:
- “Best Practices” for cluster sizing -> Deployment
- “Best Practices” for choosing a state backend -> probably Deployment
- “Best Practices” for using different state types (List, Value, Map) -> APIs
- Add top-level “Releases” section for Migration Guides (API as well as Operations side)?
- For the Application Development restructuring and the Deployment & Operations restructuring/rewrite, it might be useful to have many smaller PRs, which don’t go onto master directly, but on some feature branch.