Authors: Wei Zhong, Jincheng Sun
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Since the release of Flink 1.11, the number of PyFlink users has continued to grow. According to the feedback we have received, the current Flink documentation is not very friendly to PyFlink users. There are two shortcomings:
- Python-related content is mixed into the Java/Scala documentation, which makes it difficult to read for users who only care about PyFlink.
- There is already a "Python Table API" section under the Table API documentation that holds the PyFlink documents, but the number of articles is small and the content is fragmented, which makes it difficult for beginners to learn from.
In addition, FLIP-130 introduced the Python DataStream API, and many documents will be added for those new APIs. To improve the readability and maintainability of the PyFlink documentation, we would like to rework it via this FLIP.
Goals
We will rework the documentation around the following three objectives:
- Add a separate section for Python API under the "Application Development" section.
That is, add a "Python API" section at the same level as "DataStream API", "DataSet API" and "Table API & SQL". We want a unified entry point for all PyFlink documentation, covering both the Python Table API and the Python DataStream API, so that Python users don't need to dive into an ocean of Java documentation to find the Python parts.
- Restructure the current Python documentation into a brand new structure whose content is complete and beginner-friendly
Please see the next section for detailed structure and description.
- Improve the documents shared by Python/Java/Scala to make them more friendly to Python users without affecting Java/Scala users
To reduce redundancy, the new Python API documentation will still reference many documents shared between Python, Java and Scala. However, we can hide the parts that are not related to Python as much as possible when users click the "Python" tab.
Proposed Changes
Top-Level Structure
Application Development
- DataStream API
- DataSet API
- Table API & SQL
- Python API (newly added)
- Data Types & Serialization
- Managing Execution
- API Migration Guides
Python API Document Structure
The following are descriptions of each document above. Some of them can refer to existing documents:
Overview
A brief introduction to PyFlink: what PyFlink is, what it can do, and why users would choose it. It also includes a "Where to go next?" section that guides users to different documents according to their needs.
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/.
Installation
Tells users how to install PyFlink. It also covers PyFlink's environment requirements (e.g. Python 3.5+, Java 8+) and how to prepare an environment that can run PyFlink.
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/installation.html.
Tutorial (Table API)
A tutorial that shows users how to set up a Python project, how to write jobs and how to run them locally.
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/try-flink/python_table_api.html.
10 minutes to Table API
Briefly introduces the basic usage of the Python Table API: the structure of a Python Table API program, creating a TableEnvironment, creating a Table, common Table operations, how to emit results, and so on. This document will not cover advanced content, but it should ensure that users have a basic understanding of every component of the Python Table API after reading it.
We can write this document based on https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/common.html.
TableEnvironment
Introduces how to create a batch/stream TableEnvironment with the Flink or Blink planner, and explains every interface of the TableEnvironment. Briefly describes what a planner is and the differences between the Flink planner and the Blink planner. Also briefly introduces how to configure the state backend, checkpointing and the restart strategy.
DataTypes
This document should be a Python version of https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/types.html.
Operations
Introduces all the Table operations. The content can be extracted from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html. The reason for creating a new document for Python Table operations is that the original document contains too much Java/Scala-only content, and its links guide users to other Java/Scala documents, which confuses Python users.
Expression Syntax
Currently the content of this document is a reference to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html. In the future it will be replaced with the syntax of the Python Expression DSL.
Built-in Functions
Could be a reference to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/functions/systemFunctions.html.
Connectors
The reason for writing separate connector documentation for Python users is that using connectors in PyFlink differs slightly from using them in Java/Scala, e.g. how to add the connector jars in a Python program. These documents will only cover the Python-specific parts of connector usage and will guide users to the top-level "Connectors" section for the detailed documentation.
User Defined Functions
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/python_udfs.html and https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/vectorized_python_udfs.html.
Dependency Management
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/dependency_management.html.
SQL
Could be a reference to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/. However, we need to modify those documents slightly to ensure users cannot see any Java/Scala-only content when they click the "Python" tab on those pages.
Catalogs
This document should be a Python version of https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/catalogs.html.
ML Pipeline
An introduction to the Python ML Pipeline API.
CEP
As the CEP jars are not placed in the "lib" directory of the release distribution, we need to tell users how to add the CEP jars in PyFlink, and then guide them to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/streaming/match_recognize.html.
Metrics
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/metrics.html.
Configurations
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/python_config.html.
Environment Variables
If the environment variable "FLINK_HOME" is set, PyFlink will launch the Java gateway with the jars under "$FLINK_HOME/lib". If the environment variable "PYFLINK_CLIENT_EXECUTABLE" is set, the Flink Java process will use "$PYFLINK_CLIENT_EXECUTABLE" as the Python interpreter to launch the Python client process when needed. We need a document to record these behaviors.
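A sketch of how such a document might illustrate the two variables (the paths are hypothetical, and the variables must be set before the gateway is launched, e.g. before any PyFlink job starts):

```python
import os

# Point PyFlink at a local Flink distribution; the jars under
# $FLINK_HOME/lib will be used when launching the Java gateway.
os.environ["FLINK_HOME"] = "/opt/flink"

# Tell the Flink Java process which Python interpreter to use when it
# needs to launch the Python client process.
os.environ["PYFLINK_CLIENT_EXECUTABLE"] = "/usr/bin/python3"
```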
Debugging
During the execution of PyFlink programs, many PyFlink-specific log files are generated. These files are useful when a PyFlink program goes wrong. We need to tell users how to find these log files here.
DataStream API
This is a placeholder for DataStream API documentation.
Cookbook
We can place the PyFlink best practices in common scenarios here.
FAQ
Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/common_questions.html.
API Docs
The generated sphinx doc of PyFlink, e.g. https://ci.apache.org/projects/flink/flink-docs-release-1.11/api/python/.
Some examples in the generated docs still use deprecated APIs. We need to keep the examples in them up to date.
Rejected Alternatives
N/A
Implementation plan
The details will be refined in FLINK-18775 by creating JIRA sub-tasks.