Authors: Wei Zhong, Jincheng Sun

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Since the release of Flink 1.11, the number of PyFlink users has continued to grow. According to the feedback we have received, the current Flink documentation is not very friendly to PyFlink users. There are two main shortcomings:

  1. Python related content is mixed in the Java/Scala documentation, which makes it difficult for users who only focus on PyFlink to read.
  2. There is already a "Python Table API" section under the Table API documentation that holds the PyFlink documents, but the articles are few and the content is fragmented, which makes it difficult for beginners to learn from.

In addition, FLIP-130 introduced the Python DataStream API, for which many new documents will be added. In order to improve the readability and maintainability of the PyFlink documentation, we would like to rework it via this FLIP.

Goals

We will rework the document around the following three objectives:

  • Add a separate section for Python API under the "Application Development" section.

          That is, add a "Python API" section at the same level as "DataStream API", "DataSet API" and "Table API & SQL". We hope to have a unified entry for all PyFlink documentation, covering both the Python Table API and the Python DataStream API, so that Python users don't need to dive into the ocean of Java documentation to find the Python parts.

  • Restructure the current Python documentation into a brand-new structure to ensure the content is complete and friendly to beginners

          Please see the next section for detailed structure and description.

  • Improve the documents shared by Python/Java/Scala to make them more friendly to Python users without affecting Java/Scala users

In order to reduce redundancy, the new Python API documentation will still reference many Python/Java/Scala shared documents, but we can hide the parts that are not related to Python as much as possible when users click on the "Python" tab.

Proposed Changes

Top-Level Structure

</> Application Development

     • DataStream API
     • DataSet API
     • Table API & SQL
     • Python API (newly added)
     • Data Types & Serialization
     • Managing Execution
     • API Migration Guides

Python API Document Structure


  1. Overview
  2. Getting Started
    1. Installation
    2. Tutorial
      1. Table API
      2. DataStream API
  3. User Guide
    1. Table API
      1. 10 minutes to Table API
      2. TableEnvironment
      3. DataTypes
      4. Operations
      5. Expression Syntax
      6. Built-in Functions
      7. Connectors
        1. From/To Variables
        2. Formats
        3. DataGen & Print & BlackHole
        4. Kafka
        5. FileSystem
        6. JDBC
        7. HBase
        8. Elasticsearch
        9. Hive
        10. Custom Connectors
      8. User Defined Functions
        1. General User Defined Functions
        2. Vectorized User Defined Functions
      9. Dependency Management
      10. SQL
      11. Catalogs
      12. ML Pipeline
      13. CEP
      14. Metrics
      15. Configurations
      16. Environment Variables
      17. Debugging 
    2. DataStream API
      1. 10 minutes to DataStream API
      2. DataTypes
      3. Operations
      4. Connectors
      5. Metrics
      6. Configurations
  4. Cookbook
    1. Recipe for Event-Driven scenarios
    2. Recipe for ETL scenarios
    3. Recipe for Data Analysis scenarios
  5. FAQ
  6. API Docs

The following are descriptions for each document above. Some of them can refer to existing documents:

Overview

A brief introduction to PyFlink, covering what PyFlink is, what it can do, and why users should choose it. It also includes a "Where to go next?" section to guide users to different documents according to their needs.

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/.

Installation

Tells users how to install PyFlink. It also covers PyFlink's environment requirements (e.g. Python 3.5+, Java 8+) and how to prepare an environment that can run PyFlink.

Moved from  https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/installation.html.

Tutorial (Table API)

A tutorial that shows users how to set up a Python project, how to write jobs, and how to run them locally.

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/try-flink/python_table_api.html.

10 minutes to Table API

Briefly introduces the basic usage of the Python Table API, including the structure of Python Table API programs, how to create a TableEnvironment, how to create a Table, common Table operations, how to emit results, and so on. This document will not cover advanced content, but it should ensure that users have a basic understanding of every component of the Python Table API after reading it.

We can write this document based on https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/common.html.

TableEnvironment

Introduces how to create a batch/stream TableEnvironment with the Flink or Blink planner, and explains every interface of the TableEnvironment. Briefly describes what a planner is and the differences between the Flink planner and the Blink planner. Also briefly introduces how to configure the state backend, checkpointing and the restart strategy.

DataTypes

This document should be a Python version of https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/types.html.

Operations

Introduces all the Table operations. The content can be extracted from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html. The reason for creating a new document for Python Table operations is that there is too much Java/Scala-only content in the original document, and the links in the original document would guide users to other Java/Scala documents, which would confuse Python users.

Expression Syntax

Currently the content of this document is a reference to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html. In the future it will be replaced with the syntax of the Python Expression DSL.

Built-in Functions

Could be a reference to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/functions/systemFunctions.html.

Connectors

The reason for writing connector documentation for Python users separately is that using connectors in PyFlink is a little different from using them in Java/Scala, e.g. how to add the connector jars in a Python program. These documents will only introduce the Python-specific part of connector usage, and will guide users to the top-level "Connectors" section for detailed documentation.

User Defined Functions

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/python_udfs.html and https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/vectorized_python_udfs.html.

Dependency Management

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/dependency_management.html.

SQL

Could be a reference to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/. But we need to modify these documents slightly to ensure users cannot see any Java/Scala-only content when they click the "Python" tab on the pages.

Catalogs

This document should be a Python version of https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/catalogs.html.

ML Pipeline

An introduction of Python ML Pipeline API.

CEP

As the CEP jars are not placed in the "lib" directory of the release distribution, we need to tell users how to add the CEP jars in PyFlink, and then guide them to https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/streaming/match_recognize.html.

Metrics

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/metrics.html.

Configurations

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/python_config.html.

Environment Variables

If the environment variable FLINK_HOME is set, PyFlink will launch the Java gateway with the jars under $FLINK_HOME/lib. If the environment variable PYFLINK_CLIENT_EXECUTABLE is set, the Flink Java process will use $PYFLINK_CLIENT_EXECUTABLE as the Python interpreter to launch the Python client process when needed. We need a document to record these behaviors.
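The behavior can be illustrated with a small sketch (the paths below are assumptions, not recommendations):

```python
import os

# If FLINK_HOME is set, PyFlink launches the Java gateway with the jars under $FLINK_HOME/lib.
os.environ["FLINK_HOME"] = "/opt/flink-1.11.2"

# If PYFLINK_CLIENT_EXECUTABLE is set, the Flink Java process uses it as the Python
# interpreter when it needs to launch a Python client process.
os.environ["PYFLINK_CLIENT_EXECUTABLE"] = "/usr/bin/python3"

# The directory PyFlink would scan for jars in this setup.
lib_dir = os.path.join(os.environ["FLINK_HOME"], "lib")
print(lib_dir)  # prints "/opt/flink-1.11.2/lib"
```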

Debugging

During the execution of PyFlink programs, many PyFlink-specific log files are generated. These files are useful when PyFlink programs go wrong, so we need to tell users here where to find them.

DataStream API

This is a placeholder for DataStream API documentation.

Cookbook

We can place PyFlink best practices for common scenarios here.

FAQ

Moved from https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/python/common_questions.html.

API Docs

The generated sphinx doc of PyFlink, e.g. https://ci.apache.org/projects/flink/flink-docs-release-1.11/api/python/.

In addition, some examples in the API docs still use deprecated APIs. We need to keep those examples up to date.

Rejected Alternatives

N/A

Implementation plan

The details will be refined in FLINK-18775 by creating sub-tasks in JIRA.