Status
Current state: Under Discussion
Discussion thread: https://lists.apache.org/thread/0xd7mk4lv5xpo8cgdvqpbslxj4lljrc8
JIRA: here (<- link to https://issues.apache.org/jira/browse/FLINK-XXXX)
Released: <Flink Version>
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
The primary goal of this FLIP is to enhance the usability of PyFlink for the Python developer community. The name is an ode to the Zen of Python (https://peps.python.org/pep-0020/). The current Python API for Flink can be challenging for users accustomed to the idiomatic Python libraries used for data transformations. With the number of Python downloads on PyPI reaching into the millions per week, the time is right to invest in making the Python API more Pythonic. This proposal aims to solve several key problems:
- Non-Pythonic API: The existing API contains patterns influenced by Java/Scala, which can be unintuitive for Python developers.
- Difficult Debugging: Error messages often expose underlying Java stack traces, making it hard to diagnose and fix issues from Python code.
- Inadequate Documentation: Documentation is spread across multiple locations, lacks comprehensive examples, and doesn't sufficiently guide users through common use cases.
- Poor Local Development Experience: Setting up a local development and testing environment is cumbersome, and interactive data exploration in tools like Jupyter notebooks is not seamless.
- Limited Ecosystem Integration: PyFlink could better integrate with popular Python libraries such as Pandas, NumPy, and various machine learning frameworks.
By addressing these issues, we aim to make PyFlink as intuitive and powerful for Python developers as other leading data processing frameworks, thereby improving adoption and developer productivity.
Public Interfaces
Below is a summary of the key changes to public interfaces.
New Interfaces:
- Convenience Methods: Introduction of user-friendly methods on the Table object for data preview, such as .show() and .display(), similar to those in other data-frame libraries.
- String Expression Class: A new string class will be added to expressions to provide familiar methods like .str.upper() instead of upper_case().
- Type Hinting: Comprehensive type hints will be added across the public API (e.g., Table, TableEnvironment) to improve IDE support for autocompletion and static error checking.
- Python-Native Types: Support using standard Python types (e.g., int, str) in function signatures for UDFs, which will be automatically converted to Flink's DataTypes.
- Migrate from Builder Pattern: Where possible, move from builder patterns to dataclasses, constructors, factory functions and context/configuration patterns.
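
As a rough illustration of the "Python-Native Types" item, the sketch below derives Flink type names from a function's annotations. It is plain Python and does not use PyFlink. The mapping table and the helper `infer_udf_types` are hypothetical names introduced here, and the specific choices (for example `int` to `BIGINT`) are assumptions rather than decisions made by this FLIP.

```python
import inspect
from typing import Callable, Dict, get_type_hints

# Assumed mapping from Python builtins to Flink SQL type names.
# The FLIP does not specify this mapping; it is illustrative only.
PYTHON_TO_FLINK_TYPE: Dict[type, str] = {
    int: "BIGINT",
    float: "DOUBLE",
    str: "STRING",
    bool: "BOOLEAN",
    bytes: "BYTES",
}


def to_flink_type(py_type: object) -> str:
    """Look up the assumed Flink type name for a Python type."""
    flink_type = PYTHON_TO_FLINK_TYPE.get(py_type)
    if flink_type is None:
        raise TypeError(f"no Flink type mapping for {py_type!r}")
    return flink_type


def infer_udf_types(func: Callable) -> Dict[str, object]:
    """Hypothetical helper: read a UDF's annotations and convert them."""
    hints = get_type_hints(func)
    params = list(inspect.signature(func).parameters)
    missing = [name for name in params if name not in hints]
    if missing or "return" not in hints:
        raise TypeError(f"{func.__name__} must annotate all parameters and its result")
    return {
        "input_types": [to_flink_type(hints[name]) for name in params],
        "result_type": to_flink_type(hints["return"]),
    }


def add_tax(price: float, quantity: int) -> float:
    return price * quantity * 1.2


print(infer_udf_types(add_tax))
```

A UDF decorator could call a helper like this when no explicit types are passed, while still accepting explicit DataTypes for cases the annotations cannot express.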
Changed Interfaces:
- Method to Attribute Conversion: Getter methods will be converted to properties where appropriate to follow Python conventions (e.g., using table.schema instead of table.get_schema()).
- Execution Consistency: The API for job submission will be unified to provide a consistent experience for both local (.wait() will no longer be required) and remote execution.
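
One possible shape for the method-to-attribute conversion is sketched below. The `Table` class here is a hypothetical stand-in written in plain Python, not the real `pyflink.table.Table`. It assumes the existing getter stays available and simply delegates to the new property.

```python
class Table:
    """Hypothetical stand-in for a PyFlink table; not the real class."""

    def __init__(self, schema):
        self._schema = schema

    @property
    def schema(self):
        # New, Pythonic access path: table.schema
        return self._schema

    def get_schema(self):
        # Existing getter kept as a thin wrapper so current code keeps working.
        return self.schema


table = Table(("id", "name"))
print(table.schema)
print(table.get_schema())
```

Keeping the getter as a wrapper fits the stated goal of additions only. Whether the old spelling should eventually emit a deprecation warning is not addressed by this sketch.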
Removed Interfaces:
No interfaces are currently expected to be removed. As the work evolves, some areas identified as non-Pythonic may require changes; in those cases, a best effort will be made to mirror the existing interface or maintain an escape hatch.
Proposed Changes
| Task Name/FLIP/Issue | Description |
|---|---|
| Reference table columns as attributes | Allow pandas-like table.<my-col> references in addition to col("<my-col>") for all Table API arguments |
| Kwargs aliasing | Allow polars-like table.agg(a_sum=<expr>) in addition to table.select(<expr>.alias("a_sum")) for providing named aliases via kwargs |
| Move getter methods to attributes where possible | Convert getter methods to properties where possible for more Pythonic access. |
| Using Python types as well as or instead of DataTypes | Allow users to specify Python types in function signatures, which are converted into Flink Types. |
| Move from Builder pattern to Python friendly patterns | Where possible move from the builder pattern that leaks from Java to Python-native patterns like dataclasses, constructors, factory functions and context/configuration patterns. |
| String methods | Add a string class to expressions with methods similar to Python and pandas. |
| Unraveling/Truncating Tracebacks | Capture and simplify JVM stack traces, showing only relevant information to the Python user. |
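
The first two rows of the table (column references as attributes and kwargs aliasing) could work roughly as sketched below. `Expr` and `Table` are hypothetical, minimal stand-ins written in plain Python, and expressions are rendered as strings only to make the result visible. The sketch uses `select`; the same keyword handling is assumed to apply to `agg`.

```python
class Expr:
    """Hypothetical minimal expression node that renders to text."""

    def __init__(self, text):
        self.text = text

    @property
    def sum(self):
        return Expr(f"SUM({self.text})")

    def alias(self, name):
        return Expr(f"{self.text} AS {name}")


class Table:
    """Hypothetical stand-in for a table with named columns."""

    def __init__(self, columns):
        self._columns = tuple(columns)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real methods
        # such as select() always take precedence over column names.
        if name.startswith("_") or name not in self._columns:
            raise AttributeError(name)
        return Expr(name)

    def select(self, *exprs, **named_exprs):
        # Keyword arguments become aliases: total=<expr> acts like
        # <expr>.alias("total").
        projected = list(exprs)
        projected += [expr.alias(name) for name, expr in named_exprs.items()]
        return [expr.text for expr in projected]


orders = Table(["price", "qty"])
print(orders.select(orders.qty, total=orders.price.sum))
```

Because attribute lookup prefers real methods, a column whose name collides with a method (for example a column called `select`) would still need the `col("<my-col>")` form, which is one reason to keep both spellings.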
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
  - Improvement of existing API surface area.
  - All functionality/API in use will continue to exist to minimize friction of adoption.
- If we are changing behavior how will we phase out the older behavior?
  - If there are issues preventing backwards compatibility we will make a migration guide available in the documentation.
- If we need special migration tools, describe them here.
  - See above
- When will we remove the existing behavior?
  - There are no plans to remove existing behavior, only additions.
Test Plan
All new surface area will be tested where possible.
Rejected Alternatives
N.A.