Motivation
The current syntax for registering a user-defined function (UDF) that depends on a binary artifact requires the JAR keyword:
CREATE FUNCTION my_func AS 'com.myorg.MyUDF' USING JAR 'hdfs:///path/to/my.jar';
This syntax is limiting for a few key reasons:
- Inflexibility for Other Artifact Types: As Flink's ecosystem evolves, particularly with improved Python support (e.g., Python UDFs depending on wheel files or other archives), the JAR keyword becomes semantically incorrect and restrictive. We need a more generic way to specify a dependency artifact.
- Syntactic Inconsistency: To support other artifact types, we would need to introduce new keywords (e.g., USING MODEL, USING ZIP, USING ARCHIVE) or a different syntax entirely, such as a WITH clause. This would lead to an inconsistent DDL experience for users depending on the function's language.
- Verbosity: The JAR keyword is often redundant, as the URI itself ('path/to/my.jar') already indicates the artifact type.
This proposal aims to make the USING clause more generic and future-proof by adding ARTIFACT as a keyword in addition to JAR. This provides a cleaner, more consistent syntax for all types of function artifacts while maintaining full backward compatibility.
Table API Alignment and Consistency
This proposal to add the ARTIFACT keyword for SQL DDL is consistent with the more generic approach already adopted by Flink's programmatic Table API.
The underlying implementation in the Table API is already designed to be artifact-type agnostic, which aligns perfectly with the goal of future-proofing the SQL syntax.
- Java Table API: The FunctionCatalog.createFunction methods in the Java Table API already use a generic List<ResourceUri> parameter for registering functions, rather than a JAR-specific type. The SQL change effectively extends this generic concept to the DDL layer, making the user experience consistent whether defining functions via SQL or the Java Table API.
- Python Table API: The current create_java_function method relies on the Java classloader, which would need to be updated in the future.
This alignment ensures that users who work with both the SQL interface and the programmatic Table API will encounter a unified and predictable way of managing function dependencies.
Public Interfaces
The proposed change will affect the SQL DDL syntax for CREATE FUNCTION.
Current syntax:
CREATE [TEMPORARY|TEMPORARY SYSTEM] FUNCTION [IF NOT EXISTS] [catalog_name.][db_name.]function_name AS identifier [LANGUAGE JAVA|SCALA|PYTHON] [USING JAR '<path_to_filename>.jar' [, JAR '<path_to_filename>.jar']* ] [WITH (key1=val1, key2=val2, ...)]
Proposed syntax:
CREATE [TEMPORARY|TEMPORARY SYSTEM] FUNCTION [IF NOT EXISTS] [catalog_name.][db_name.]function_name AS identifier [LANGUAGE JAVA|SCALA|PYTHON] [USING JAR|ARTIFACT '<path_to_filename>.jar' [, JAR|ARTIFACT '<path_to_filename>.jar']* ] [WITH (key1=val1, key2=val2, ...)]
The key change is that ARTIFACT is accepted as an alternative to the JAR keyword. The behavior of the statement will be identical whether the keyword is JAR or ARTIFACT. No other public APIs or interfaces will be changed. In a statement with multiple files, it will be possible to mix JAR and ARTIFACT.
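For example, a single statement registering a function with multiple dependencies could use either keyword for each entry (the class name and paths below are illustrative):

```sql
-- Both keywords resolve to the same resource-registration behavior;
-- the function name and paths are hypothetical.
CREATE FUNCTION my_func AS 'com.myorg.MyUDF'
  USING ARTIFACT 'hdfs:///path/to/my.jar',
        JAR 'hdfs:///path/to/util.jar';
```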
Proposed Changes
The implementation will focus on modifying the Flink SQL parser.
- SQL Parser Adjustment: The grammar for the CREATE FUNCTION statement will be updated to recognize USING ARTIFACT <string_literal> as a valid clause, in addition to the existing USING JAR <string_literal>.
- Backend Logic: No significant changes are anticipated in the backend resource management logic. When the parser encounters a USING clause with the ARTIFACT keyword, it will process the provided URI and register it as a resource dependency for the function, exactly as it does for JAR. In the future, if different file types require separate handling, this can be based on the language type or the resource identifier.
The change is confined to the SQL parsing layer, treating the URI from USING ARTIFACT '...' identically to the URI from USING JAR '...'.
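To make the equivalence concrete, the following two statements would be parsed into the same function registration (the class name and path are illustrative):

```sql
-- Existing syntax, unchanged:
CREATE FUNCTION my_func AS 'com.myorg.MyUDF'
  USING JAR 'hdfs:///path/to/my.jar';

-- Proposed equivalent; the parser treats the URI identically:
CREATE FUNCTION my_func AS 'com.myorg.MyUDF'
  USING ARTIFACT 'hdfs:///path/to/my.jar';
```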
Compatibility, Deprecation, and Migration Plan
Compatibility
This change is fully backward compatible. All existing SQL statements that use the CREATE FUNCTION ... USING JAR '...' syntax will continue to work without any modification.
Deprecation
There is no plan to deprecate the USING JAR '...' syntax. It can be retained as an optional, explicit clarifier for users who prefer it.
Migration Plan
No migration is necessary. Users can adopt the new, shorter syntax at their convenience.
Test Plan
The change will be validated by extending the existing test suites for Flink SQL DDL.
Positive Test Cases
- Add a test case to create a Java/Scala function using the new syntax: CREATE FUNCTION ... USING ARTIFACT 'path/to/my.jar';. The test will then execute a query that invokes this function to verify it was registered and loaded correctly.
- Ensure that creating a function with the old syntax (USING JAR '...') continues to pass all existing tests (regression testing).
Negative Test Cases
- Add tests for invalid syntax variations (e.g., USING JARS '...', USING ARTIFACTS '...' '...') to ensure they are correctly rejected by the parser.
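As a sketch, statements along these lines (function name and paths hypothetical) should fail at parse time:

```sql
-- Pluralized keyword is not part of the grammar:
CREATE FUNCTION bad_func AS 'com.myorg.MyUDF'
  USING JARS 'hdfs:///path/to/my.jar';

-- Missing comma between two resource entries:
CREATE FUNCTION bad_func AS 'com.myorg.MyUDF'
  USING ARTIFACT 'hdfs:///a.jar' 'hdfs:///b.jar';
```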
Rejected Alternatives
Using a WITH clause for artifacts
A WITH clause was considered to provide a generic key-value configuration mechanism, which could be useful for other future properties.
CREATE FUNCTION myUDF AS 'com.example.MyUDF' WITH ( 'artifact.uri' = 'hdfs:///path/to/my.jar' )
Reason for Rejection: This approach creates a significant syntactic departure from the existing USING JAR clause. It would force users to learn a new syntax and lead to inconsistency, where some functions are defined with USING JAR and others with WITH. The goal is to evolve the current syntax, not introduce a competing one for the same purpose.
Removing the JAR keyword by making it optional
Another alternative was to make the JAR keyword optional.
CREATE FUNCTION myUDF AS 'com.example.MyUDF' USING 'hdfs:///path/to/my.jar';
Reason for Rejection: Given that there are other resources, such as connections or models, that are registered entities and could use the USING syntax, an unqualified URI would be confusing. Providing a generic alternative to JAR solves the same problem in a more explicit way.
Follow-up Work
While the SQL parser change is contained and requires no immediate changes to the backend resource management logic or the existing Table APIs, we recognize that enabling full functionality for non-JAR artifacts will require follow-up work on the user-facing APIs.
- Non-JAR Artifacts: The current Flink runtime primarily uses the Java classloader mechanism for resource loading, which inherently focuses on JAR files.
- Future Extensions: Subsequent work could focus on extending the Table API and the Flink runtime to specifically handle and utilize different artifact types (e.g., Python wheels, models, archives) based on the function's LANGUAGE or the resource identifier, enabling full end-to-end support for non-JAR dependencies.
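Once that follow-up work lands, one could imagine a Python UDF declared with a non-JAR dependency. This is a hypothetical future usage enabled by the follow-up runtime work, not by this proposal alone; the module, function, and path names are invented:

```sql
-- Hypothetical: requires the non-JAR resource loading described above.
CREATE FUNCTION py_func AS 'my_module.my_udf'
  LANGUAGE PYTHON
  USING ARTIFACT 'hdfs:///path/to/deps.zip';
```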