Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, or for load balancing[1].
Partitioning is widely used in Hive. In the ETL domain especially, most tables have partition attributes, which allow users to continue processing. Partitioning also makes data management more convenient; time partitioning and business partitioning are common.
Goal and non-goal
Goal
Table partitioning means dividing table data into some parts based on the values of particular columns.
- Flink only supports single-value list partitioning: a table can only be partitioned by the specific values of its partition columns. Hash, range, and other partitioning schemes are not supported.
- Both regular tables and temporary tables support partitioning. With PartitionableTableSource and PartitionableTableSink, users can perform the partitioned reads and writes described in this section on a temporary table.
Read:
- Partition pruning: partitioned tables support partition pruning, which means users can specify which partitions to read and avoid scanning the entire table.
- Regular read: without partition pruning, all partition data is read, and select * includes the partition columns.
Write: Flink does not require users to create partitions in advance, and partitions are created automatically during writing.
- Static partition write: users specify which partition to write to.
- Dynamic partition write: partitions are determined by the data itself, so many partitions may be generated.
- Streaming writes to partitions should support exactly-once semantics.
Connectors:
- Introduce a file system connector that supports partitions.
- Improve partition support in the Hive connector.
Non-Goal
- A queue may distinguish partitions through the partition concept of the underlying system (like Kafka partitions), but table partition support for streaming connectors such as queues (Kafka) is not a goal of this ticket.
- Bucket support, which would cover hash partitioning in traditional databases and similar schemes.
Background
Partition in traditional databases
Partitioning in traditional databases is very complex. They support rich partitioning criteria, including:
- list partition
- range partition
- hash partition
- subpartition
The DDL in traditional databases looks like:
CREATE TABLE pageview(
user VARCHAR(100),
cnt INT,
date VARCHAR(100))
PARTITION BY LIST (date) (
PARTITION day1 values('2019-8-28'),
PARTITION day2 values('2019-8-29'),
PARTITION day3 values('2019-8-30')
);
Note:
- date refers to a field defined in the DDL.
- Partition values need to be stored in the real data because these databases support rich partitioning criteria.
Partition in Hive
In today's big data systems, the partition concept mainly comes from Hive. A partition in Hive is similar only to the single-value list partition of traditional databases. There is no need to support rich partitioning criteria at present.
The CREATE DDL looks like:
CREATE TABLE page_view(
user STRING,
cnt INT)
PARTITIONED BY (date STRING);
Users can query with "where date = '2019-8-28'" to get high-performance partition pruning.
Note:
- date does not refer to a field defined in the DDL.
- A partition field cannot be included in the declared table fields; otherwise an error is raised.
- Partition field data is not stored in the real data. It is only used in the directory path.
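To illustrate how partition values live in the directory path rather than in the data files, here is a minimal sketch. The class and method names are hypothetical, the warehouse prefix is made up, and the sketch skips the escaping of special characters that a real implementation would need. It follows Hive's default layout, in which each partition directory is named column=value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration; not part of Hive or Flink. */
final class PartitionPathSketch {

    /** Joins an ordered partition spec into a relative directory path. */
    static String toRelativePath(LinkedHashMap<String, String> partitionSpec) {
        StringJoiner path = new StringJoiner("/");
        for (Map.Entry<String, String> entry : partitionSpec.entrySet()) {
            // One directory level per partition column, in declaration order.
            path.add(entry.getKey() + "=" + entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> spec = new LinkedHashMap<>();
        spec.put("date", "2019-8-28");
        // The table location used here is an assumption of this example.
        System.out.println("/warehouse/page_view/" + toRelativePath(spec));
    }
}
```

Because the value is encoded in the path, a reader can reconstruct the date column for every row of a file without the file itself containing it.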
Partition in Spark
Spark supports Hive's PARTITIONED BY when using the Hive catalog, and it also introduced its own PARTITIONED BY DDL when using the in-memory catalog. (The two usages are mutually exclusive.)
In SPARK-7654, Spark introduced a partition interface to the Dataset API.
In SPARK-14954, Spark introduced PARTITIONED BY to the CREATE TABLE DDL.
The DDL looks like:
CREATE TABLE page_view(
user STRING,
cnt INT,
date STRING)
PARTITIONED BY (date);
- date refers to a field defined in the DDL.
- However, partition field data is not stored in the real data (the FileFormat cannot see the partition columns). It is only used in the directory path, so the field indices of the real data differ from the definition in the CREATE DDL.
Disadvantage: this disrupts the format of the real data, and partition columns may sit in the middle of non-partition columns, which makes the real data look strange.
Partition Pruning
Hive/Spark partition pruning
Hive and Spark use the catalog for partition pruning. If MySQL is used as catalog storage, the partition filter is pushed down into the MySQL query.
This is the most efficient pruning method, and it puts less pressure on the catalog and the client.
Databricks delta partition pruning
Databricks Delta is a transactional storage layer designed specifically for Apache Spark and the Databricks File System. It doesn't have a catalog and focuses on transactions. It does partition pruning by launching a Spark SQL job: it first reads the checkpoint and changeLog to get the current list of readable files, then filters that list according to the condition to get the final partitions.
The main reason for launching a job is that partition pruning is heavy in Delta: it needs to merge the checkpoint and changeLog, and there may be many small files, so a Spark SQL job is needed to complete it.
Proposed Change
Partition SQL
At present, the partitioning we want to support is similar to Hive's and only covers single-value list partitions.
static partitioning insert
Users can specify the value of partition while inserting the data:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
- PARTITION clause should contain all partition columns of this table.
- The fields returned in this select statement should not contain any of the partition columns.
- INSERT OVERWRITE will overwrite any existing data in the table or partition
- INSERT INTO will append to the table or partition, keep the existing data intact
- Both INSERT INTO and INSERT OVERWRITE will create a new partition if the target static partition doesn't exist.
For example:
INSERT INTO TABLE country_page_view PARTITION (date='2019-8-30', country='china') SELECT user, cnt FROM country_page_view_source;
This will create a new partition in country_page_view and insert all data from country_page_view_source into this partition. Users can verify it with the command:
➜ ~ SHOW PARTITIONS country_page_view;
date='2019-8-30',country='china'
dynamic partitioning insert
In dynamic partition inserts, users can give a partial partition specification, which means specifying only some partition column values in the PARTITION clause or omitting the PARTITION clause entirely. The engine then determines the partitions dynamically from the values of the partition columns in the source table. In other words, dynamic partition creation is driven by the values of the input columns.
INSERT INTO TABLE country_page_view SELECT user, cnt, date, country FROM country_page_view_source;
With this method, the engine determines the distinct values that the partition columns (i.e. date and country) hold in the source table, and creates a partition for each distinct combination of values.
Different from Hive 2.x and earlier: dynamic partition columns do not need to appear in the PARTITION clause. (Hive 3.0.0 also supports this, see HIVE-19083.)
In Hive 2.x, users need to declare the dynamic partition columns in the PARTITION clause like this:
INSERT INTO TABLE country_page_view PARTITION (date, country) SELECT user, cnt, date, country FROM country_page_view_source;
Partially specified partition columns values are also supported:
INSERT INTO TABLE country_page_view PARTITION (date='2019-8-30') SELECT user, cnt, country FROM country_page_view_source;
NOTE:
- The dynamic partition columns must be specified last among the columns in the SELECT statement
- The dynamic partition columns must be in the same order in which they appear in the DDL of CREATE TABLE.
- Because of the existence of dynamic partitioning, we will stuff both static and dynamic columns into Row, so the data received by the sink contains all partition columns.
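A minimal sketch of how static and dynamic partition values could be combined into the row handed to the sink. The class and method names are hypothetical, and placing the partition columns after the regular fields is an assumption of this sketch rather than something the proposal fixes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not a Flink API. */
final class SinkRowSketch {

    /**
     * @param partitionColumns partition columns in CREATE TABLE order
     * @param staticSpec       values given in the PARTITION clause
     * @param selectRow        SELECT output: regular fields first, then the
     *                         dynamic partition values in CREATE TABLE order
     */
    static List<Object> toSinkRow(
            List<String> partitionColumns,
            Map<String, String> staticSpec,
            List<Object> selectRow) {
        int dynamicCount = 0;
        for (String column : partitionColumns) {
            if (!staticSpec.containsKey(column)) {
                dynamicCount++;
            }
        }
        int regularCount = selectRow.size() - dynamicCount;
        if (regularCount < 0) {
            throw new IllegalArgumentException(
                    "SELECT must end with all dynamic partition columns");
        }
        // Regular fields are copied as-is.
        List<Object> sinkRow = new ArrayList<>(selectRow.subList(0, regularCount));
        int nextDynamic = regularCount;
        for (String column : partitionColumns) {
            if (staticSpec.containsKey(column)) {
                sinkRow.add(staticSpec.get(column));        // static value from the clause
            } else {
                sinkRow.add(selectRow.get(nextDynamic++));  // dynamic value from the data
            }
        }
        return sinkRow;
    }

    public static void main(String[] args) {
        List<Object> selectRow = Arrays.<Object>asList("alice", 3, "china");
        System.out.println(toSinkRow(
                Arrays.asList("date", "country"),
                Collections.singletonMap("date", "2019-8-30"),
                selectRow));
    }
}
```

The sketch relies on the two ordering rules in the notes: dynamic partition columns come last in the SELECT and follow the CREATE TABLE order.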
Behavior of dynamic partition INSERT OVERWRITE, two options:
- Delete all partition directories that match the static partition values provided in the insert statement (Spark behavior).
- Only delete partition directories that have data written into them (Hive behavior).
This is implementation-related; we recommend Hive's behavior.
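The difference between the two behaviors can be sketched as two small functions that compute which existing partition directories get deleted. Everything here is hypothetical: partitions are represented as relative paths such as date=2019-8-30/country=china, and the static partition values are assumed to form a path prefix (static columns before dynamic ones).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical comparison of the two overwrite behaviors; not a Flink API. */
final class OverwriteSketch {

    /** Spark-style: delete every existing partition under the static prefix. */
    static Set<String> sparkStyle(List<String> existing, String staticPrefix) {
        Set<String> toDelete = new LinkedHashSet<>();
        for (String partition : existing) {
            if (matches(partition, staticPrefix)) {
                toDelete.add(partition);
            }
        }
        return toDelete;
    }

    /** Hive-style: delete only existing partitions that this job wrote into. */
    static Set<String> hiveStyle(List<String> existing, Set<String> written) {
        Set<String> toDelete = new LinkedHashSet<>();
        for (String partition : existing) {
            if (written.contains(partition)) {
                toDelete.add(partition);
            }
        }
        return toDelete;
    }

    private static boolean matches(String partition, String prefix) {
        // An empty prefix means no static values were given, so everything matches.
        return prefix.isEmpty()
                || partition.equals(prefix)
                || partition.startsWith(prefix + "/");
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList(
                "date=2019-8-30/country=china",
                "date=2019-8-30/country=us",
                "date=2019-8-29/country=china");
        Set<String> written = Collections.singleton("date=2019-8-30/country=china");
        System.out.println(sparkStyle(existing, "date=2019-8-30"));
        System.out.println(hiveStyle(existing, written));
    }
}
```

With Hive's behavior, a partition that receives no rows in the current job keeps its old data, which is why it is the less destructive choice.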
external partitioned tables
If partition data already exists on the file system and we want to load it into the Flink catalog, we need a grammar for adding partitions.
Suppose we have a table country_page_view that is a file table located at '/user/flink/country_page_view', and we have some data for partition (2019-8-30, china) that we want to load into the Flink catalog. We can do:
- File system operation: move data to '/user/flink/country_page_view/2019-8-30/china/'
- ALTER TABLE country_page_view ADD PARTITION (date='2019-8-30', country='china');
NOTE: Using external partitioned tables is optional. Files on the file system can also be loaded into managed non-partitioned tables, from which the data can be inserted into partitioned tables. But with external partitioned tables, users avoid reading and writing the real data, which can greatly improve performance.
Partition Read
Partition pruning
One of the most significant benefits of partitioning is partition pruning. Users can specify the partitions to read through standard filter conditions, which can greatly improve read efficiency.
Current blink partition pruning:
FLINK-5859, FLINK-12805 and FLINK-13115 already introduced PartitionableTableSource to Flink and implemented it in the blink planner.
Advantages and disadvantages:
- The engine automatically prunes partitions based on the filters and partition columns. The source doesn't need to do anything.
- The table source needs to provide all partition values.
- The problem is that every partition pruning needs to fetch all partition values. When there are thousands of partitions, this puts a lot of pressure on the catalog (for example, MySQL storage).
How to do partition pruning depends entirely on the TableSource's own implementation:
- The table source can use the catalog to do partition pruning. For example, the Hive table source can access its catalog from the time HiveTableFactory creates it.
- Without a catalog, the table source lists subdirectories and filters them by name.
How to do partition pruning depends on the table:
- The table is a catalog table: the planner uses the catalog to do partition pruning.
- The table is a temporary table: the planner uses the all_partitions returned by the temporary table and filters them by name.
Add Catalog API:
List<CatalogPartitionSpec> listPartitionsByFilter(ObjectPath tablePath, List<Expression> filters)
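For the temporary-table case, filtering by name could look roughly like the sketch below. It is a simplification under stated assumptions: partitions are plain maps from column name to value, and filters are restricted to equality conditions, whereas the proposed listPartitionsByFilter takes general Expression filters. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical equality-only pruning for illustration; not the planner's code. */
final class PartitionPruningSketch {

    /** Keeps the partitions whose values match every equality filter. */
    static List<Map<String, String>> prune(
            List<Map<String, String>> allPartitions,
            Map<String, String> equalityFilters) {
        List<Map<String, String>> remaining = new ArrayList<>();
        for (Map<String, String> partition : allPartitions) {
            boolean keep = true;
            for (Map.Entry<String, String> filter : equalityFilters.entrySet()) {
                if (!filter.getValue().equals(partition.get(filter.getKey()))) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                remaining.add(partition);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("date", "2019-8-28");
        first.put("country", "china");
        Map<String, String> second = new HashMap<>();
        second.put("date", "2019-8-29");
        second.put("country", "china");
        System.out.println(prune(
                Arrays.asList(first, second),
                Collections.singletonMap("date", "2019-8-28")));
    }
}
```

This approach has to materialize every partition first, which is exactly the cost that pushing the filter into the catalog avoids.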
Without Partition pruning
If the table is a partitioned catalog table, all partitions registered in the catalog are read. Users can tell which partition a record belongs to from the partition columns in the data.
Partition write
Static Partition
Static partition writing is basically the same as non-partitioned writing. The only difference is that the directory of the final file needs to contain a subdirectory of the partition.
Dynamic Partition
We have already talked about the grammar of dynamic partitioning, and this time we will focus on its implementation and its impact on the sink interface.
There are two kinds of writing formats:
- Small write buffers: like CSV/text. In this case, with dynamic partitioning, a sink task can write multiple files simultaneously.
- Large write buffers: like ORC/Parquet. In this case, with dynamic partitioning, a sink task cannot write multiple files simultaneously; otherwise the memory consumption will lead to OOM.
Sink implementations should provide three writers:
- single-partition writer: writes data to a single partition (non-dynamic-partition writes).
- grouped multi-partition writer: inputs are grouped by dynamic partition, so only one partition is open for writing at any time.
- ungrouped multi-partition writer: writes multiple partitions at the same time, which consumes more memory.
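The grouped multi-partition writer can be sketched as follows. This is an illustration only: the class is hypothetical, and in-memory lists stand in for files and format writers. The point is that grouped input lets the sink hold a single open writer and switch when the partition changes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical grouped writer for illustration; lists stand in for files. */
final class GroupedWriterSketch {

    /** Partition path mapped to the records written into it. */
    final Map<String, List<String>> files = new LinkedHashMap<>();
    /** How many times a writer was opened. */
    int writersOpened = 0;

    private String currentPartition;
    private List<String> currentFile;

    void write(String partition, String record) {
        if (!partition.equals(currentPartition)) {
            // Input is grouped, so the previous partition is complete:
            // drop its writer and open one for the new partition.
            currentPartition = partition;
            currentFile = files.computeIfAbsent(partition, key -> new ArrayList<>());
            writersOpened++;
        }
        currentFile.add(record);
    }

    public static void main(String[] args) {
        GroupedWriterSketch writer = new GroupedWriterSketch();
        writer.write("date=2019-8-29", "a");
        writer.write("date=2019-8-29", "b");
        writer.write("date=2019-8-30", "c");
        System.out.println(writer.files + ", writers opened: " + writer.writersOpened);
    }
}
```

An ungrouped writer would instead keep one open writer per partition seen so far, which is what makes it memory-hungry for formats with large buffers.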
Streaming partition write
Scenes
There are many scenarios where data is written to a FileSink by a streaming job, while the same data is analyzed and computed by batch jobs.
- Static partition writing to the sink.
- Dynamic partition writing:
- Partitioned by window time, either event time or processing time. Without a trigger, the partition column increases monotonically.
- Partitioned by regular columns.
Exactly-once semantics
Like StreamingFileSink, the table sink should integrate with the checkpointing mechanism to provide exactly-once semantics.
A file can be in one of three states: in-progress, pending or finished. The file that is currently being written to is in-progress. Once a file is closed for writing, it becomes pending. When a checkpoint succeeds, the currently pending files are moved to finished.
StreamingFileSink already does a lot of good work:
- Decouple checkpoint from file size. It provides an abstraction of RollingPolicy to determine file size. On snapshot, it will not only store the pending files, but also store in-progress files. In case of a failure, it will restore the pending files, and restore in-progress files too. (In-progress files will be truncated to discard the content that does not belong to that checkpoint. This is achieved by using RecoverableWriter.)
Temp files and renaming versus recoverable writer:
- Either way, file visibility still depends on the checkpoint finish time.
- Complex formats, such as Hive's, can hardly meet the requirements of a recoverable writer. (Hive only provides an abstract RecordWriter, which hardly supports the features above: flushing to the file system and recording the file offset on snapshot, and truncating redundant file contents on recovery.)
To simplify the current implementation, we only consider the case where file size depends on the checkpoint.
- snapshotState(cpId): The file currently being written changes from in-progress to pending. The pending files (including the files of all unfinished checkpoints) are stored in operator state.
- notifyCheckpointComplete(cpId): Move all pending files whose checkpoint id is less than or equal to cpId to the target directory; the corresponding files then become finished.
- HiveFormat's problem: at this stage, HiveFormat needs to access the Metastore to make the files visible. Only the task side can run logic in notifyCheckpointComplete, which leads to distributed access to the Metastore and puts pressure on it.
- initializeState(restore): Copy the pending files from state to memory.
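The three callbacks can be sketched as a small in-memory state machine. This is a simplified illustration, not the prototype's code: the class is hypothetical, file names stand in for real files, and moving a file to the target directory is modeled as adding its name to a finished list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of the in-progress/pending/finished protocol. */
final class PendingFilesSketch {

    private final List<String> inProgress = new ArrayList<>();
    /** Checkpoint id mapped to the files that become visible once it completes. */
    private final TreeMap<Long, List<String>> pending = new TreeMap<>();
    /** Files already moved to the target directory. */
    final List<String> finished = new ArrayList<>();

    void openFile(String file) {
        inProgress.add(file);
    }

    /** in-progress becomes pending; returns a copy to store in operator state. */
    TreeMap<Long, List<String>> snapshotState(long cpId) {
        pending.put(cpId, new ArrayList<>(inProgress));
        inProgress.clear();
        TreeMap<Long, List<String>> state = new TreeMap<>();
        for (Map.Entry<Long, List<String>> entry : pending.entrySet()) {
            state.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return state;
    }

    /** pending becomes finished for every checkpoint id up to and including cpId. */
    void notifyCheckpointComplete(long cpId) {
        Map<Long, List<String>> completed = pending.headMap(cpId, true);
        for (List<String> files : completed.values()) {
            finished.addAll(files);
        }
        completed.clear(); // the view is backed by pending, so this removes them
    }

    /** Copies the pending files from restored state back into memory. */
    void initializeState(Map<Long, List<String>> restored) {
        pending.clear();
        pending.putAll(restored);
    }

    public static void main(String[] args) {
        PendingFilesSketch sink = new PendingFilesSketch();
        sink.openFile("part-0");
        sink.snapshotState(1L);
        sink.notifyCheckpointComplete(1L);
        System.out.println(sink.finished);
    }
}
```

Keeping pending files keyed by checkpoint id is what allows a late notification for checkpoint n to also commit the files of earlier checkpoints whose notifications were lost.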
Partition support
Streaming writes support both static partition tables and dynamic partition tables. Static partition tables are simple: they behave just like regular tables, except that the path is first determined by the static partition.
For dynamic partition tables:
- Partitioned by a monotonic column (like window time): the implementation should be the same as the batch grouped multi-partition writer, with only one writer open at a time.
- Partitioned by regular columns: in streaming, upstream cannot sort all the data, so:
- Open multiple writers at the same time. If the file format is CSV or text, or the number of partitions is small, this is no problem. With Parquet or ORC, it will consume too much memory.
- (Nice to have) Accumulate the data of a single checkpoint, wait until snapshot, sort all the data, and write the partitions one by one.
FileSystemSink
Considering the stream writing and the mechanism of dynamic partitioning, we need to implement a FileSink to handle the relevant logic. Subsequent Flink file-related connectors and HiveSink can be unified into this sink. Formats only need to implement the relevant interface, without dealing with streaming exactly-once and partition-related logic.
- Support single-partition writing
- Support grouped multi-partition writing
- Support non-grouped multi-partition writing
- StreamingFileSystemSink supports streaming exactly-once
We do not recommend using StreamingFileSink to support partitioning in Table:
- Its bucket concept and SQL's bucket concept are in serious conflict.
- In Table, we need to support single-partition writing, grouped multi-partition writing and non-grouped multi-partition writing.
- We need a global role to commit files to the metastore.
- We need an abstraction that supports both streaming and batch mode.
- Table partitioning is simpler than StreamingFileSink: we only support partitioning by field references, rather than being as flexible as the runtime.
Flink FileSystem connector
The DDL can look like this:
CREATE TABLE USER_T
......
WITH (
'connector.type' = 'filesystem',
'connector.path' = 'hdfs:///tmp/xxx',
'format.type' = 'csv',
'update-mode' = 'append',
'partition-support' = 'true'
)
The only difference from the previous FileSystem connector is that the partition-support attribute is required. We can use this flag to indicate the new partition-aware connector without changing the previous connector. Other attributes can stay completely consistent.
'partition-support' = 'true' can be removed once we fully support the CSV format.
And provide table factories:
- Provide FileSystemTableFactory: Csv format and Hive format will use it.
- Provide FileSystemTableSink and FileSystemTableSource
- Provide BatchFileSystemSink and StreamingFileSystemSink
Formats just need to implement:
- InputFormat for read
- RecordWriter and FileCommitter for write.
A specific format implementation does not need to deal much with the partition concept; it only manages its own reading and writing.
Code prototype: https://github.com/JingsongLi/flink/tree/filesink/flink-table/flink-table-api-java-bridge/src/main/java/org/apache/flink/table/sink/filesystem
Catalog changes
HiveCatalog
CatalogTable and CatalogPartition should cover the requirements of HiveTableSource/HiveTableSink (like Hive's StorageDescriptor). More properties from HiveCatalog should be added to the map in CatalogPartition:
- String location;
- String inputFormat;
- String outputFormat;
- String serializationLib;
- boolean compressed;
Partition statistics
- First, the planner should support statistics of catalog tables.
- The planner should read partition statistics and pass them to the query optimizer.
- Related: FilterableTableSource needs to update statistics too.
Public Interfaces
DML
static partition writing:
INSERT { INTO | OVERWRITE } TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
dynamic partition writing:
INSERT { INTO | OVERWRITE } TABLE tablename1 select_statement1 FROM from_statement;
If no partition value is specified, or only some are specified, it is a dynamic partition write.
alter partitions
ALTER TABLE table_name ADD PARTITION partition_spec [, PARTITION partition_spec];
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
-- Move partition from table_name_1 to table_name_2
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_1;
-- multiple partitions
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_1;
ALTER TABLE table_name DROP PARTITION partition_spec[, PARTITION partition_spec, ...];
partition_spec ::= (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Show
SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order.
SHOW PARTITIONS table_name;
It is also possible to specify parts of a partition specification to filter the resulting list.
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');
Nice to have:
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
Describe
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name [PARTITION partition_spec] [col_name];
Catalog interface
public interface Catalog {
…..
void renamePartition(ObjectPath tablePath, CatalogPartitionSpec spec, CatalogPartitionSpec newSpec) throws PartitionNotExistException, PartitionAlreadyExistsException, CatalogException;
void syncPartitions(ObjectPath tablePath) throws TableNotExistException, CatalogException;
List<CatalogPartitionSpec> listPartitionsByFilter(ObjectPath tablePath, List<Expression> filters) throws TableNotExistException, TableNotPartitionedException, CatalogException;
}
Further discussion
Create DDL
Should we support partition grammar like Spark SQL? (Subsequent votes will be taken to determine.)
CREATE TABLE country_page_view(
user STRING,
cnt INT)
PARTITIONED BY (date STRING, country STRING);
The table will be partitioned by two fields.
Recover Partitions (MSCK REPAIR TABLE)
Flink stores a list of partitions for each table in its catalog. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the catalog will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.[3]
However, users can run a command with the repair table option:
MSCK REPAIR TABLE table_name;
which will update the catalog with the partitions for which catalog metadata doesn't already exist. The default option for the MSCK command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in the catalog.
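The ADD PARTITIONS option amounts to a set difference between the partition directories found on the file system and the partitions the catalog already knows. A minimal sketch, assuming partitions are identified by their relative paths and using hypothetical names throughout:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical repair step for illustration; not a Flink or Hive API. */
final class RepairPartitionsSketch {

    /** Returns the partitions present on the file system but missing in the catalog. */
    static Set<String> partitionsToAdd(Set<String> onFileSystem, Set<String> inCatalog) {
        Set<String> missing = new TreeSet<>(onFileSystem); // sorted for stable output
        missing.removeAll(inCatalog);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> onFileSystem =
                new HashSet<>(Arrays.asList("date=2019-8-29", "date=2019-8-30"));
        Set<String> inCatalog = Collections.singleton("date=2019-8-29");
        System.out.println(partitionsToAdd(onFileSystem, inCatalog));
    }
}
```

Handling directories removed from the file system would be the mirror image: the partitions in the catalog minus those on the file system.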
TableSink Interface
public interface PartitionableTableSink {
List<String> getPartitionFieldNames();
// set the static partition into the TableSink.
void setStaticPartition(Map<String, String> partitions);
// get dynamic partition column names.
List<String> getDynamicPartitionFieldNames();
// If this returns true, the sink can trust that all records will be grouped by partition fields before being consumed, so the sink can use its "grouped multi-partition writer". If it returns false, there is no need to do partition grouping.
// If this method is never invoked, the execution mode (streaming mode) doesn't support grouping, and the sink should use its "ungrouped multi-partition writer" when there are dynamic partitions.
boolean enableDynamicPartitionGrouping();
}
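How a sink might pick among its three writers from this interface can be sketched as below. The enum, class and method names are hypothetical. The groupingEnabled flag is assumed to be true only when the planner invoked enableDynamicPartitionGrouping() and the sink returned true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical writer selection for illustration; not part of the proposed interface. */
final class WriterChoiceSketch {

    enum WriterKind {
        SINGLE_PARTITION,
        GROUPED_MULTI_PARTITION,
        UNGROUPED_MULTI_PARTITION
    }

    /** Dynamic partition fields are the partition fields without a static value. */
    static List<String> dynamicPartitionFields(
            List<String> partitionFields, Map<String, String> staticPartition) {
        List<String> dynamic = new ArrayList<>();
        for (String field : partitionFields) {
            if (!staticPartition.containsKey(field)) {
                dynamic.add(field);
            }
        }
        return dynamic;
    }

    static WriterKind choose(List<String> dynamicFields, boolean groupingEnabled) {
        if (dynamicFields.isEmpty()) {
            // All partition values are static: every record goes to one partition.
            return WriterKind.SINGLE_PARTITION;
        }
        return groupingEnabled
                ? WriterKind.GROUPED_MULTI_PARTITION
                : WriterKind.UNGROUPED_MULTI_PARTITION;
    }

    public static void main(String[] args) {
        List<String> dynamic = dynamicPartitionFields(
                Arrays.asList("date", "country"),
                Collections.singletonMap("date", "2019-8-30"));
        System.out.println(dynamic + " -> " + choose(dynamic, true));
    }
}
```

In streaming mode the method is never invoked, so the flag stays false and the sink falls back to the ungrouped writer whenever dynamic partitions exist.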
Road map
- Modify DDL support.
- Rework partition pruning
- Rework dynamic partitioning
- Introduce FileSystemTableFactory
- Introduce BatchFileSystemSink
- Introduce StreamingFileSystemSink
- Introduce FileSystemTableFactory and FileSystemTableSource and FileSystemTableSink
- Introduce new CSV for FileSystemTableFactory
- Integrate Hive to FileSystemTableFactory
Nice to have:
- Integrate Create table DDL(with partition) to Hive
- push down partition pruning to hive metastore
- Introduce alter partitions commands
- Introduce recover partitions commands
- Introduce show/describe partitions commands
- Integrate partition statistics to planner
Reference
[1] https://en.wikipedia.org/wiki/Partition_(database)
[3] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
[4] https://resources.zaloni.com/blog/partitioning-in-hive
[5] https://issues.apache.org/jira/browse/FLINK-5859