Status

Current state: Released

Discussion thread: 

JIRA or Github Issue: 

Released: Doris 1.2

Google Doc: 

...

Therefore, the purpose of this refactor is to:

Unify the general logic of scanning tasks, so that when we add new data sources we no longer need to write the same processing logic repeatedly, and all new data sources can benefit from the existing logic.
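The intended split can be sketched as follows. This is a hypothetical, simplified illustration (the names `ScanNode`, `Scanner`, `Block`, and `VectorScanner` are stand-ins, not the actual Doris classes): the base node owns the generic scan loop once, while each data source only implements how to produce blocks.

```cpp
// Hypothetical sketch of the shared scan framework: the base class owns the
// generic logic, while each data source only implements block production.
#include <memory>
#include <vector>

struct Block { std::vector<int> rows; };  // stand-in for a Doris vectorized Block

class Scanner {  // per-data-source reading logic
public:
    virtual ~Scanner() = default;
    virtual bool get_next(Block* block) = 0;  // false when exhausted
};

class ScanNode {  // shared logic lives here exactly once
public:
    explicit ScanNode(std::unique_ptr<Scanner> scanner)
            : _scanner(std::move(scanner)) {}

    // Generic loop that every source (Olap, File, Es, Jdbc, Schema) reuses.
    std::vector<int> read_all() {
        std::vector<int> out;
        Block block;
        while (_scanner->get_next(&block)) {
            out.insert(out.end(), block.rows.begin(), block.rows.end());
        }
        return out;
    }

private:
    std::unique_ptr<Scanner> _scanner;
};

// A toy source yielding two blocks, standing in for e.g. a file reader.
class VectorScanner : public Scanner {
public:
    bool get_next(Block* block) override {
        if (_calls >= 2) return false;
        block->rows = {_calls * 10, _calls * 10 + 1};
        ++_calls;
        return true;
    }

private:
    int _calls = 0;
};
```

Adding a new data source then amounts to writing one `Scanner` subclass; the scheduling, looping, and downstream handling stay untouched.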

...

  • OlapScanNode: Completed
  • FileScanNode: (end of September)
FileScanNode will support both the previous FileScanNode's query capabilities for external tables (Hive, Iceberg, Hudi, etc.) and the BrokerScanNode's load capabilities. After that, the BrokerScanNode will be removed.
  • EsScanNode (October)
  • JdbcScanNode (October)
  • SchemaScanNode (October)


Most of the code is in the be/src/vec/exec/scan/ directory.

New Parquet Reader

This is another work, but related to the refactoring of the scan node.

Doris has implemented the multi-catalog feature and supports reading external data sources such as Hive, Iceberg, and Hudi. These data sources are read through FileScanNode.

The current FileScanNode, when reading the Parquet format, uses the parquet-cpp implementation from Apache Arrow.

However, this implementation has the following problems:

1. To read data, it must first be converted to the Arrow format and then to the Doris format, which costs an extra memory copy and type conversion.
2. parquet-cpp does not support newer Parquet features such as the Page Index and BloomFilter.
3. In-depth optimizations are not possible, such as predicate filtering on dictionary values for dictionary-encoded columns.

For the above reasons, we reimplemented a Parquet Reader to replace parquet-cpp.
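The dictionary-value optimization mentioned in point 3 can be illustrated with a minimal sketch. All names here (`eval_on_dict`, `filter_codes`) are hypothetical, not actual reader APIs: the idea is to evaluate a predicate once per dictionary entry, then filter rows using only their integer codes instead of comparing decoded strings row by row.

```cpp
// Hypothetical sketch of predicate filtering on dictionary values:
// run the predicate O(|dict|) times, then filter rows by integer code.
#include <string>
#include <vector>

// Evaluate an equality predicate against each dictionary entry once,
// returning which dictionary codes survive.
std::vector<bool> eval_on_dict(const std::vector<std::string>& dict,
                               const std::string& equals_value) {
    std::vector<bool> keep(dict.size(), false);
    for (size_t i = 0; i < dict.size(); ++i) {
        keep[i] = (dict[i] == equals_value);  // string compare: once per entry
    }
    return keep;
}

// Filter the encoded column with cheap integer lookups only.
std::vector<int> filter_codes(const std::vector<int>& codes,
                              const std::vector<bool>& keep) {
    std::vector<int> selected;  // indices of surviving rows
    for (size_t row = 0; row < codes.size(); ++row) {
        if (keep[codes[row]]) selected.push_back(static_cast<int>(row));
    }
    return selected;
}
```

For a column with millions of rows but a small dictionary, this replaces per-row string comparisons with per-row array lookups, which is the kind of optimization parquet-cpp's abstraction does not expose.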

The new Parquet Reader can support richer predicate pushdown, filter more pages in one pass, and reduce IO.

This work will be done by the end of September.