...

Google Doc: <If the design in question is unclear or needs to be discussed and reviewed, a Google Doc can be used first to facilitate comments from others.>

Motivation

Currently, there are various interfaces for file IO operations in Doris:

  • The query layer has FileReader and FileWriter, with corresponding implementations for HDFS, S3, Broker, and Local.
  • The storage layer has a BlockManager that abstracts blocks, with WriteableFileBlock and ReadableFileBlock.
  • Directory management goes through the Env interface, which includes directory operations and is implemented by RemoteEnv and PosixEnv. BlockManager also contains some file-linking and block-deletion operations. In addition, for S3 and HDFS, classes such as S3StorageBackend provide file and directory operations such as mkdir, copy, and rm.

Having so many ways to perform IO causes the following problems:

  • It is messy: it is often unclear which interface to use, and much functionality is duplicated under different abstraction names.
  • Adding a feature or fixing a bug requires changes in multiple places. For example, if we want to add a local cache for S3 reads, every place that reads from S3 has to be modified.

We need to unify the IO stack to make it clearer and more extensible. In fact, IO access can be roughly divided into the following three types:

  • Directory operations: create files, delete files, list files, etc.
  • File write operations
  • File read operations

We can then implement these APIs for the different storage backends:

  • Local file
  • S3 file
  • HDFS file
  • Broker

Once implemented, the unified IO stack can be used in the storage layer (hot and cold data separation, separation of storage and compute), the query layer (querying S3 and HDFS), backup and recovery, etc.

When a new kind of file system is introduced, we only need to implement a new derived class for it, with no need to modify any interface in the upper layers.
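The following is a minimal sketch of how the three kinds of operations could be expressed as abstract interfaces, with a local backend as one derived implementation. All class and method names here (FileSystem, read_at, append, create_file, open_file, delete_file, list, LocalFileSystem, write_then_read) are assumptions made for illustration and are not taken from the Doris code base; error handling is reduced to return values to keep the example short.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <memory>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical interface for file read operations.
class FileReader {
public:
    virtual ~FileReader() = default;
    // Reads up to `len` bytes starting at `offset` and returns the bytes read.
    virtual std::string read_at(std::size_t offset, std::size_t len) = 0;
};

// Hypothetical interface for file write operations.
class FileWriter {
public:
    virtual ~FileWriter() = default;
    virtual void append(const std::string& data) = 0;
    virtual void close() = 0;
};

// Hypothetical interface for directory operations; also creates readers and writers.
class FileSystem {
public:
    virtual ~FileSystem() = default;
    virtual std::unique_ptr<FileWriter> create_file(const std::string& path) = 0;
    virtual std::unique_ptr<FileReader> open_file(const std::string& path) = 0;
    virtual bool delete_file(const std::string& path) = 0;
    virtual std::vector<std::string> list(const std::string& dir) = 0;
};

// Local backend. An S3, HDFS, or Broker backend would derive from the same bases.
class LocalFileReader : public FileReader {
public:
    explicit LocalFileReader(const std::string& path) : _in(path, std::ios::binary) {}
    std::string read_at(std::size_t offset, std::size_t len) override {
        std::string buf(len, '\0');
        _in.clear();  // reset EOF/fail state left by a previous read
        _in.seekg(static_cast<std::streamoff>(offset));
        _in.read(buf.data(), static_cast<std::streamsize>(len));
        buf.resize(static_cast<std::size_t>(_in.gcount()));  // keep only bytes actually read
        return buf;
    }

private:
    std::ifstream _in;
};

class LocalFileWriter : public FileWriter {
public:
    explicit LocalFileWriter(const std::string& path)
            : _out(path, std::ios::binary | std::ios::trunc) {}
    void append(const std::string& data) override {
        _out.write(data.data(), static_cast<std::streamsize>(data.size()));
    }
    void close() override { _out.close(); }

private:
    std::ofstream _out;
};

class LocalFileSystem : public FileSystem {
public:
    std::unique_ptr<FileWriter> create_file(const std::string& path) override {
        return std::make_unique<LocalFileWriter>(path);
    }
    std::unique_ptr<FileReader> open_file(const std::string& path) override {
        return std::make_unique<LocalFileReader>(path);
    }
    bool delete_file(const std::string& path) override {
        std::error_code ec;
        return std::filesystem::remove(path, ec);  // false if missing or on error
    }
    std::vector<std::string> list(const std::string& dir) override {
        std::vector<std::string> names;
        std::error_code ec;
        // On error the iterator equals the end iterator, so the loop is skipped.
        for (const auto& entry : std::filesystem::directory_iterator(dir, ec)) {
            names.push_back(entry.path().filename().string());
        }
        return names;
    }
};

// Upper-layer code depends only on the abstract FileSystem, so it does not
// change when another backend is plugged in.
std::string write_then_read(FileSystem& fs, const std::string& path, const std::string& data) {
    auto writer = fs.create_file(path);
    writer->append(data);
    writer->close();
    return fs.open_file(path)->read_at(0, data.size());
}
```

In this sketch, a function such as write_then_read accepts any FileSystem, so supporting a new storage backend means adding one more set of derived classes rather than touching the callers. A cross-cutting feature such as a local cache for remote reads could likewise be written once, as a FileReader that wraps another FileReader.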

Related Research

some research related to the function, such as the advantages and disadvantages of the design, related considerations, etc.

...

the detailed design of the function.

Scheduling


If we change the IO interface directly, it will affect many places. I will divide the work into two steps:

1. Rewrite the IO stack in entirely new files and leave the current implementations alone, to make reviewing easier.
2. Replace the current calls with the new IO stack.