...

Currently, there are various interfaces for file IO operations in Doris:

  • The query layer has FileReader and FileWriter, with corresponding implementations for HDFS, S3, Broker, and Local.
  • The storage layer has a BlockManager that abstracts blocks, with WriteableFileBlock and ReadableFileBlock.
  • For directory management, there is an Env interface that covers directory operations, with RemoteEnv and PosixEnv. BlockManager also handles some operations such as linking files and deleting blocks. In addition, for S3 and HDFS, classes such as S3StorageBackend contain file and directory operations, including mkdir, copy, and rm.


Having so many ways to perform IO causes the following problems:

  • It's messy: it is sometimes unclear which interface to use, and many functions are duplicated under different abstraction names.
  • Modifying a feature or fixing a bug requires changes in multiple places. For example, if we want to add a local cache for S3 reads, it has to be added everywhere S3 is read.
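
To make the second problem concrete, here is a minimal sketch of how a single read abstraction would let a cache be written once as a wrapper, rather than in every place that reads S3. All names here (Reader, read_at, FakeRemoteReader, CachedReader) are hypothetical and chosen for illustration only; they are not the existing Doris classes, and the in-memory "remote" stands in for a real backend.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical unified read interface (an assumption, not Doris's actual API).
class Reader {
public:
    virtual ~Reader() = default;
    // Reads up to `len` bytes starting at `offset` and returns them.
    virtual std::string read_at(size_t offset, size_t len) = 0;
};

// Stand-in for a remote backend such as S3; counts how often it is hit.
class FakeRemoteReader : public Reader {
public:
    explicit FakeRemoteReader(std::string data) : _data(std::move(data)) {}

    std::string read_at(size_t offset, size_t len) override {
        ++remote_reads;
        if (offset >= _data.size()) return "";
        return _data.substr(offset, len);  // substr clamps len to the end
    }

    int remote_reads = 0;

private:
    std::string _data;
};

// The cache is written once as a wrapper, so any Reader can be cached
// without touching the callers or the backend implementations.
class CachedReader : public Reader {
public:
    explicit CachedReader(std::shared_ptr<Reader> inner)
            : _inner(std::move(inner)) {}

    std::string read_at(size_t offset, size_t len) override {
        auto key = std::make_pair(offset, len);
        auto it = _cache.find(key);
        if (it != _cache.end()) return it->second;  // cache hit
        std::string bytes = _inner->read_at(offset, len);
        _cache.emplace(key, bytes);
        return bytes;
    }

private:
    std::shared_ptr<Reader> _inner;
    std::map<std::pair<size_t, size_t>, std::string> _cache;
};
```

A caller would wrap whichever backend it has, for example `CachedReader cached(std::make_shared<FakeRemoteReader>("..."))`, and repeated reads of the same range would then reach the backend only once. A real cache would also need eviction and range merging, which this sketch leaves out.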

We need to unify the IO stack to make it clearer and more extensible. In fact, IO access can be roughly divided into the following three types:

...


If we change the IO interface directly, it will affect many places. I will divide the work into two steps:

1. Rewrite the IO stack in entirely new files and leave the current implementations alone, for easier review.
2. Replace the current calls with the new IO stack.