...
Currently, there are various interfaces for file IO operations in Doris:
- The query layer has FileReader and FileWriter, with corresponding implementations for HDFS, S3, Broker, and Local.
- The storage layer has a BlockManager that abstracts blocks, with WriteableFileBlock and ReadableFileBlock.
- For directory management, there is an Env interface that covers directory operations, with RemoteEnv and PosixEnv implementations. BlockManager also has some operations of its own, such as linking files and deleting blocks. In addition, for S3 and HDFS there are classes such as S3StorageBackend that provide file and directory operations, including mkdir, copy, and rm.
Having so many ways to do IO causes the following problems:
- It is messy. It is sometimes unclear which interface to use, and many functions are duplicated under different abstraction names.
- Adding a feature or fixing a bug requires changes in multiple places. For example, if we want a local cache for S3 reads, it has to be added in every place that reads from S3.
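
To make the local cache example concrete, here is a minimal sketch of how a single reader interface would let a cache be written once and wrapped around any backend. All names here (`FileReader::read_at`, `FakeRemoteReader`, `CachedReader`) are hypothetical and chosen only for illustration; they are not the existing Doris classes or the final design, and the "remote" backend is just an in-memory string.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical unified read interface (illustrative, not the Doris API).
class FileReader {
public:
    virtual ~FileReader() = default;
    // Reads up to `len` bytes starting at `offset` into `out`.
    // Returns the number of bytes actually read.
    virtual size_t read_at(size_t offset, char* out, size_t len) = 0;
};

// Stand-in for a remote backend such as S3 or HDFS, backed by memory here.
class FakeRemoteReader : public FileReader {
public:
    explicit FakeRemoteReader(std::string data) : _data(std::move(data)) {}

    size_t read_at(size_t offset, char* out, size_t len) override {
        ++remote_reads;  // count how often the "remote" side is hit
        if (offset >= _data.size()) return 0;
        size_t n = std::min(len, _data.size() - offset);
        if (n > 0) std::memcpy(out, _data.data() + offset, n);
        return n;
    }

    int remote_reads = 0;

private:
    std::string _data;
};

// Cache written once as a decorator; it works with any FileReader.
// Assumption: a naive cache keyed by the exact (offset, len) request.
class CachedReader : public FileReader {
public:
    explicit CachedReader(std::shared_ptr<FileReader> inner)
            : _inner(std::move(inner)) {}

    size_t read_at(size_t offset, char* out, size_t len) override {
        auto key = std::make_pair(offset, len);
        auto it = _cache.find(key);
        if (it == _cache.end()) {
            // Cache miss: fetch from the wrapped reader and remember the bytes.
            std::vector<char> buf(len);
            size_t n = len > 0 ? _inner->read_at(offset, buf.data(), len) : 0;
            buf.resize(n);
            it = _cache.emplace(key, std::move(buf)).first;
        }
        if (!it->second.empty()) {
            std::memcpy(out, it->second.data(), it->second.size());
        }
        return it->second.size();
    }

private:
    std::shared_ptr<FileReader> _inner;
    std::map<std::pair<size_t, size_t>, std::vector<char>> _cache;
};
```

With this shape, callers only hold a `FileReader`, so wrapping a backend in `CachedReader` is a one-line change at construction time instead of a change at every call site. A real cache would need eviction, block alignment, and thread safety, which this sketch leaves out.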
We need to unify the IO stack to make it clearer and more extensible. In fact, IO access can be roughly divided into the following three types:
...
If we change the IO interface directly, it will affect a lot of code, so I will split the work into two steps:
1. Rewrite the IO stack in entirely new files and leave the current implementations alone, so that the change is easy to review.
2. Replace the current calls with the new IO stack.