...

Google Doc: <If the design in question is unclear or needs to be discussed and reviewed, a Google Doc can be used first to facilitate comments from others.>

Attached file: Apache Doris冷热数据详细设计.docx

Motivation

Cloud object storage is cheaper than multi-replica local storage, so cold data can be moved to S3 to store much more data at a lower price. More generally, Doris should not lose any feature as a result of putting cold data on S3.

...

cooldown_replica_id=-1 (non-persistent)
cooldown_term=-1 (non-persistent)
cooldowned_version=-1 (non-persistent): The maximum version of the contiguous rowsets uploaded by the tablet. It can be recalculated after the BE restarts.
cooldown_meta_id="" (persistent): The UUID bound to the cold data meta. Every data upload or ColdDataCompaction generates a new meta with a new UUID, and the meta uploaded to S3 is {uuid, rowset_metas}. Replicas with the same cooldown_meta_id have exactly the same cold data meta.
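The fields above can be pictured as a small per-tablet record. The following is a minimal Python sketch, not the Doris implementation: the class name `TabletCooldownState`, the helpers `max_continuous_version` and `new_cold_meta`, and the use of `uuid4` are all assumptions made for illustration. It also assumes the contiguous run of uploaded rowsets starts at the smallest uploaded version.

```python
import uuid
from dataclasses import dataclass


@dataclass
class TabletCooldownState:
    cooldown_replica_id: int = -1   # non-persistent
    cooldown_term: int = -1         # non-persistent
    cooldowned_version: int = -1    # non-persistent, recomputed after a BE restart
    cooldown_meta_id: str = ""      # persistent


def max_continuous_version(uploaded_ranges):
    """Return the end of the contiguous run of uploaded (start, end) ranges.

    Assumption: the run begins at the smallest uploaded start version.
    Returns -1 when nothing has been uploaded.
    """
    end = -1
    for start, stop in sorted(uploaded_ranges):
        if end != -1 and start != end + 1:
            break  # gap found: later rowsets are not part of the run
        end = stop
    return end


def new_cold_meta(rowset_metas):
    """Build the {uuid, rowset_metas} payload; uuid4 is an assumed generator."""
    return {"uuid": str(uuid.uuid4()), "rowset_metas": list(rowset_metas)}
```

Because only cooldown_meta_id is persistent, a restarted BE would rebuild cooldowned_version with something like `max_continuous_version` over the rowsets it finds uploaded.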

CooldownConfHandler

Flow Chart

[Flow chart image not included]

CooldownHandler in FE

CooldownConfHandler uniformly manages the data cooldown configuration and only takes effect on the MASTER node. It keeps a list of all tablets configured with hot and cold data rules, together with their replicas, and checks the status of each replica. The relevant information is stored in the tablet's metadata. Only ordinary tablets are processed here, not ShadowTablets. The replica information is consistent with that shown by show tablet.
When the CooldownReplicaId of a tablet is inconsistent with the state reported by the BE, the CooldownReplicaId needs to be synchronized first. When a tablet's replicas are abnormal (the CooldownType of all replicas is USE_REMOTE_DATA), the tablet must reselect its upload replica and set that replica's CooldownType to UPLOAD_DATA.

When a CooldownType needs to be issued, the FE sends a TPushCooldownConfReq request to deliver the state to the BE and modifies the UPLOADING_REPLICA_ID file of the specified tablet on S3. The file content is the REPLICA_ID of the replica whose current CooldownType is UPLOAD_DATA.

On the FE side, a CooldownType can only change from USE_REMOTE_DATA to UPLOAD_DATA. An UPLOAD_DATA node generally does not need to change unless its BE times out. When a BE times out, the FE automatically handles the tablet migration. A newly migrated tablet defaults to USE_REMOTE_DATA, and an UPLOAD_DATA node is then specified again. The FE must ensure that no two nodes have the CooldownType UPLOAD_DATA at the same time.

Because the cooldown operation does not need to be highly real-time, and to prevent the UPLOAD_DATA node from switching frequently, the timeout threshold can be set fairly long, for example half an hour.

Cooldown in BE

The CooldownType of each tablet on the BE side is USE_REMOTE_DATA by default, that is, it does not upload. Only when the FE sends a request that sets CooldownType to UPLOAD_DATA does the BE update its state and start uploading.

When CooldownType is UPLOAD_DATA, the BE checks UPLOADING_REPLICA_ID on S3. If UPLOADING_REPLICA_ID is not the current replica, the BE sets the local CooldownType to USE_REMOTE_DATA and stops uploading. If UPLOADING_REPLICA_ID is the current replica, the BE uploads the files and regenerates the remote meta file on S3. The meta file name must contain the current REPLICA_ID to prevent write conflicts with other replicas.
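The upload check described above can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dict stands in for the tablet's S3 directory, `upload_step` and `remote_meta_name` are hypothetical names, the meta file naming scheme is invented, and the "meta" written here is just a sorted list of uploaded file names rather than real rowset metadata.

```python
UPLOAD_DATA = "UPLOAD_DATA"
USE_REMOTE_DATA = "USE_REMOTE_DATA"


def remote_meta_name(replica_id):
    # Hypothetical naming scheme; the design only requires that the name
    # contains the replica ID so replicas never overwrite each other.
    return f"tablet_meta.{replica_id}"


def upload_step(local_replica_id, s3, pending_files):
    """One cooldown pass for a replica whose CooldownType is UPLOAD_DATA.

    `s3` is a plain dict standing in for the tablet's S3 directory.
    Returns the CooldownType the replica should hold afterwards.
    """
    if s3.get("UPLOADING_REPLICA_ID") != local_replica_id:
        return USE_REMOTE_DATA  # another replica owns the upload: stop
    for name, data in pending_files.items():
        s3[name] = data  # upload this replica's own files
    # Regenerate the per-replica meta file (simplified to a file listing).
    s3[remote_meta_name(local_replica_id)] = sorted(pending_files)
    return UPLOAD_DATA
```

A replica that loses ownership writes nothing, so a stale UPLOAD_DATA state on a BE downgrades itself on its next pass.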

When CooldownType is USE_REMOTE_DATA, the BE obtains UPLOADING_REPLICA_ID from S3 and uses that REPLICA_ID to find the corresponding remote meta file. It fetches the meta file and merges it with the local meta. If the VERSION corresponding to the local ROWSET is

UPLOAD_DATA is a privileged, high-risk operation. Only one replica may perform uploads at any given time, so some timeliness can be sacrificed to reduce the probability of concurrent uploads.

BE in UPLOAD_DATA type

Different replicas of the same batch of imported data under the same tablet have different rowset_ids. When a segment is uploaded, its corresponding TabletMeta should therefore be uploaded too, for reference by the other replicas of the tablet. This information only needs to cover the uploaded segments, not the local ones.

Each time CooldownType changes, the first cooldown run triggers metadata synchronization: it reads the TabletMeta file in the remote cluster directory, uses it to check the data range that already exists on the remote cluster (which VERSIONs have been uploaded), pulls the ROWSET metadata of the specified VERSION, and replaces the corresponding VERSION fragment in the local TabletMeta.

During cooldown, the BE checks UPLOADING_REPLICA_ID on S3. If UPLOADING_REPLICA_ID is not the current replica, the BE sets the local CooldownType to USE_REMOTE_DATA and stops uploading. If UPLOADING_REPLICA_ID is the current replica, the BE uploads the files and regenerates the remote meta file on S3.

Throughout the upload, a replica uploads only its own files, so it does not conflict with other replicas. In the extreme case where two replicas upload at the same time, the UPLOADING_REPLICA_ID file ensures that only one replica's data files are used.

Note: To prevent inconsistency between the local rowsets of different replicas, the cold data cannot be deduplicated against the cold data on the remote cluster. For now, the cold data is not compressed.

BE in USE_REMOTE_DATA type

Cooldown triggers metadata synchronization. The BE first reads the UPLOADING_REPLICA_ID file in the specified S3 directory, reads the TabletMeta file on S3 according to that ID, uses it to check the data range that already exists on S3 (which VERSIONs have been uploaded), pulls the ROWSET metadata of the specified VERSION, and replaces the corresponding VERSION fragment in the local TabletMeta.

After the local metadata is updated, the corresponding local data can be deleted.
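The replacement of local VERSION fragments by remote ones can be illustrated with a small sketch. It assumes each meta is a dict keyed by a (start, end) version range, and that no local rowset straddles the highest uploaded version. The function name `merge_remote_meta` and the `rowset_id` key are hypothetical.

```python
def merge_remote_meta(local, remote):
    """Replace local rowset metas that fall inside the uploaded version range.

    `local` and `remote` map (start_version, end_version) -> rowset meta dict.
    Returns (merged_meta, rowset_ids_whose_local_data_can_be_deleted).
    Assumption: local rowsets never straddle the highest uploaded version.
    """
    if not remote:
        return dict(local), []
    cooled_up_to = max(end for _, end in remote)
    merged = dict(remote)          # remote metas cover the cold range
    deletable = []
    for (start, end), meta in local.items():
        if end <= cooled_up_to:
            deletable.append(meta["rowset_id"])  # superseded by remote data
        else:
            merged[(start, end)] = meta          # still hot, keep local
    return merged, deletable


local_meta = {
    (0, 1): {"rowset_id": "L1"},
    (2, 2): {"rowset_id": "L2"},
    (3, 3): {"rowset_id": "L3"},
}
remote_meta = {(0, 2): {"rowset_id": "R1"}}
merged_meta, deletable_ids = merge_remote_meta(local_meta, remote_meta)
```

In this example the local rowsets for versions 0 through 2 are superseded by the single remote rowset, while version 3 stays local.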

Scheduling

When the replica of a tablet is abnormal (its CooldownReplicaId does not match the ReplicaId in the current FE), the CooldownReplicaId is reassigned.
PushCooldownConf then sends the configuration information to the BE: cooldown_replica_id and cooldown_term.
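The reassignment step might look like the sketch below. The source only says that CooldownReplicaId is reassigned and that cooldown_replica_id and cooldown_term are pushed. Picking the smallest alive replica ID and incrementing the term by one on every reassignment are assumptions made here for illustration, as are the names `CooldownConf` and `schedule`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CooldownConf:
    cooldown_replica_id: int = -1
    cooldown_term: int = -1


def schedule(conf: CooldownConf, alive_replica_ids: List[int]) -> Optional[CooldownConf]:
    """Return the conf to push to the BEs, or None if nothing changes.

    Assumptions (not stated in the design): the new cooldown replica is the
    smallest alive replica ID, and the term grows by one per reassignment.
    """
    if conf.cooldown_replica_id in alive_replica_ids:
        return None  # current cooldown replica is still valid
    if not alive_replica_ids:
        return None  # no candidate to take over
    return CooldownConf(min(alive_replica_ids), conf.cooldown_term + 1)
```

Bumping a term on each reassignment would let a BE distinguish a newer configuration from a delayed older one.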

Cooldown processing on the BE side

Since different replicas of the same batch of imported data under the same tablet have different rowset_ids, the corresponding TabletMeta must be uploaded together with each segment, for reference by the other replicas of the tablet. This information only needs to cover the uploaded segments, not the local ones.
When the CooldownReplicaId is the same as the current ReplicaId, the current replica needs to perform the upload. Based on the data range already on the remote cluster, it uploads the subsequent segments and then updates the remote TabletMeta.
When the CooldownReplicaId differs from the current ReplicaId, cooldown first reads the TabletMeta file in the remote cluster directory, checks the data range that already exists on the remote cluster (which segments have been uploaded), and compares it with the local TabletMeta. If the progress is consistent with the local meta, the remote TabletMeta is merged.

Delete invalid files

Since remote files are shared by multiple BEs in single-copy mode, a file may be deleted only after it is no longer used by any replica.
1. The BE side collects the file list and generates delete_id
The replica corresponding to the CooldownReplicaId on the BE side periodically lists the remote tablet directory and finds the files that are not referenced in the TabletMeta. These are invalid files. A corresponding delete_id is generated and stored in the BE cache. At the same time, delete_id, cooldown_meta_id, and cooldowned_version are reported to the FE through tabletReport.
2. The FE sends a file deletion request
When the FE finds a delete_id in a tabletReport, it checks whether the reporting replica is the CooldownReplicaId replica, and uses cooldown_meta_id and cooldowned_version to check whether the metadata of the other replicas is synchronized with it. Only when all replicas have reached the same progress can it be confirmed that no other replica still uses these invalid files.
The FE also checks that it is the Master. After these checks pass, it sends a CooldownDelete request that carries the delete_id back to the BE that just reported it.
3. Delete invalid files on the BE side
After receiving the CooldownDelete request from the FE, the BE checks whether the delete_id matches the one in its cache. If it does, the BE deletes the invalid files corresponding to that delete_id.
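The three steps above form a small handshake, sketched below under stated assumptions. `ReplicaReport`, `BeDeleteCache`, `collect_invalid_files`, and `fe_should_delete` are hypothetical names, the delete_id is generated with `uuid4` purely for illustration, and the actual RPCs (tabletReport, CooldownDelete) are reduced to plain function calls.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ReplicaReport:
    replica_id: int
    cooldown_meta_id: str
    cooldowned_version: int
    delete_id: str = ""


def collect_invalid_files(remote_files, meta_files):
    """Step 1: files in the remote tablet directory not referenced by TabletMeta."""
    return sorted(set(remote_files) - set(meta_files))


class BeDeleteCache:
    """BE-side cache mapping a delete_id to the files it covers."""

    def __init__(self):
        self._pending = {}

    def stage(self, invalid_files):
        delete_id = uuid.uuid4().hex  # assumed ID format
        self._pending[delete_id] = list(invalid_files)
        return delete_id

    def confirm(self, delete_id):
        """Step 3: return the files to delete, or [] if the ID is unknown."""
        return self._pending.pop(delete_id, [])


def fe_should_delete(report, all_reports, cooldown_replica_id, is_master):
    """Step 2: decide whether the FE may send CooldownDelete for this report."""
    if not is_master or not report.delete_id:
        return False
    if report.replica_id != cooldown_replica_id:
        return False
    return all(
        r.cooldown_meta_id == report.cooldown_meta_id
        and r.cooldowned_version == report.cooldowned_version
        for r in all_reports
    )
```

Echoing the delete_id back means the BE deletes only the file list it staged itself. A request with an unknown or stale ID deletes nothing.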

Compaction

Unlike a purely local replica, a tablet with hot and cold data must compact its local rowsets and its remote rowsets separately. The compaction itself reuses the original compaction logic.
When remote files are compacted, the upload node generates a new rowset according to the version, rebuilds the RemoteTabletMeta, and uploads it to the remote storage.
The download nodes then download the remote meta and synchronize with the upload node. Junk files are deleted through the invalid-file deletion logic described above.