Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Currently the ‘commit-leader’ makes a synchronous ‘commit’ call to the Pinot Controller to upload the segment being committed to a deep storage system. If this upload succeeds then the segment is deemed as committed and can be brought online.

Our new approach relies on the fact that "segmentLocation" is now a list of different HTTP end points. Any one location should be sufficient to download this segment (needless to say if one location is unavailable, we can retry using the next location). Exactly how this segment is persisted and when can it transition to ONLINE state can be done as follows:

a) Persisting a segment to both Pinot server (synchronously) and a configured deep storage (asynchronously and optional):

  • The commit leader becomes the owner of the segment being committed. It will make this segment downloadable using a new HTTP end point.
  • During segment completion protocol, the committer server uploads the segment to the configured deep storage. It will wait a short period of time (say 5 seconds) and then send the segment metadata to the controller. The wait maximizes the chances that other server can download the segment from deep storage instead of the commit server. The deep storage uri will be added to the segment location list in Zookeeper.
  • When the controller asks a server to download the segment for whatever reason, it will first try to download it from a deep storage uri stored in the zk as segment metadata to minimize the impact on servers. If there is no such uri, it will go down the list of servers to download the segment. More details on how to discover the server list is discover later.
  • All existing error handling routines remain unchanged. The next section lists diagrams showing the critical scenarios of (1) happy path, (2) controller failure and (3) slow server downloading from deep storage or Peer server. 

New server side segment completion steps

  1. The commit server sends the segment metadata to the controller with segment uri set to be the configured (if any) deep storage location.
  2. If step (1) commit to controller is successful, the server performs best-effort upload of the segment to the deep storage. i.e., the server does not need to wait for the segment to succeed.
  3. Otherwise, the present commit failure handling kicks in.

RealtimeValidationManager changes

b) Use the RealtimeValidationManager to fix LLC segments only has a single copy. Currently the validation manager only fix the most recent 2 segments of each partition. With this change, it will scans all segments's Zk data, if it finds a segment does not have a deep storage copy, it will instruct a server to upload the segment to the deep storage and then updates it segment location url with the deep storage uri. 

...

In our new design, the segment upload to deep storage is asynchronous and best effort. As a result, the server can not depend only on deep storage. Instead we allow the server to download the segment from peer servers (although it is not preferred).  There are two options to discover the servers hosting a given segment:

...

  1. Try to download the segment from the download url in segment Zk.
  2. If (1) failed, get the external view of the table (from Controller) to find out the ONLINE servers hosting the segments.
  3. Download the segment from the server. If failed, move to the next server in the list. The download format is Gzip'd file of the segment on the source server. 

Failure handling

...

  1. What happens when the segment upload fails but the

...

  1. preceding metadata commit succeeds ?
    • In this case, if a server needs to download the segment, it needs to download from the commit server which has a copy of the data.
    • We in fact want to minimize downloading from peer servers: so the segment completion mode should be set as DEFAULT instead of DOWNLOAD.
    • In the background, RealtimeValidationManager will fix the upload failure by periodically asks the server to upload missing segments.
  2. What happens if another server gets a "download" but the committer has not gotten to ONLINE state yet?  
    • To account for the fact that the metadata commit happens before the segment upload, another server should do retries (with exponential backoff) when downloading.
    • The retries with wait can greatly reduce the issues caused by the above race condition.

New interaction diagrams on segment completion protocol

...