...

  1. Refactor the existing controller code so that it can run a Helix-only controller (https://github.com/apache/incubator-pinot/pull/3864).
  2. Refactor the existing code so that the Pinot controller has a single interface for all periodic workloads (https://github.com/apache/incubator-pinot/pull/3264).
  3. Add logic in the HelixSetupUtils class to create the new resource but leave it disabled. The rebalance mode can be set to FULL-AUTO (https://github.com/apache/incubator-pinot/pull/4047).
  4. Add a controller config on the Pinot controller side to choose which mode to run in: Pinot-only mode, Helix-only mode, or dual mode (the default).
  5. Add logic on the controller side to check whether the new resource is enabled. A Pinot controller caches the partition number once it becomes master of that partition. If the lead controller resource is still disabled, the controller won’t receive any state transition messages.
    i. When there’s a state transition from Slave to Master for Partition_X: cache partition number X in the Pinot controller.
    ii. When there’s a state transition from Master to Slave for Partition_X: remove partition number X from the cache in the Pinot controller.
    iii. When a periodic task is run, or a real-time segment completion request is received:

  6. Add logic on the server side to check whether the new resource is enabled and, if it is, to look at it when the server is disconnected from the Helix controller. Currently, the server-side logic caches the previous lead controller. With this new feature, the caching stays in place, and new checks happen only when the server is disconnected or gets a not_leader message back. Since a Pinot server fetches the external view only once and then caches the new leader information, this adds few ZK reads.
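The caching behavior in items 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not Pinot's actual implementation: the class names, method names, and the `fetch_leader` callback (standing in for an external view read from ZK) are all hypothetical, and how a table maps to a partition is not covered here.

```python
class LeadPartitionCache:
    """Controller side (item 5): partitions this controller is master of."""

    def __init__(self):
        self._partitions = set()

    def on_slave_to_master(self, partition):
        # Slave -> Master for Partition_X: cache X.
        self._partitions.add(partition)

    def on_master_to_slave(self, partition):
        # Master -> Slave for Partition_X: remove X from the cache.
        self._partitions.discard(partition)

    def is_master_of(self, partition):
        return partition in self._partitions


class LeadControllerCache:
    """Server side (item 6): cached lead controller, re-read only on demand."""

    def __init__(self, fetch_leader):
        # fetch_leader is a hypothetical stand-in for reading the external view.
        self._fetch_leader = fetch_leader
        self._leader = None
        self.fetch_count = 0  # counts simulated ZK reads

    def get(self):
        # Re-read only if the cached value was invalidated (or never set).
        if self._leader is None:
            self.fetch_count += 1
            self._leader = self._fetch_leader()
        return self._leader

    def on_disconnected(self):
        self._leader = None  # next get() triggers one new read

    def on_not_leader(self):
        self._leader = None  # cached controller rejected the request
```

Because the server only invalidates its cache on a disconnect or a not_leader response, repeated calls to `get()` in the steady state cost no additional reads.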

 

...

Migration Rollout Plan

The deployment plan consists of 4 steps.

Step 0 

Roll out all the code changes and don’t enable the new resource yet. We won’t make any code changes after this step.  

Step 1 

Add as many new controllers to the cluster as the number of Helix controllers you need. Be sure to include redundancy for failures and upgrades. We suggest three new controllers. Start them in dual mode. Make sure the cluster is up and running before these steps. At this point all the controllers are in dual mode; they will become Pinot-only mode controllers once the rollout is complete.

Step

...

1 

Enable the new resource. All the dual-mode controllers will immediately be registered as masters/slaves in the new resource. Periodic tasks and real-time segment completion will immediately be distributed. Let it bake for several weeks. During this time, we can test the robustness of this feature by disabling and re-enabling the resource, running stress tests such as simulating node connection loss or failures, or upgrading to a compatible Helix version. The following criteria must be met before we move on to the next step. This could take days, weeks or months depending on the installation:

  1. All LLC and HLC tables have completed at least one segment and started new ones.
  2. All tables are accounted for in all the periodic tasks (no table is ignored).
  3. At least one round of rolling restarts of the Pinot controllers is done, and criteria 1 and 2 are verified after the restart.

...

  1. If any of these criteria goes wrong, disable the lead controller resource and everything comes back to the original state.
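The decision at the end of the bake period can be sketched as a small gate function. This is only an illustration under assumptions: the `Criterion` states, the criterion keys, and the returned action strings are hypothetical names, and in practice each criterion would be verified by an operator or by monitoring.

```python
from enum import Enum


class Criterion(Enum):
    PENDING = "pending"  # not yet observed
    MET = "met"
    FAILED = "failed"    # something went wrong


def next_action(criteria):
    """Decide what to do given a mapping of criterion name -> Criterion."""
    states = list(criteria.values())
    if any(state is Criterion.FAILED for state in states):
        # Disabling the resource returns the cluster to its original state.
        return "disable_lead_controller_resource"
    if states and all(state is Criterion.MET for state in states):
        return "proceed_to_next_step"
    return "keep_baking"


status = {
    "segments_completed_and_restarted": Criterion.MET,
    "all_tables_in_periodic_tasks": Criterion.MET,
    "rolling_restart_verified": Criterion.PENDING,
}
print(next_action(status))  # keep_baking
```

Separating "pending" from "failed" keeps the gate from disabling the resource merely because a table has not yet completed its first segment.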

Step 2

...

After verifying that everything is working fine, we can add one or more Helix-only controllers to the cluster, so that they can be candidates for Helix cluster leadership.

Step 3 

Then, switch all the dual-mode controllers to Pinot-only mode one by one. After doing so, only Helix-only controllers can be the Helix leader, and all the Pinot-only controllers work only on Pinot’s workloads. The rollout is finished.


Rollback Plan

The rollback plan is the reverse of the rollout plan. If anything goes wrong during the rollout, make sure that no later step is still in effect before rolling back the current step.
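The ordering rule can be pictured as a stack: steps are applied in order and may only be undone from the top. The sketch below is illustrative only; the `Rollout` class and the step descriptions are hypothetical labels paraphrasing the rollout steps, not part of any Pinot tooling.

```python
STEPS = [
    "enable lead controller resource",
    "add Helix-only controllers",
    "switch dual-mode controllers to Pinot-only mode",
]


class Rollout:
    def __init__(self, steps):
        self._steps = list(steps)
        self._applied = []  # stack of applied steps, latest last

    def apply_next(self):
        if len(self._applied) == len(self._steps):
            raise ValueError("all steps already applied")
        step = self._steps[len(self._applied)]
        self._applied.append(step)
        return step

    def roll_back(self, step):
        # Only the most recently applied step may be rolled back.
        if not self._applied or self._applied[-1] != step:
            raise ValueError("a later step is still in effect: roll it back first")
        return self._applied.pop()
```

Attempting to roll back the first step while the third is still in effect raises an error, which mirrors the requirement to undo steps strictly in reverse order.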

Step 1 

Restart all the Pinot-only controllers as dual-mode controllers.

...

Disable the lead controller resource. All the controller workload will then be handled by the Helix leader.

Test

...

Plans and

...

Schedule

Once the final plan has been adjusted and finalized, we can do the following steps. 

...