Pinot leverages Apache Helix for cluster management. Currently, the Pinot controller leader is tightly bound to the Helix leader: both run on the same host within the same process. That process therefore not only does the Pinot controller's work, such as running periodic tasks and completing real-time segments, but also takes on Helix's work, such as cluster management and Helix task scheduling. The responsibilities of the two controllers are listed below.
Helix controller tasks:
Pinot controller tasks:
We’ve observed performance issues with Helix in large clusters. If we separate the two controllers, the Helix controller can be debugged independently of the Pinot controller, making it easier to identify the root cause of performance issues. We can then roll out Helix bug fixes and measure any performance improvements without stopping any Pinot functionality.
When the Helix controller is separated from the Pinot controller, there is no longer a concept of a “lead” Pinot controller. Currently, the lead Pinot controller handles all the periodic tasks (retention management, etc.) as well as real-time segment completion. Thus, we need to define a new notion of “lead” controller to handle these tasks in the Pinot world. However, we can do better by distributing the tasks across all controllers instead of concentrating them on a single leader, thereby reducing the load on the current lead controller.
Note: The separation of these two controllers also has its downside.
Thus, we encourage running the Helix controller from the Pinot code base; we will provide options in the Pinot controller settings for cluster admins to choose between Pinot-only mode, Helix-only mode, and dual controller mode.
Github issue: https://github.com/apache/incubator-pinot/issues/3957
Like the "Table Resource" and "Broker Resource", we introduce a new resource called the “Lead Controller Resource” in the Helix cluster. Pinot controllers will also take the participant role in the cluster, just as Pinot brokers and servers do. The number of partitions (M) in the new resource is fixed, and chosen to be larger than the maximum number of controllers we expect in a cluster. This controller resource follows the master-slave state model. Helix guarantees that there is at most one host in master state for each partition.
We choose M as 17 for two reasons:
Each partition will have N replicas, of which one is in master state and N-1 are in slave state. The controller that is the master of a partition handles periodic tasks and real-time segment completion for the subset of tables that map to that partition. A table name is mapped to a partition by the following formula:
partitionIndex = hashFunction(rawTableName) % numPartitions
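The mapping above can be sketched as follows. Note that `hashFunction` is not specified here; this sketch uses MD5 as a stand-in stable hash, and Pinot's actual hash function may differ.

```python
import hashlib

# M = 17, chosen to be larger than the expected maximum number of controllers.
NUM_PARTITIONS = 17


def partition_for_table(raw_table_name: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a raw table name to a lead-controller-resource partition.

    MD5 is a stand-in for the unspecified hashFunction; the point is only
    that the mapping is deterministic and uniform across partitions.
    """
    digest = hashlib.md5(raw_table_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# Every table deterministically lands on exactly one partition, so exactly
# one master controller owns its periodic tasks and segment completion.
print(partition_for_table("myTable"))
```

Because the mapping depends only on the raw table name, the OFFLINE and REALTIME variants of the same table land on the same partition and are therefore managed by the same lead controller.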
That means the resource cannot tolerate all N replica controllers of a partition going down at the same time. The below shows what a partition looks like when N is 3:
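As an illustration, the external view of a single partition with N = 3 could look roughly like the following (instance names are hypothetical; the mapping shape follows Helix external-view conventions):

```json
"leadControllerResource_0": {
  "Controller_host1_9000": "MASTER",
  "Controller_host2_9000": "SLAVE",
  "Controller_host3_9000": "SLAVE"
}
```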
In production, the number of replicas N is set to 1, since it takes minimal effort for a Pinot controller to transition from OFFLINE to SLAVE and from SLAVE to MASTER.
The rebalance mode used for this resource is FULL-AUTO. Administrators therefore don’t need to maintain a list of controller hosts in the ideal state: the resource rebalances itself automatically. Once a controller is shut down, it is removed from the lead controller resource, so administrators can add, remove, or swap controller hosts with minimal effort. The resource also leverages the CrushEd rebalance strategy, which provides a more even partition distribution so that master state is spread evenly across all the Pinot controllers. The default delayed rebalance time is set to 5 minutes, which means that if a Pinot controller host goes offline, the Helix cluster will wait up to 5 minutes for it to recover; only after that timeout will it elect a new master. The benefit of this 5-minute delay is that leadership doesn’t switch too frequently (e.g. when restarting all the controller hosts one by one, each restart finishes well within 5 minutes, so no leadership switches are triggered). At the same time, since some periodic tasks run more frequently than that, 5 minutes is short enough for the Helix cluster to elect a new leader in a timely manner.
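Put together, the resource configuration described above could be expressed with IdealState settings roughly like the following (field names follow common Helix conventions; the exact values and class path here are illustrative, not copied from the implementation):

```json
{
  "id": "leadControllerResource",
  "simpleFields": {
    "REBALANCE_MODE": "FULL_AUTO",
    "REBALANCE_STRATEGY": "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy",
    "STATE_MODEL_DEF_REF": "MasterSlave",
    "DELAY_REBALANCE_TIME": "300000",
    "NUM_PARTITIONS": "17",
    "REPLICAS": "1"
  }
}
```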
There are three different modes in which to run controllers: dual mode, Helix-only, and Pinot-only.
A dual-mode controller behaves the same as the current controller, with the addition that it is registered as a participant in the new lead controller resource.
A Pinot-only controller focuses only on Pinot’s workloads. It performs the periodic workloads for a subset of the tables when it becomes master of certain partitions in the new resource.
A Helix-only controller runs from the same Pinot code base, but it runs only Helix logic and none of the Pinot controller’s logic.
Note: Right now, the modes other than dual mode are experimental and not recommended. We will update this document when the feature is fully debugged and ready to be used.
In separating the Helix and Pinot controllers, we also considered how to reduce the workload of a single Pinot controller. Here’s our plan to achieve that.
All the logic of the Pinot controller and the Helix controller will be separated. The workloads previously handled by the lead Pinot controller will be distributed across the controllers selected as master of each partition, while Helix controllers run independently.
The deployment plan consists of 4 steps.
Roll out all the code changes without enabling the new resource yet. We won’t make any code changes after this step. Make sure the cluster is up and running before proceeding. At this point all the controllers are in dual mode; they will become Pinot-only controllers once the rollout is complete.
Enable the new resource by changing the setting "RESOURCE_ENABLED" from false to true in the ZNode CONFIGS/RESOURCE/leadControllerResource in ZK.
Here is the value of leadControllerResource before enabling the resource:
{ "id" : "leadControllerResource", "simpleFields" :{ "RESOURCE_ENABLED" : "false" }, "mapFields" : {}, "listFields" : {} }
Once "RESOURCE_ENABLED" is set to true, all the dual-mode controllers will immediately be registered as masters/slaves in the new resource, and periodic tasks and real-time segment completion will immediately be distributed. The following criteria must be met in order to test the robustness of this feature before moving on to the next step; this could take days, weeks, or months depending on the installation:
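The flag flip itself is a small JSON edit on the ZNode content shown above. A minimal sketch of the transformation (the actual read/write would go through a ZooKeeper client or admin tool; that part is not shown here):

```python
import json


def enable_lead_controller_resource(znode_content: str) -> str:
    """Set RESOURCE_ENABLED to "true" in the leadControllerResource config ZNode."""
    record = json.loads(znode_content)
    record.setdefault("simpleFields", {})["RESOURCE_ENABLED"] = "true"
    return json.dumps(record)


# The ZNode value before enabling, as shown earlier in this document.
before = ('{"id": "leadControllerResource", '
          '"simpleFields": {"RESOURCE_ENABLED": "false"}, '
          '"mapFields": {}, "listFields": {}}')
after = enable_lead_controller_resource(before)
print(json.loads(after)["simpleFields"]["RESOURCE_ENABLED"])  # → true
```

Note that Helix stores the flag as the string "true"/"false" inside simpleFields, not as a JSON boolean, so the edit must preserve that convention.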
If you want to keep the Helix controller and Pinot controller running on the same hardware, you can stop at this step. If you want these two controllers to run on separate hardware, follow the steps below.
After verifying that everything works fine, we can add one or more Helix-only controllers to the cluster, so that they can be candidates for Helix cluster leadership.
Restart all the dual-mode controllers into Pinot-only mode one by one. After doing so, only a Helix-only controller can be the Helix leader, and all the Pinot-only controllers work only on Pinot’s workloads. The rollout is then complete.
The rollback plan is the reverse of the rollout plan. If anything goes wrong during rollout, undo the steps in reverse order: before rolling back a step, make sure every later step has already been rolled back.
Restart all the Pinot-only controllers to dual-mode controllers.
Shutdown Helix-only controllers.
Disable the lead controller resource. All the Pinot controller workload will again be handled by the Helix leader.
To expose table assignment information for debugging purposes, the following APIs are needed:
GET /leader/tables
| lead controller resource enabled | lead controller resource disabled |
| --- | --- |
|  |  |
GET /leader/tables/{tableName}
| lead controller resource enabled | lead controller resource disabled |
| --- | --- |
|  |  |
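A client-side sketch of how these endpoints might be consumed. The endpoints themselves come from this design, but the response payload shape used here (a map from lead controller instance to the tables it masters) is an assumption for illustration only:

```python
from typing import Dict, List, Optional

# Hypothetical response for GET /leader/tables: lead controller instance
# mapped to the list of tables whose partitions it currently masters.
# Instance names are made up for the example.
sample_response: Dict[str, List[str]] = {
    "Controller_host1_9000": ["airlineStats", "clicks"],
    "Controller_host2_9000": ["impressions"],
}


def controller_for_table(response: Dict[str, List[str]], table_name: str) -> Optional[str]:
    """Invert the controller->tables map to find which controller leads a table,
    i.e. what GET /leader/tables/{tableName} would answer for one table."""
    for controller, tables in response.items():
        if table_name in tables:
            return controller
    return None


print(controller_for_table(sample_response, "impressions"))  # → Controller_host2_9000
```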
Once the final plan has been adjusted and finalized, we can proceed with the following steps.
Below is future work from which Pinot can benefit, either by changing the current codebase or by leveraging new Helix features.