Pinot leverages Apache Helix for cluster management. Currently, the Pinot controller leader is tightly bound to the Helix leader: both run on the same host within the same process. That process therefore not only does the Pinot controller's work, such as running periodic tasks and completing real-time segments, but also takes on Helix's work, such as cluster management and Helix task scheduling. The responsibilities of the two controllers are listed below.
Helix controller tasks:
Pinot controller tasks:
We’ve observed performance issues with Helix in large clusters. If we separate the two controllers, the Helix controller can be debugged independently of the Pinot controller, making it easier to identify the root cause of performance issues. We can then roll out Helix bug fixes and measure any performance improvements without stopping any Pinot functionality.
When the Helix controller is separated from the Pinot controller, there is no longer a concept of a “lead” Pinot controller. Currently, the lead Pinot controller handles all the periodic tasks (retention management, etc.) as well as real-time segment completion. Thus, we need to define a new notion of “lead” controller to handle these tasks in the Pinot world. However, we can do better by distributing the tasks across all controllers instead of concentrating them on a single leader, thereby reducing the load on the current lead controller.
Note: The separation of these two controllers also has its downside.
Thus, we encourage running the Helix controller from the Pinot code base; we will provide options in the Pinot controller settings for cluster admins to choose between Pinot-only mode, Helix-only mode, and dual controller mode.
Github issue: https://github.com/apache/incubator-pinot/issues/3957
Like the "Table Resource" and "Broker Resource", we introduce a new resource called the “Lead Controller Resource” in the Helix cluster. Pinot controllers will also take the participant role in the cluster, just as Pinot brokers and servers do. The number of partitions (M) in the new resource is fixed, and chosen to be larger than the maximum number of controllers we expect in a cluster. This controller resource follows the master-slave state model. Helix guarantees that there is at most one host in master state for each partition.
We choose M as 17 for two reasons:
Each partition will have N replicas, of which one is in master state and N-1 are in slave state. The controller that is the master of a partition handles periodic tasks and real-time segment completion for the subset of tables that map to that partition. A table name is mapped to a partition by the following formula:
partitionIndex = hashFunction(rawTableName) % numPartitions
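The mapping above can be sketched as follows. Note that `hashFunction` is not specified here; this sketch uses MD5 as a stand-in stable hash, and Pinot's actual hash function may differ.

```python
import hashlib

# M = 17, chosen to be larger than the expected maximum number of controllers.
NUM_PARTITIONS = 17


def partition_for_table(raw_table_name: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a raw table name to a lead-controller-resource partition.

    MD5 is a stand-in for the unspecified hashFunction; the point is only
    that the mapping is deterministic and uniform across partitions.
    """
    digest = hashlib.md5(raw_table_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# Every table deterministically lands on exactly one partition, so exactly
# one master controller owns its periodic tasks and segment completion.
print(partition_for_table("myTable"))
```

Because the mapping depends only on the raw table name, the OFFLINE and REALTIME variants of the same table land on the same partition and are therefore managed by the same lead controller.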
That means the resource cannot tolerate all N replica controllers of a partition going down at the same time. The below shows what a partition looks like when N is 3:
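As an illustration, the external view of a single partition with N = 3 could look roughly like the following (instance names are hypothetical; the mapping shape follows Helix external-view conventions):

```json
"leadControllerResource_0": {
  "Controller_host1_9000": "MASTER",
  "Controller_host2_9000": "SLAVE",
  "Controller_host3_9000": "SLAVE"
}
```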
In production, the number of replicas N is set to 1, since it takes minimal effort for a Pinot controller to transition from OFFLINE to SLAVE and from SLAVE to MASTER.
The rebalance mode used for this resource is FULL-AUTO. Administrators therefore don’t need to maintain a list of controller hosts in the ideal state: the resource rebalances itself automatically. Once a controller is shut down, it is removed from the lead controller resource, so administrators can add, remove, or swap controller hosts with minimal effort. The resource also leverages the CrushEd rebalance strategy, which provides a more even partition distribution so that master state is spread evenly across all the Pinot controllers. The default delayed rebalance time is set to 5 minutes, which means that if a Pinot controller host goes offline, the Helix cluster will wait up to 5 minutes for it to recover; only after that timeout will it elect a new master. The benefit of this 5-minute delay is that leadership doesn’t switch too frequently (e.g. when restarting all the controller hosts one by one, each restart finishes well within 5 minutes, so no leadership switches are triggered). At the same time, since some periodic tasks run more frequently than that, 5 minutes is short enough for the Helix cluster to elect a new leader in a timely manner.
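Put together, the resource configuration described above could be expressed with IdealState settings roughly like the following (field names follow common Helix conventions; the exact values and class path here are illustrative, not copied from the implementation):

```json
{
  "id": "leadControllerResource",
  "simpleFields": {
    "REBALANCE_MODE": "FULL_AUTO",
    "REBALANCE_STRATEGY": "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy",
    "STATE_MODEL_DEF_REF": "MasterSlave",
    "DELAY_REBALANCE_TIME": "300000",
    "NUM_PARTITIONS": "17",
    "REPLICAS": "1"
  }
}
```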
There are three different modes in which to run controllers: dual mode, Helix-only, and Pinot-only.
A dual-mode controller behaves the same as the current controller, with the addition that it is registered as a participant in the new lead controller resource.
A Pinot-only controller focuses only on Pinot’s workloads. It performs the periodic workloads for a subset of the tables when it becomes master of certain partitions in the new resource.
A Helix-only controller runs from the same Pinot code base, but it runs only Helix logic and none of the Pinot controller’s logic.
Note: Right now, the modes other than dual mode are experimental and not recommended. We will update this document when the feature is fully debugged and ready to be used.
In separating the Helix and Pinot controllers, we also considered how to reduce the workload of a single Pinot controller. Here’s our plan to achieve that.
All the logic of the Pinot controller and the Helix controller will be separated. The workloads previously handled by the lead Pinot controller will be distributed across the controllers selected as master of each partition, while Helix controllers run independently.
The deployment plan consists of 4 steps.
Roll out all the code changes without enabling the new resource yet. We won’t make any code changes after this step. Make sure the cluster is up and running before proceeding. At this point all the controllers are in dual mode; they will become Pinot-only controllers once the rollout is complete.
Enable the new resource by changing the setting "RESOURCE_ENABLED" from false to true in the ZNode CONFIGS/RESOURCE/leadControllerResource in ZK.
Here is the value of leadControllerResource before enabling the resource:
{ "id" : "leadControllerResource", "simpleFields" :{ "RESOURCE_ENABLED" : "false" }, "mapFields" : {}, "listFields" : {} }
Once "RESOURCE_ENABLED" is set to true, all the dual-mode controllers will immediately be registered as masters/slaves in the new resource, and periodic tasks and real-time segment completion will immediately be distributed. The following criteria must be met in order to test the robustness of this feature before moving on to the next step; this could take days, weeks, or months depending on the installation:
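The flag flip itself is a small JSON edit on the ZNode content shown above. A minimal sketch of the transformation (the actual read/write would go through a ZooKeeper client or admin tool; that part is not shown here):

```python
import json


def enable_lead_controller_resource(znode_content: str) -> str:
    """Set RESOURCE_ENABLED to "true" in the leadControllerResource config ZNode."""
    record = json.loads(znode_content)
    record.setdefault("simpleFields", {})["RESOURCE_ENABLED"] = "true"
    return json.dumps(record)


# The ZNode value before enabling, as shown earlier in this document.
before = ('{"id": "leadControllerResource", '
          '"simpleFields": {"RESOURCE_ENABLED": "false"}, '
          '"mapFields": {}, "listFields": {}}')
after = enable_lead_controller_resource(before)
print(json.loads(after)["simpleFields"]["RESOURCE_ENABLED"])  # → true
```

Note that Helix stores the flag as the string "true"/"false" inside simpleFields, not as a JSON boolean, so the edit must preserve that convention.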
If you want to keep the Helix controller and Pinot controller running on the same hardware, you can stop at this step. If you want these two controllers to run on separate hardware, follow the steps below.
After verifying that everything works fine, we can add one or more Helix-only controllers to the cluster, so that they can be candidates for Helix cluster leadership.
Restart all the dual-mode controllers into Pinot-only mode one by one. After doing so, only a Helix-only controller can be the Helix leader, and all the Pinot-only controllers work only on Pinot’s workloads. The rollout is then complete.
The rollback plan is the reverse of the rollout plan. If anything goes wrong during rollout, undo the steps in reverse order: before rolling back a step, make sure every later step has already been rolled back.
Restart all the Pinot-only controllers to dual-mode controllers.
Shutdown Helix-only controllers.
Disable the lead controller resource. All the Pinot controller workload will again be handled by the Helix leader.
To expose table assignment information for debugging purposes, the following APIs are needed:
GET /leader/tables
| lead controller resource enabled | lead controller resource disabled |
| --- | --- |
|  |  |
GET /leader/tables/{tableName}
| lead controller resource enabled | lead controller resource disabled |
| --- | --- |
|  |  |
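A client-side sketch of how these endpoints might be consumed. The endpoints themselves come from this design, but the response payload shape used here (a map from lead controller instance to the tables it masters) is an assumption for illustration only:

```python
from typing import Dict, List, Optional

# Hypothetical response for GET /leader/tables: lead controller instance
# mapped to the list of tables whose partitions it currently masters.
# Instance names are made up for the example.
sample_response: Dict[str, List[str]] = {
    "Controller_host1_9000": ["airlineStats", "clicks"],
    "Controller_host2_9000": ["impressions"],
}


def controller_for_table(response: Dict[str, List[str]], table_name: str) -> Optional[str]:
    """Invert the controller->tables map to find which controller leads a table,
    i.e. what GET /leader/tables/{tableName} would answer for one table."""
    for controller, tables in response.items():
        if table_name in tables:
            return controller
    return None


print(controller_for_table(sample_response, "impressions"))  # → Controller_host2_9000
```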
Once the final plan has been adjusted and finalized, we can proceed with the following steps.
Below is future work from which Pinot can benefit, either by changing the current codebase or by leveraging new Helix features.