Typical distributed data systems are clusters composed of many machines. If the dataset does not fit on a single machine, we usually shard the data into partitions, and each partition can have multiple replicas for fault tolerance. Partition management needs to ensure that replicas are distributed among machines as evenly as possible. More importantly, when a node fails, the partitions hosted on that node must be reallocated evenly among the remaining nodes, and when machines join or leave the cluster, partitions need to be rebalanced across the available machines. The placement of partitions in a distributed data system is essential for the reliability and scalability of the system.

In addition to the topology awareness discussed earlier, there are a few other requirements we considered when designing our partition management system. These requirements are explained below:

  • Delayed partition movement for short-lived node outages. There are many cases in which a node temporarily goes down: scheduled maintenance, a crash followed by a quick recovery, a long GC pause, or an intermittent network partition. In these scenarios, we would like to minimize partition movements as much as possible while maintaining the various constraints and availability needs of applications.
  • Throttled partition movement. Partition management logic typically moves partitions whenever node availability changes or the load among existing nodes is imbalanced. These balancing procedures are usually transparent to the user and application. However, partition migrations carry some overhead in terms of bandwidth and workload, both of which can impact application performance. The partition management system should be able to throttle partition movements and move only a (configurable) number of partitions at a time.
  • Partition movement priority support. When partitions are being moved, certain types of partition migration should be treated as higher priority than others; for example, bringing up a primary (master) replica for a partition should always take higher priority than bootstrapping a backup (slave) replica. If partition migration is throttled, partitions from higher-priority resources should be migrated first.
  • Maintain availability during rebalance. The system should always try to maintain a minimal number of active replicas for each partition when moving partitions across machines during the rebalance process.

About Apache Helix

Apache Helix is a generic cluster management framework used for the automatic management of partitioned and replicated distributed systems hosted on a cluster of nodes. Helix automates reassignment of partitions in the face of node failure and recovery, cluster expansion, and reconfiguration.

Helix separates cluster management from application logic by representing an application’s lifecycle as a declarative state machine. In this state model, you describe the possible states of your resource and the transitions between them. Helix then triggers the proper state transitions to achieve the ideal state, or rolls replicas back to the initial state when retiring a server.
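For example, the Master-Slave state model referenced throughout this post can be declared roughly as follows. This is a sketch based on Helix’s StateModelDefinition.Builder API; exact method names and signatures may differ across Helix versions.

    import org.apache.helix.model.StateModelDefinition;

    public class MasterSlaveModelSketch {
      // Sketch of declaring a Master-Slave state model; method names may differ by Helix version.
      public static StateModelDefinition buildMasterSlaveModel() {
        StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MasterSlave");

        // States and their priority (a lower number means a higher priority).
        builder.addState("MASTER", 1);
        builder.addState("SLAVE", 2);
        builder.addState("OFFLINE");
        builder.initialState("OFFLINE");

        // Legal transitions between states.
        builder.addTransition("OFFLINE", "SLAVE");
        builder.addTransition("SLAVE", "MASTER");
        builder.addTransition("MASTER", "SLAVE");
        builder.addTransition("SLAVE", "OFFLINE");

        // Constraints: at most one MASTER per partition; the SLAVE count is derived
        // from the replication factor ("R").
        builder.upperBound("MASTER", 1);
        builder.dynamicUpperBound("SLAVE", "R");

        return builder.build();
      }
    }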

Helix is now widely used at LinkedIn to manage its critical infrastructure, such as Espresso, LinkedIn's horizontally scalable document store for primary data; Venice, our derived data serving platform for merging batch and streaming data sources; Ambry, LinkedIn’s distributed object store; Gobblin, a unified ingestion framework; our real-time analytics platforms like Pinot; and LinkedIn Platform as a Service (LPS).

The Apache Helix website describes in detail the architecture of the system and how it can be used to manage distributed systems. This blog post walks through Helix’s concepts and design philosophy.

Helix Terminology

  • Resource: A logical group of data, services, or tasks.
  • Partition: A resource is usually divided into a number of shards, each of which can fit on a single machine. Such a shard is referred to as a partition in Helix.
  • Replica: Partitioning splits the data/task into multiple sub-parts. To increase the availability of the system during failures, the common solution is to keep multiple copies of each partition. Helix refers to a copy of a partition as a replica.
  • Replica State: Also referred to as state. A logical tag assigned to a replica of a partition (e.g., master or slave). Helix allows one to assign different states to the replicas of a partition; for example, you can assign master and slave replicas, or leader and standby replicas.
  • Top State: The highest-priority state for a replica according to its state model definition. For example, master is the top state for the Master-Slave model, while leader is the top state for the Leader-Standby model.
  • Node/Instance: A physical or virtual machine that can host a set of replicas.
  • Live Node/Instance: A node that is currently available to host replicas and serve traffic.
  • Replication Factor (RF): The number of replicas Helix needs to maintain for each partition of a given resource.
  • Partition and State Assignment: Also referred to as assignment. A mapping of the replicas and their states for each partition of a resource onto a cluster of nodes.
  • IdealState: The ideal partition and state assignment for the cluster in its current situation; this reflects the target state Helix will try to bring the cluster to.
  • CurrentState: All replicas and their states on each instance in the cluster; this reflects the current state of partitions in the cluster (see the sketch after this list).
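To make the last two terms concrete, both the IdealState and the CurrentState of a resource can be thought of as a mapping from each partition to its (instance, replica state) pairs; Helix persists these mappings in ZooKeeper. The snippet below only illustrates the shape of such an assignment; the resource and host names are made up.

    import java.util.Map;

    public class AssignmentShapeExample {
      // Illustrative only: the shape of a partition/state assignment (IdealState or
      // CurrentState) for a resource with two partitions, RF = 2, Master-Slave model.
      static final Map<String, Map<String, String>> ASSIGNMENT = Map.of(
          "MyDB_0", Map.of("host1_12000", "MASTER", "host2_12000", "SLAVE"),
          "MyDB_1", Map.of("host2_12000", "MASTER", "host3_12000", "SLAVE"));
    }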

Helix Rebalancer

Helix's rebalancer is one of its most critical components. It is tasked with determining an allocation of resources to nodes as the topology changes. For a resource broken into partitions, with each partition having multiple replicas, Helix needs to determine the best partition-to-node mapping as nodes become available or unavailable, and move partitions across nodes if needed to maintain all constraints while ensuring overall load distribution across the cluster.

The Helix rebalancer employs a rebalance strategy to compute the ideal partition and state assignment (IdealState) of the system. When the current partition and state mapping (CurrentState) differs from the IdealState, Helix uses the IdealState as the target state of the system and computes the appropriate transitions needed to bring the CurrentState in line with it. More details on Helix’s rebalancer workflow can be found in our previous blog post here.
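Conceptually, computing the needed transitions is a diff between the IdealState and CurrentState mappings shown earlier. The sketch below is a simplified illustration with hypothetical types; it is not the actual Helix implementation, and it ignores constraints, throttling, and the ordering imposed by the state model.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class TransitionDiffSketch {
      // One required state transition, e.g. ("MyDB_0", "host2_12000", "SLAVE" -> "MASTER").
      record Transition(String partition, String instance, String fromState, String toState) {}

      // Compute the transitions needed to move the CurrentState toward the IdealState
      // for one resource. Simplified: no constraints, throttling, or state model ordering.
      static List<Transition> diff(Map<String, Map<String, String>> idealState,
                                   Map<String, Map<String, String>> currentState) {
        List<Transition> transitions = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : idealState.entrySet()) {
          String partition = entry.getKey();
          Map<String, String> target = entry.getValue();
          Map<String, String> current = currentState.getOrDefault(partition, Map.of());
          // Replicas that need to be bootstrapped or moved to a different state.
          for (Map.Entry<String, String> t : target.entrySet()) {
            String from = current.getOrDefault(t.getKey(), "OFFLINE");
            if (!from.equals(t.getValue())) {
              transitions.add(new Transition(partition, t.getKey(), from, t.getValue()));
            }
          }
          // Replicas that are no longer wanted on an instance.
          for (Map.Entry<String, String> c : current.entrySet()) {
            if (!target.containsKey(c.getKey())) {
              transitions.add(new Transition(partition, c.getKey(), c.getValue(), "DROPPED"));
            }
          }
        }
        return transitions;
      }
    }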

This blog mainly focuses on the design of the Helix rebalancer under full-auto rebalance mode (a.k.a. the Helix Full-Auto rebalancer). The full-auto rebalancer moves partitions whenever node availability changes. Depending on the number of available replicas a partition still has, rebalancing of that partition can be categorized into two types: recovery rebalance and load rebalance.

  • Recovery Rebalance: When an instance goes offline, all replicas on that instance become inactive. Depending on the application's configuration, Helix may need to move the partitions to other online and healthy instances to maintain a certain number of active replicas for each partition. We refer to this type of partition movement as recovery rebalance or recovery partition movement, since it recovers partitions from the inactive node to maintain the system’s availability.
  • Load Rebalance: The Helix rebalancer also monitors the number of replicas on each instance. If the replicas are distributed unevenly across the instances of the cluster, the rebalancer attempts to automatically migrate replicas among instances to achieve a roughly even number of replicas per instance. This type of rebalance is for load-balancing purposes and is therefore referred to as load rebalance or load-balance partition movement. The load balance process redistributes the partitions of each resource evenly among the instances (see the sketch after this list).
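The distinction can be sketched as a simple classification of a pending movement. The helper below is hypothetical, not Helix code, and only illustrates the rule: a movement that restores a partition whose active replica count has fallen below its requirement is a recovery rebalance; everything else is a load rebalance.

    public class RebalanceTypeSketch {
      enum RebalanceType { RECOVERY_BALANCE, LOAD_BALANCE }

      // Hypothetical classifier: movements that restore a partition below its required
      // active replica count are recovery balance; the rest are load balance
      // (evening out replica counts across instances).
      static RebalanceType classify(int activeReplicas, int requiredActiveReplicas) {
        return activeReplicas < requiredActiveReplicas
            ? RebalanceType.RECOVERY_BALANCE
            : RebalanceType.LOAD_BALANCE;
      }
    }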

Delayed Partition Movement

By default, as nodes become unavailable (go offline), the replicas on those nodes are immediately shuffled to other nodes across the cluster. However, there are many cases in which a node is disconnected from Helix but is expected to come back soon. These scenarios include:

  • Scheduled maintenance.

  • Node crashes and restarts.

  • Participant experiences long stop-the-world GC.

  • Intermittent network partition.

  • Node crashes and recovers in a short period of time.

In such cases, Helix would immediately reshuffle all replicas on that node and bootstrap new replicas on other instances. Bootstrapping a new replica can take several minutes or even hours; if the crashed node comes back during this time, Helix may discard the ongoing bootstrap and shuffle some of the replicas back to the original node (for load-balancing purposes). This generates a lot of unnecessary network traffic that can negatively impact production performance.

With delayed partition movement, the Helix full-auto rebalancer is able to minimize the reshuffling of replicas resident on a node when that node experiences a short-lived outage, while still providing the flexibility for applications to maintain their constraints and availability needs. More specifically:

  • Partition reshuffle is delayed upon node outage. If an instance goes offline (is disconnected from Helix), the replicas resident on that instance are not immediately re-shuffled to other online instances. This is based on the assumption that a temporarily offline instance will usually come back online very quickly. In this case, delayed partition movement avoids or reduces the cost of moving and bootstrapping all of its replicas on other instances.
    • Within the delay period, if the offline node comes back online, Helix brings all offline replicas on the instance back to their target states.

    • If the node has been offline for longer than the configured time period, Helix starts reshuffling the replicas on that node to other instances.

    • The delayed time is configurable at both cluster level and resource level.

While Helix tries to delay partition movements as much as possible, it also maintains two kinds of constraints:

  • Minimal active replica count is maintained. Helix always tries to maintain a minimal number of active (non-offline) replicas for each partition of a given resource. This minimal active replica count is configurable at the per-resource level. Effectively, if a node going offline causes a partition’s active replica count to drop below its minimal active replica count, Helix immediately brings up enough replicas on other instances to keep the active replica count at or above the required minimum, without any delay.

  • The top-state replica is always maintained. Helix always tries to keep at least one replica in its top state as defined in the state model. For example, if the top-state replica was on the offline instance, Helix immediately transitions another existing replica to the top state, or brings up a replica on a new node and transitions it to the top state. No delay is applied. (A sketch combining the delay with these two constraints follows this list.)
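Putting the delay and the two constraints together, the decision for the replicas of a partition affected by a newly offline node looks roughly like the following. This is an illustrative sketch with hypothetical types, not the actual rebalancer code.

    public class DelayedRebalanceSketch {
      // Per-partition view as seen right after one of its hosting nodes goes offline
      // (hypothetical type for illustration).
      record PartitionStatus(int activeReplicas, int minActiveReplicas, boolean hasTopStateReplica) {}

      // Decide whether moving this partition's lost replica can wait for the offline
      // node to come back. Illustrative only.
      static boolean canDelayMovement(PartitionStatus p, long offlineDurationMs, long configuredDelayMs) {
        // Constraint 1: never let a partition fall below its minimal active replica count.
        if (p.activeReplicas() < p.minActiveReplicas()) {
          return false; // recover immediately, no delay
        }
        // Constraint 2: never leave a partition without a top-state (e.g., master) replica.
        if (!p.hasTopStateReplica()) {
          return false; // promote or bootstrap a top-state replica immediately
        }
        // Otherwise, wait out the configured delay in case the node comes back online.
        return offlineDurationMs < configuredDelayMs;
      }
    }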

Partition Movement Throttling

As we discussed, Helix moves partitions in two scenarios: recovery balance and load balance. Both types of balancing are transparent to the user and the application layer. However, partition migrations carry some overhead in terms of bandwidth and workload, both of which can impact application performance. The Helix rebalancer attempts to minimize the impact by:

  • Prioritizing recovery balance over load balance.

  • Moving only a fixed (and configurable) number of partitions at a time.

  • Throttling the total number of ongoing partition movements at different levels, e.g., cluster level, resource level, and instance level. For example, an application can require that no more than 100 partition movements happen at the same time in the entire cluster, while no more than five partition movements happen at the same time for database a.

  • Applying a load-balance threshold: to minimize the impact of balancing on the cluster, the Helix rebalancer only begins load balancing after the partition distribution for a resource has crossed a certain threshold. The threshold applies to the difference in the number of partitions between the instance hosting the most partitions for the resource and the instance hosting the fewest partitions for that resource.

Helix enables the application to throttle partition movements for different rebalance types at different levels. Specifically, a maximum pending state-transition count can be set at the cluster, resource, or instance level. If maximum pending transition counts are set at multiple levels, all of the throttling requirements are satisfied when Helix generates the movement plan. The supported throttling scopes are listed below, followed by a configuration sketch.


 

  • Resource: at most x pending transitions per resource.

  • Instance: at most x pending transitions per instance.

  • Cluster: at most x pending transitions within the cluster.
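As an example, the following sketch sets a cluster-wide cap of 100 in-flight movements and a per-resource cap of five load-balance movements, mirroring the example above. It is based on Helix’s ClusterConfig and StateTransitionThrottleConfig classes; the package locations, method names, and constructor signature are taken from the Helix codebase at the time of writing and may differ across versions.

    import java.util.Arrays;

    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.api.config.StateTransitionThrottleConfig;
    import org.apache.helix.api.config.StateTransitionThrottleConfig.RebalanceType;
    import org.apache.helix.api.config.StateTransitionThrottleConfig.ThrottleScope;
    import org.apache.helix.model.ClusterConfig;

    public class ThrottleConfigSketch {
      // Sketch: class and method names follow the Helix API and may differ by version.
      static void configureThrottling(ConfigAccessor configAccessor, String clusterName) {
        ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
        clusterConfig.setStateTransitionThrottleConfigs(Arrays.asList(
            // At most 100 partition movements of any kind in flight across the whole cluster.
            new StateTransitionThrottleConfig(RebalanceType.ANY, ThrottleScope.CLUSTER, 100),
            // At most 5 load-balance movements in flight for any single resource.
            new StateTransitionThrottleConfig(RebalanceType.LOAD_BALANCE, ThrottleScope.RESOURCE, 5)));
        configAccessor.setClusterConfig(clusterName, clusterConfig);
      }
    }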

Partition Movement Priority

During rebalance, certain types of partition movements or transitions should be treated as higher priority than others. For example, bringing up a master replica for a partition should always take higher priority than bootstrapping a slave replica. In addition, Helix allows the application to specify an order for the resources within a cluster, such that if partition movements are throttled, partitions of higher-priority resources are always rebalanced first.

Helix supports both resource-level priority and partition-level priority.

  • Resource-level priority: To support resource-level prioritization, Helix lets the application or sysops provide a sorting field for each resource. Helix sorts all resources within the cluster by this field in descending order and then prioritizes state transitions at the resource level in the same order.

  • Partition-level priority: For the list of partitions that need to be rebalanced, Helix treats partitions that are missing their required top-state (e.g., master) replica as high priority and rebalances them as soon as possible. In general (a sketch of this ordering follows the list):
    1. A partition that does not have any top-state replica is treated as the highest priority.
    2. Among the partitions that do have a top-state replica, a partition with fewer than its required minimum active replicas is the next-highest priority.
    3. A partition that only needs to be load-balanced has the lowest priority and is rebalanced last.
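This ordering can be expressed as a simple bucketing of partitions, sketched below with hypothetical types purely for illustration.

    import java.util.Comparator;

    public class PartitionPrioritySketch {
      // Minimal per-partition summary used only for ordering (hypothetical type).
      record PartitionInfo(String name, boolean hasTopStateReplica,
                           int activeReplicas, int minActiveReplicas) {}

      // A lower bucket value means a higher rebalance priority.
      static int priorityBucket(PartitionInfo p) {
        if (!p.hasTopStateReplica()) {
          return 0; // 1. missing its top-state (e.g., master) replica: highest priority
        }
        if (p.activeReplicas() < p.minActiveReplicas()) {
          return 1; // 2. below its required minimum active replica count
        }
        return 2;   // 3. only needs load balancing: lowest priority, rebalanced last
      }

      static final Comparator<PartitionInfo> REBALANCE_ORDER =
          Comparator.comparingInt(PartitionPrioritySketch::priorityBucket);
    }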


Maintain Availability during Rebalance

During load balancing, Helix always brings up new replicas before bringing down old replicas. In an extreme case, when multiple nodes are added during cluster expansion, some partitions could be assigned to target instances that are completely different from their existing assignments. Since Helix can send all state transitions in parallel, depending on how fast each state transition takes place, there could be a window of time with no live replica for such a partition. This would greatly impact the availability of the application service, and Helix should avoid it as best it can.

The Helix Full-Auto rebalancer solves this issue by bringing a replica online on its newly assigned node before it actually drops the replica from its previously assigned node during partition migration. In addition, to reduce the period of time without a top-state replica, Helix tries its best to choose the new top-state (e.g., master) replica only from replicas that are currently in the next-highest state (e.g., slave), if possible. A simplified sketch of this ordering follows.
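In simplified terms, for each migrating partition the rebalancer orders the pending transitions so that "bring up" transitions on newly assigned nodes are issued before "drop" transitions on previously assigned nodes, and a replacement replica is expected to be online before the old one is released. The sketch below shows only the ordering step, with hypothetical types; the real rebalancer also honors the state model and throttling.

    import java.util.Comparator;
    import java.util.List;

    public class UpBeforeDownSketch {
      // A pending transition for one replica of a partition (hypothetical type).
      record Transition(String partition, String instance, String fromState, String toState) {}

      // A "bring up" transition bootstraps or promotes a replica on its target instance.
      static boolean isBringUp(Transition t) {
        return !t.toState().equals("OFFLINE") && !t.toState().equals("DROPPED");
      }

      // Order one partition's pending transitions so replicas are brought up on newly
      // assigned nodes before any replica is taken down on previously assigned nodes.
      static List<Transition> orderForAvailability(List<Transition> pending) {
        return pending.stream()
            .sorted(Comparator.comparingInt((Transition t) -> isBringUp(t) ? 0 : 1))
            .toList();
      }
    }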

Summary

Partition management is critical to distributed data systems, and it is non-trivial to build such logic into these systems. Helix's rebalancer is tasked with determining the best partition-to-node mapping as the cluster topology changes, and with moving partitions across nodes when needed to maintain all constraints while ensuring overall load distribution across the cluster. We discussed how Helix’s full-auto rebalancer was designed and improved to meet the various requirements we have encountered as we have continued to build a variety of distributed data systems. Helix provides leverage for building such data systems within LinkedIn and for the community.

 




