With the explosive growth of data and the proliferation of systems in the NoSQL and Big Data space, the number of distributed systems has grown significantly. At LinkedIn, we have built a number of such systems over the years. These systems run on clusters of multiple servers and must handle the problems that come with distribution. Fault tolerance – that is, availability in the presence of server failures and network problems – is critical to any such system. Horizontal scalability and seamless cluster expansion to handle increasing workloads are also essential properties.
Without a framework that provides these capabilities, developers of distributed platforms have to re-implement them again and again. Drawing on LinkedIn's experience building various distributed systems and solving the same problems repeatedly, we decided to build a generic platform to simplify this process – which we eventually called Helix (and which is now in the Apache Incubator).
What is Helix?
The LinkedIn infrastructure stack consists of a number of different distributed systems, each specialized to solve a particular problem. These include online data storage systems (Voldemort, Espresso), messaging systems (Kafka), a change data capture system (Databus), a distributed system that provides search as a service (SeaS), and a distributed graph engine.
Although each system serves a different purpose, they share a common set of requirements:
- Resource management: The resources in the system (such as databases and indexes) must be divided among the nodes in the cluster.
- Fault tolerance: Node failures are unavoidable in any distributed system. However, the system as a whole must remain available in the presence of such failures, without losing data.
- Elasticity: As workload grows, clusters must be able to grow to accommodate the increased demand.
- Monitoring: The cluster must be monitored for node failures as well as other health metrics, such as load imbalance and SLA misses.
Rather than forcing each system to reinvent the wheel, we decided to build Helix, a cluster management framework that solves these common problems. This allows each distributed system to focus on its distinguishing features, while leaving cluster management functions to Helix.
Helix provides significant leverage beyond code reuse. At scale, the operational cost of managing, monitoring, and recovering these systems far outstrips their single-node complexity. A generalized cluster management framework provides a unified way of operating these otherwise diverse systems, leading to operational ease.
Helix at LinkedIn
Helix has been under development at LinkedIn since April 2011. Currently, it is used in production in three different systems:
- Espresso: Espresso is a distributed, timeline consistent, scalable document store that supports local secondary indexing and local transactions. Espresso runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned across multiple nodes, with each partition having a specified number of replicas. Espresso designates one replica of each partition as master (which accepts writes) and the rest as slaves; only one master may exist for each partition at any time. Helix manages the partition assignment, cluster-wide monitoring, and mastership transitions during planned upgrades and unplanned failures. Upon failure of the master, a slave replica is promoted to become the new master (see the sketch after this list).
- Databus: Databus is a change data capture (CDC) system that provides a common pipeline for transporting events from LinkedIn primary databases to caches, indexes and other applications such as Search and Graph that need to process the change events. Databus deploys a cluster of relays that pull the change log from multiple databases and let consumers subscribe to the change log stream. Each Databus relay connects to one or more database servers and hosts a certain subset of databases (and partitions) from those database servers, depending on the assignment from Helix.
- SeaS (Search as a Service): LinkedIn’s Search-as-a-service lets other applications define custom indexes on a chosen dataset and then makes those indexes searchable via a service API. The index service runs on a cluster of machines. The index is broken into partitions and each partition has a configured number of replicas. Each new indexing service gets assigned to a set of servers, and the partition replicas must be evenly distributed across those servers. When indexes are bootstrapped, the search service uses snapshots of the data source to create new index partitions. Helix manages the assignment of index partitions to servers. Helix also limits the number of concurrent bootstraps in the system, as bootstrapping is an expensive process.
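To make the mastership handoff concrete, here is a minimal sketch of a participant-side MasterSlave transition handler using the Apache Helix Java API (StateModel, StateModelInfo, and Transition from the open-source releases). This is not Espresso's actual code; the class name and the storage-layer actions described in the comments are illustrative placeholders.

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Hypothetical storage-node state model; Helix invokes these callbacks
// when it decides to change the state of a partition replica on this node.
@StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
public class StorageNodeStateModel extends StateModel {

  @Transition(from = "OFFLINE", to = "SLAVE")
  public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
    // Open the partition locally and start catching up from the current master.
    System.out.println("Bootstrapping " + message.getPartitionName() + " as a slave replica");
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
    // Invoked when Helix promotes this replica, e.g. after the old master fails
    // or during a planned upgrade. Start accepting writes for the partition here.
    System.out.println("Promoted to master for " + message.getPartitionName());
  }

  @Transition(from = "MASTER", to = "SLAVE")
  public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
    // Stop accepting writes and fall back to replicating from the new master.
  }

  @Transition(from = "SLAVE", to = "OFFLINE")
  public void onBecomeOfflineFromSlave(Message message, NotificationContext context) {
    // Release local resources held for the partition.
  }
}
```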
Try it out
We invite you to download and try out Helix. In the past year, we have had significant adoption and contributions to Helix by multiple teams at LinkedIn. By open sourcing Helix, we intend to grow our contributor base significantly and invite interested developers to participate.
We will also be presenting our paper on Helix, "Untangling Cluster Management with Helix," at the upcoming SOCC (ACM Symposium on Cloud Computing) in San Jose, CA on October 15th, 2012.
Project Summary
Helix is a generic cluster management framework used for automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix provides the following features:
- Automatic assignment of resources and partitions to nodes
- Node failure detection and recovery
- Dynamic addition of resources
- Dynamic addition of nodes to the cluster
- Pluggable distributed state machine to manage the state of a resource via state transitions
- Automatic load balancing and throttling of transitions
Helix introduces the following terminology, which allows you to define a distributed system and express its behavior.
A set of autonomous processes, each referred to as an instance (or participant), collectively constitutes a cluster. The instances collectively perform or support a task, which is referred to as a resource. A resource is a logical entity that can span many instances; for example, a database, topic, index, or any computation/serving task can be mapped to a resource. A resource can be further subdivided into sub-resources or sub-tasks, which are referred to as partitions. A partition is the largest subdivision of a resource that can be associated with a single instance, which implies that a partition cannot span multiple instances. For fault tolerance and/or scalability, a partition can have multiple copies; each copy is referred to as a replica.
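To make the mapping concrete, here is a minimal sketch of declaring a cluster, its instances, and a partitioned, replicated resource through the Helix admin API. The cluster name, ZooKeeper address, resource name, and partition/replica counts are placeholders chosen for illustration.

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.InstanceConfig;

public class ClusterBootstrap {
  public static void main(String[] args) {
    // All names and addresses below are placeholders.
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // A cluster is the set of instances that will host the resource.
    admin.addCluster("MYCLUSTER");

    // Register three instances (participants) with the cluster.
    for (int port = 12001; port <= 12003; port++) {
      InstanceConfig config = new InstanceConfig("localhost_" + port);
      config.setHostName("localhost");
      config.setPort(String.valueOf(port));
      admin.addInstance("MYCLUSTER", config);
    }

    // A resource (e.g. a database) split into 6 partitions, managed with the
    // MasterSlave state model. This assumes the built-in MasterSlave state
    // model definition has been registered with the cluster, which Helix's
    // setup tooling does by default.
    admin.addResource("MYCLUSTER", "MyDB", 6, "MasterSlave");

    // Ask Helix to compute an assignment with 2 replicas per partition.
    admin.rebalance("MYCLUSTER", "MyDB", 2);
  }
}
```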
The above terminology is sufficiently generic to be mapped to any distributed system. But distributed systems vary significantly in their function and behavior. For example, a search system that serves only queries might behave quite differently from a database that supports reads on mutable data. So, to effectively describe the behavior of a distributed system, we need to define the states that its resources and partitions can be in, along with the legal transitions between those states. Helix accomplishes this by letting each system plug in a finite-state machine (FSM).
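As an illustration, the sketch below defines a MasterSlave-like state machine with Helix's StateModelDefinition builder. The constraints are the interesting part: at most one MASTER per partition, and a number of SLAVEs tied to the replication factor R. The builder method names follow the open-source Helix API, but Helix already ships an equivalent built-in model; this only shows how a system would plug in its own.

```java
import org.apache.helix.model.StateModelDefinition;

public class MasterSlaveModel {
  public static StateModelDefinition build() {
    StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MyMasterSlave");

    // States, in priority order (lower number = higher priority).
    builder.addState("MASTER", 1);
    builder.addState("SLAVE", 2);
    builder.addState("OFFLINE");
    builder.initialState("OFFLINE");

    // Legal transitions between states.
    builder.addTransition("OFFLINE", "SLAVE");
    builder.addTransition("SLAVE", "MASTER");
    builder.addTransition("MASTER", "SLAVE");
    builder.addTransition("SLAVE", "OFFLINE");

    // Constraints: at most one MASTER per partition; the number of SLAVEs
    // follows the replication factor R configured for the resource.
    builder.upperBound("MASTER", 1);
    builder.dynamicUpperBound("SLAVE", "R");

    return builder.build();
  }
}
```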
The following figure illustrates Helix terminology. Replicas are color-coded to indicate the node (aka instance) they reside on; for example, P1.r2 is a slave replica and it sits on Node 4:
Distributed System Recipes
The Helix team has written a few recipes to demonstrate common usage patterns. Helix comes with pre-defined state models such as MasterSlave, OnlineOffline, and LeaderStandby, and a recipe is available for each model.
- Distributed lock manager (LeaderStandby model): With this recipe, you simply define the number of locks needed and Helix distributes them among the live nodes. Note: even though the lock manager is built on Helix, leaders are never elected using ZooKeeper ephemeral nodes; instead, the leader is selected by the Helix controller.
- Consumer grouping (OnlineOffline model): Consumer grouping is a simple recipe for fault-tolerant, load-balanced consumption of messages from a pub/sub system. This recipe uses RabbitMQ but can be applied to any pub/sub system.
- Rsync-based replicated file system (MasterSlave model): This recipe demonstrates a system where files written to one server are automatically replicated to slave servers using rsync. Clients always write data to the master, and the data is then replicated. The recipe also provides fault tolerance: if the master dies, a slave is promoted to become the new master.
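All three recipes, like the production systems described earlier, share the same participant-side bootstrapping step: each node connects to the cluster through a HelixManager and registers a factory for the state model it implements, after which a Helix controller (run separately) computes the assignments and fires the transitions. Below is a minimal sketch with placeholder cluster, instance, and ZooKeeper names; the factory wraps the StorageNodeStateModel sketched earlier, and the single-argument createNewStateModel signature follows the early open-source Helix releases (newer versions also pass the resource name).

```java
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.participant.statemachine.StateModelFactory;

public class ParticipantLauncher {

  // Hypothetical factory that hands Helix one StorageNodeStateModel per partition.
  static class StorageNodeStateModelFactory extends StateModelFactory<StorageNodeStateModel> {
    @Override
    public StorageNodeStateModel createNewStateModel(String partitionName) {
      return new StorageNodeStateModel();
    }
  }

  public static void main(String[] args) throws Exception {
    // Placeholder cluster, instance, and ZooKeeper addresses.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MYCLUSTER", "localhost_12001", InstanceType.PARTICIPANT, "localhost:2181");

    // Tell Helix which callbacks to invoke for the MasterSlave state model.
    manager.getStateMachineEngine()
        .registerStateModelFactory("MasterSlave", new StorageNodeStateModelFactory());

    manager.connect();             // join the cluster; Helix now drives state transitions
    Thread.currentThread().join(); // keep the participant alive
  }
}
```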