Introduction

I introduced a few new components.

1) Scheduler

The scheduler is a new component it includes many critical features.

It is in charge of two major features, 1. queueing and routing activations 2. decide whether to add more containers based on the loads.

It has a sub-component `Queue`. The role of the queue is similar to the Kafka topic. A dedicated queue is created for each action.

Each queue will receive activation messages from the Kafka for a given action and send them to the ContainerProxy in response to requests from it.

2) ETCD

ETCD is a distributed reliable key-value store. etcd is mostly used for transactional support and information sharing among components.

3) Akka-grpc

Akka-grpc is introduced to replace Kafka based execution path.

In this version, I could not fully rule out it due to heavy dependencies. My final objective is to exclude it at least from the critical path.


4) Akka-remote

Akka-cluster is used for schedulers to communicate with each other.

Akka-grpc is required to define a grpc message, there are some advantages in akka-cluster when it is being used for simple inter-cluster communication.

It is also being used to send queue creation request to the scheduler.




The entire system is comprised of three flows, 1. queue creation flow, 2. container creation flow, 3. activation flow.


1. Queue Creation Flow

① Once a new request comes, there is no queue for the action, a PoolBalancer starts queue creation flow by randomly choosing a scheduler and send a CreateQueue request via akka-remote.

② If a scheduler receives a QueueCreation request, it tries ETCD transaction, and if succeeded, write the information in ETCD to describe queue creation is in progress. In the meantime, if other schedulers can receive queue creation requests then the transaction will be failed and they will respond with Success because queue creation is already in progress.

③ The QueueManager creates a queue.
④ Once a queue is created, it tries to store its endpoint to ETCD so that containers can connect to the queue. It sends queue endpoint data to DataManagementService.

⑤ A DataManagementService receives the data it stores it in ETCD. Since PoolBalancers are watching ETCD events for queue endpoints. So once a queue is created, it does not try to create a queue for the given action anymore and send activation messages to the scheduler which has the queue.


2. Container Creation Flow

① When a new queue is created, it immediately sends a container creation request to the ContainerManager because there is no container yet.

ContainerManager looks for available invokers. The word "available" means healthy invokers with enough resources.

ContainerManager selects one of the invokers and sends a ContainerCreation request via `invokerN` Kafka topic. While scheduling, it takes Blackbox and resource tags into account. Invokers can have heterogeneous resources and tags describe such resources.

③ At the same time, ContainerManager will register a job for the container creation request to CreationJobManager. 

CreationJobManager is in charge of managing container creation, it stores "in-progress" data for container creation in ETCD. In case container creation takes too long, a timeout happens for the job and CreationJobManager deletes the in-progress data so that more container creation can happen.

④ Invoker receives the ContainerCreation Message via Kafka. Above step ④ happens asynchronously, both steps(④) happens concurrently.

⑤ Invoker creates a ContainerProxy for the give action.

⑥ When container creation is finished or failed, the invoker sends the ack(Success | Rescheduling) to the ContainerScheduler via `creationAck` topic. The ContainerCreationMessage includes the target scheduler endpoint with the queue enabling containers to connect to the right scheduler. In case any error happens while accessing the scheduler, it tries to fetch the queue endpoint data from ETCD. Once the CreationJobManager receives the ack message. It finishes the ContainerCreation, and if it receives Rescheduling ack, it chooses another invoker and sendㄴ ContainerCreation to it.

 At the same time, once a new container is created, it stores its data to ETCD. This data is used while scheduling containers.


3. Activation Flow

① While the queue and the containers are being created, activation flows the system like above. PoolBalancer just sends the request to the scheduler topic in Kafka. The MessageConsumer who subscribes to the topic receives the request and deserializes it. The message is sent to the leader queue. If there is no queue for the given activation message, it will forward it to the proper scheduler by fetching the queue endpoint information from ETCD.

② A ContainerProxy requests activation messages via  FetchActivation API(Akka-grpc).

③ The ActivationService sends a GetActivation request to the proper queue in a long-poll way to avoid the busy-waiting. The singleton QueueManager object has the queue reference so that ActivationService can directly find the queue and send a request.

④ Once an activation is over, the ContainerProxy sends the results to PoolBalancer via `completedM` Kafka topic. After then, ContainerProxy continuously calls the fetchActivation API again whenever the invocation finished so that we can maximize the reuse of containers and take advantage of back-pressure.

⑤ PoolBalancer receives the activation result via `completedM` topic and responds to clients.




  • No labels