...

The figure below describes the lifecycle of a Samza application running on Kubernetes.

Figure 2. Lifecycle of Samza applications running on Kubernetes

...

  • The run-app.sh script is started with the location of your application’s binaries and its config file. The script instantiates an ApplicationRunner, which is the main entry point responsible for running your application.

  • The ApplicationRunner parses your configs and writes them to a special Kafka topic, the Coordinator Stream, for distribution. It then submits a request to the Kubernetes API server to launch the job-coordinator Pod.

  • The job-coordinator Pod (the AM, in YARN’s parlance) is started. It is then responsible for managing the overall application. It reads configs from the Coordinator Stream and computes work assignments for individual Pods.

  • It also determines the hosts each Pod should run on, taking data locality into account. It then sends Pod creation requests to the API server.

  • The Kubelet watches for these requests and starts the task Pods. If the application’s dependencies are hosted in remote artifact repositories such as HDFS, they need to be downloaded to the nodes first. There are several ways to download them:

    • M1: the task Pod can leverage the Kubernetes Init-container functionality to download the dependencies.

    • M2: the regular container can download the dependencies before executing the core logic.

    • M1 vs M2: Init-containers are guaranteed to run before the regular containers. In M1, if the regular container fails, the Init-container is not re-run. In M2, if the regular container fails and is restarted, it must itself take care not to re-run the logic that downloads the resources.

    • M3: another option is to pre-bake all the dependencies into the container image itself, but that is less flexible because it requires all the code and configs to be available in the image. This method can always be used, regardless of whether M1 or M2 is chosen.

  • When a task Pod starts, it first queries the job-coordinator Pod to determine its work assignments and configs. It then proceeds to execute its assigned tasks.

  • The job-coordinator follows the typical control-loop pattern, ensuring that the current state matches the desired state. For example, it monitors how many task Pods are alive and, if any fail, creates new Pods to match the desired number of replicas.
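The control loop in the last step can be sketched as follows. This is a minimal illustration, not the proposed implementation: `FakeCluster`, `reconcile`, and the `samza-task-<id>` naming scheme are hypothetical, and the in-memory dictionary stands in for calls to the Kubernetes API server.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the Kubernetes API server (hypothetical)."""
    pods: dict = field(default_factory=dict)  # pod name -> phase

    def create_pod(self, name):
        self.pods[name] = "Running"

    def delete_pod(self, name):
        self.pods.pop(name, None)


def reconcile(cluster, desired_ids):
    """One pass of the control loop: make live task Pods match the desired set.

    Returns the names of the Pods created during this pass.
    """
    created = []
    for container_id in desired_ids:
        name = f"samza-task-{container_id}"  # naming scheme is an assumption
        if cluster.pods.get(name) != "Running":
            # Missing or failed Pod: drop any stale entry, then create a replacement.
            cluster.delete_pod(name)
            cluster.create_pod(name)
            created.append(name)
    return created
```

Running `reconcile` repeatedly is safe: once every desired Pod is alive, a pass creates nothing, which is the property that lets the job-coordinator call it on a timer or on every Pod event.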

Host Affinity & Kubernetes

...

  1. Prepare a base container image for Samza applications that includes all the Samza framework jars, etc.

  2. The run-app.sh script and the ApplicationRunner need to be modified to support submitting apps to the Kubernetes API server.

  3. Implement a KubeClusterResourceManager, similar to the YARN AM, that creates task Pods in Kubernetes and re-creates them if any task fails.

  4. Develop a module that can download the dependencies from remote artifact repositories. The module can run in the init-container or be embedded in the main/regular container.

  5. Refactor Samza Core logic to support Samza on K8s and Samza on Yarn. 
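When the download module from step 4 is embedded in the regular container, it has to avoid downloading again after a container restart. A minimal sketch of one way to do that with a marker file is shown below. The function name `ensure_dependencies`, the `.download-complete` marker, and the `fetch` callback are all hypothetical, and the sketch assumes the target directory lives on a volume that survives container restarts within the Pod.

```python
import os


def ensure_dependencies(target_dir, fetch):
    """Download dependencies into target_dir unless a previous run already did.

    `fetch` is a caller-supplied function (hypothetical) that copies the
    artifacts from the remote repository into the directory it is given.
    Returns True if a download happened, False if it was skipped.
    """
    marker = os.path.join(target_dir, ".download-complete")
    if os.path.exists(marker):
        return False  # a previous container run already finished the download
    os.makedirs(target_dir, exist_ok=True)
    fetch(target_dir)
    # Write the marker last so that a crash mid-download triggers a retry.
    with open(marker, "w") as f:
        f.write("ok")
    return True
```

With an init-container the marker is unnecessary, because the init-container is not re-run when only the regular container restarts.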

Job-Coordinator Details

The Job-Coordinator is very similar to the YARN AM. When it starts, it first reads the JobModel from the coordinator stream and then creates Pods in Kubernetes with the container information provided. The Kubelet then starts the containers.
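As a rough illustration of turning a JobModel into Pod creation requests, the sketch below builds one Pod manifest per container. The manifest fields (`apiVersion`, `kind`, `metadata`, `spec.containers`, `restartPolicy`) are standard Kubernetes Pod fields; everything else is an assumption made for the example, including the simplified dictionary used as a JobModel, the label keys, the environment variable names, and the choice of `restartPolicy: Never`.

```python
def build_task_pod(job_name, container_id, image, coordinator_url):
    """Return a Pod manifest (as a dict) for one Samza container.

    The label keys and environment variable names are illustrative only.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{job_name}-container-{container_id}",
            "labels": {"samza-job": job_name, "samza-container-id": str(container_id)},
        },
        "spec": {
            # Assumption: the job coordinator, not the Kubelet, replaces failed Pods.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "samza-container",
                    "image": image,
                    "env": [
                        {"name": "SAMZA_CONTAINER_ID", "value": str(container_id)},
                        {"name": "SAMZA_COORDINATOR_URL", "value": coordinator_url},
                    ],
                }
            ],
        },
    }


def pods_for_job_model(job_name, job_model, image, coordinator_url):
    """Create one manifest per container listed in the (simplified) JobModel."""
    return [
        build_task_pod(job_name, cid, image, coordinator_url)
        for cid in sorted(job_model["containers"])
    ]
```

Each manifest would then be submitted to the API server in a single create call, after which the Kubelet on the chosen node starts the container.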

...


Figure 3. Workflow of Samza on Kubernetes

...

Note that there is a difference here between Kubernetes and YARN. In YARN, there are usually two stages: the AM first requests the containers and then launches the containers on the NodeManager.

In Kubernetes, requesting containers and starting containers are done in a single call. The createPod API will request the pods and then the pods will be started by kubelet automatically.

Because of this difference, the logic in the current Samza on YARN implementation that stores resource requests and then matches the received containers against the stored pending requests is not needed on Kubernetes.
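The difference can be made concrete with a toy comparison. Both classes below are hypothetical simplifications, not the actual Samza classes: the YARN-style allocator has to keep a queue of pending requests and match them when containers arrive, while the Kubernetes-style allocator needs no such bookkeeping because the create call both requests and starts the Pod.

```python
from collections import deque


class YarnStyleAllocator:
    """Two stages: record a request, then match it when a container arrives."""

    def __init__(self):
        self.pending = deque()  # stored resource requests awaiting a container
        self.launched = {}      # Samza container id -> allocated container

    def request(self, samza_container_id):
        self.pending.append(samza_container_id)

    def on_container_allocated(self, yarn_container):
        # Stage two: pair the received container with the oldest pending request.
        samza_container_id = self.pending.popleft()
        self.launched[samza_container_id] = yarn_container


class KubernetesStyleAllocator:
    """Single stage: creating the Pod both requests and starts it."""

    def __init__(self, create_pod):
        self.create_pod = create_pod  # hypothetical stand-in for the create-Pod call
        self.launched = {}

    def request(self, samza_container_id):
        self.launched[samza_container_id] = self.create_pod(samza_container_id)
```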

...

Comparison with Samza on Yarn

Figure 4. Workflow of Samza on Yarn

By comparing Figure 3 and Figure 4, we can see the workflow of Samza on K8s is simpler than the workflow of Samza on Yarn. We need to refactor Samza Core logic to support both. 

Interfaces

WIP .. Coming soon

Reference

...