Introduction

One of the major benefits of using Helix to provision containers is that Helix can also schedule other tasks or resources to run inside those containers. Users then don't need to guess how many resources their tasks will consume; Helix will automatically scale containers up or down to manage the load. This document describes the design of such a task execution interface and framework.

Problem Statement

Helix provisioning allows a target provider to dictate how many containers Helix should tell provisioners to have running at any given time. The CPU and memory requirements for these containers are the only constraints a user should need to provide. However, it's not enough just to have allocated containers running; these containers still need to do something meaningful. They may run a long-lived service, or they may support an arbitrary set of tasks started on demand or on a schedule. Users generally don't know how many physical resources to allocate for these tasks, so we'd like to simply schedule the tasks and let Helix optimize the number of running instances that support them. To do this, we need to be able to schedule tasks inside Helix-scheduled containers.
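To make the target provider contract concrete, here is a minimal sketch of how a target container count could be derived from the CPU and memory constraints a user provides. The class name, capacity units, and sizing policy are all illustrative assumptions, not part of the existing Helix code.

```java
// Illustrative only: a hypothetical target provider that derives the desired
// container count from aggregate CPU/memory demand and per-container capacity.
public class SimpleTargetProvider {
    private final int containerCpuMillis; // CPU capacity of one container, in millicores
    private final int containerMemoryMb;  // memory capacity of one container, in MB

    public SimpleTargetProvider(int containerCpuMillis, int containerMemoryMb) {
        this.containerCpuMillis = containerCpuMillis;
        this.containerMemoryMb = containerMemoryMb;
    }

    /** Number of containers needed to fit the aggregate CPU and memory demand. */
    public int targetContainerCount(int totalCpuMillis, int totalMemoryMb) {
        int byCpu = ceilDiv(totalCpuMillis, containerCpuMillis);
        int byMemory = ceilDiv(totalMemoryMb, containerMemoryMb);
        // The binding constraint (CPU or memory) wins; keep at least one container.
        return Math.max(1, Math.max(byCpu, byMemory));
    }

    private static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }
}
```

With 1-core/2 GB containers, a demand of 2.5 cores and 1 GB yields three containers, since CPU is the binding constraint.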

Existing Work

There is currently a task framework checked into Helix trunk that does the following:

  1. Defines a workflow as a collection of tasks that are started according to a DAG
  2. Models each task as a single resource whose partitions are individual instances of the task, each run alongside a partition of a target resource
  3. Updates the overall task status as each task partition completes. When all partitions are complete, the task is marked complete; when all tasks in a workflow are complete, the workflow is complete.
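The completion semantics above can be sketched with a small model: a workflow as a DAG of tasks, each with some number of partitions, where a task completes when all of its partitions finish and the workflow completes when all tasks do. This is an illustrative model only, not the actual Helix task framework API; all names are assumptions.

```java
import java.util.*;

// Illustrative sketch (not the real Helix API): a workflow as a DAG of
// partitioned tasks with completion rolled up from partition -> task -> workflow.
public class WorkflowSketch {
    static class Task {
        final String name;
        final int partitions;
        final Set<String> dependencies; // tasks that must complete before this one starts
        final Set<Integer> donePartitions = new HashSet<>();

        Task(String name, int partitions, String... dependencies) {
            this.name = name;
            this.partitions = partitions;
            this.dependencies = new HashSet<>(Arrays.asList(dependencies));
        }

        /** A task is complete once every partition has finished. */
        boolean isComplete() { return donePartitions.size() == partitions; }
    }

    final Map<String, Task> tasks = new LinkedHashMap<>();

    void addTask(Task t) { tasks.put(t.name, t); }

    /** A task may start only when all of its DAG predecessors have completed. */
    boolean canStart(String name) {
        for (String dep : tasks.get(name).dependencies) {
            if (!tasks.get(dep).isComplete()) return false;
        }
        return true;
    }

    /** Record that one partition of a task finished. */
    void completePartition(String name, int partition) {
        tasks.get(name).donePartitions.add(partition);
    }

    /** The workflow is complete once every task is complete. */
    boolean isWorkflowComplete() {
        for (Task t : tasks.values()) {
            if (!t.isComplete()) return false;
        }
        return true;
    }
}
```

For example, a task B that depends on a two-partition task A cannot start until both of A's partitions report completion.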

Proposed Design

Risks

Dependencies

Performance, Scalability, and Provisioning

Monitoring and Alerting

Roll-Out Plan

...

We can build on the existing monitoring work to let target providers query the monitoring provider for CPU, memory, and load metrics, enabling more sophisticated target providers.
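As a sketch of what such a monitoring-driven target provider might look like, the following adjusts the target container count to keep observed average CPU utilization inside a band. The class name, thresholds, and one-container-at-a-time policy are illustrative assumptions, not part of the existing monitoring work.

```java
// Hypothetical sketch: a load-based target provider that consumes an average
// CPU utilization metric (as reported by a monitoring provider) and nudges the
// target container count to keep utilization within [scaleDown, scaleUp].
public class LoadBasedTargetProvider {
    private final double scaleUpThreshold;   // e.g. scale up above 75% average CPU
    private final double scaleDownThreshold; // e.g. scale down below 25% average CPU

    public LoadBasedTargetProvider(double scaleUpThreshold, double scaleDownThreshold) {
        this.scaleUpThreshold = scaleUpThreshold;
        this.scaleDownThreshold = scaleDownThreshold;
    }

    /** Compute the next target count from the current count and observed load. */
    public int nextTargetCount(int currentCount, double avgCpuUtilization) {
        if (avgCpuUtilization > scaleUpThreshold) {
            return currentCount + 1; // overloaded: add a container
        }
        if (avgCpuUtilization < scaleDownThreshold && currentCount > 1) {
            return currentCount - 1; // underloaded: remove a container, keep at least one
        }
        return currentCount; // within the band: hold steady
    }
}
```

Stepping one container at a time keeps scaling decisions conservative; a real provider would likely also smooth the metric over a window to avoid oscillation.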