Architectural differences
We currently basically have 2 ways of deploying OpenWhisk:
- On raw VMs or bare-metal machines, where we need to deploy the invokers to each of the worker nodes to be able to create containers locally and manage their lifecycles locally. In this case, the system has very low level access to container primitives to be able to make a lot of performance optimisations.
- Atop a container orchestrator, where we can transparently generate containers across a cluster of worker nodes. The concept of the invoker feels foreign in this case, because its implementation is geared towards use-case 1.
Yes, there are multiple flavors to both of those (there's an invoker-centric version of the Kubernetes deployment for instance), but in essence it boils to the two flavors described above.
The issue here is, that the needs and abstractions for these two types are different and make it hard to exist next to each other in one shared codebase. This causes more friction than necessary.
Proposal
I believe we can (and should) converge on an architecture that abstract the VM/bare-metal case away and give the Controller direct access to the containers for each action. Lots of proposals out there already go into that direction (see: Problem Analysis - Autonomous Container Scheduling v1, Clustered Singleton Invoker for HA on Mesos) and it is a natural fit for the orchestrator based deployments, which the whole community seems to move to.
The following is a proposal on where we could move to in the future. Please note that not all of it is completely thought out until the end and there are open questions. I wanted to get this out into the public though because I think there is some potential here for us to converge our topologies and get an overall better and cleaner picture.
The main goal of this document is to help us discuss on the overall architecture and then decide on a direction that we all think is viable to go in. It is not meant to be an exhaustive specification of that future architecture. All of the very fine details can be discussed in seperate discussions. It is meant though to identify early show-stoppers or concepts that we think cannot fly.
1. Overview
The overall architecture revolves around a clear distinction between user-facing abstractions (implemented by the Controller as of today) and the execution system, which is responsible for loadbalancing and scaling. On the hot path, only HTTP is spoken directly into the containers (similar to Problem Analysis - Autonomous Container Scheduling v1 in a sense). The Controller orchestrates the full workflow of an activation (calling `/run`, getting the result, composing an activation record, writing that activation record). This eliminates both Kafka and the Invoker as we know it today from the critical path. Picking a container boils down to choosing a free (warm) container from a locally known list of containers for a specific action in the ContainerRouter.
Container creation happens at a component called ContainerManager (which can reuse a lot of today's invoker's code, like the container factories). The ContainerManager is a cluster-singleton (see Clustered Singleton Invoker for HA on Mesos). It talks to the API of the underlying container orchestrator to create new containers. If the ContainerRouter has no more capacity for a certain action, that is, it exhausted its containers known for that action, it requests more containers from the ContainerManager.
The ContainerManagerAgent is an extension to the ContainerManager. It is placed on each node in the system to either orchestrate the full container lifecycle (no orchestrator case) or provide the means necessary to fulfill some bits of the container lifecycle (like pause/unpause in the Kubernetes case).
Kafka is still in the picture to handle overload scenarios (see Invoker Activation Queueing Change).
General dependencies between components
Full lifecycle of invocation and container creation
Invocation path in steady state
2. Loadbalancing:
Loadbalancing is no longer a concern of the Controller, this responsibility moves into the ContainerRouter. The ContainerRouter no longer need to do any guesstimates to optimise container reuse. They are merely routers to containers they already know are warm, in fact, the ContainerRouters only know warm containers.
Loadbalancing is more a concern of container creation time, where it is important to choose the correct machine to place the container on. Therefore, the ContainerManager can exploit the container orchestrator's existing placement algorithms without reinventing the wheel or can do the placement itself when we don't have an orchestrator running underneath.
Since loadbalancing happens at container creation time, the operation is a lot less time critical than scheduling at invocation time (like today). That is not to say it doesn't matter how long it takes, but when the whole container creation needs to be done anyway (which I assume is in the ballpark of roughly 500ms at least), we can easily afford to take 10-20ms to decide where to place that container. That opens the door for more sophisticated placement algorithms while not adding any latency to the critical warm path itself.
3. Scalability:
The Controllers and ContainerRouters in this proposal need to be massively scalable, since they handle a lot more than they do today. The Controllers have no state anymore, they can scale freely. The only state known to the ContainerRouters is which containers exist for which action.
As each of those containers can only ever handle C concurrent requests (where C in the default case is 1), it is of utmost importance that the system can **guarantee** that there will never be more than C requests to any given container. Therefore, shared state between the ContainerRouters is not feasible due to its eventually consistent nature.
Instead, the ContainerManager divides the list of containers for each action distinctively across all existing ContainerRouters. If we have 2 ContainerRouters and 2 containers for action A, each ContainerRouters will get one of those containers to handle himself.
Edge case: If an action only has a very small amount of containers (less than there are ContainerRouters in the system), we have a problem with the method described above. Since the front-door schedules round-robin or least-connected, it's impossible to decide to which Controller the request needs to go to hit that has a container available.
In this case, the other ContainerRouters (who didn't get a container) act as a proxy and send the request to a ContainerRouters that actually has a container (maybe even via HTTP redirect). The ContainerManager decides which ContainerRouters will act as a proxy in this case, since its the instance that distributes the containers.
Edge case diagram
4. Throttling/Limits:
Limiting in OpenWhisk should always revolve around resource consumption. The per-minute throttles we have today can be part of the front-door (like rate-limiting in nginx) but should not be part of the OpenWhisk core system.
Concurrency throttles and limits can be applied at container creation time. When a new container is requested, the ContainerManager checks the consumption of containers for that specific user. If she has exhausted her concurrency throttle, one of her own containers (of a different action) can be evicted to make space for that new container. If all containers are already of the same action, the call will result in a `429` response.
This section isn't fully fleshed out just yet and only contains basic ideas on how we could approach the issue.
5. Logs:
Logs are collected via the container orchestrator's API (see KubernetesContainerFactory) or via the ContainerManagerAgent, which is placed on each node.
In general, OpenWhisk should move to a more asynchronous log handling, where all the context needed is already part of each log line which can then be forwarded asynchronously on the node where the container is running rather than threading it through a more centralized component like the controller.
In theory this is a seperate discussion, it is very relevant to the viability of this proposal though.
6. Container Lifecyle:
Container creation is the responsibility of the ContainerManager. It talks to the underlying container orchestrator to generate containers (see ContainerFactorys like today). In the Kubernetes/Mesos case this is very straightforward, as they expose APIs to transparently generate containers across a cluster of nodes. For a potential case with no underlying orchestration layer, we build this layer ourselves by clustering the ContainerManagerAgents on all the nodes and talk to them to generate containers on the specific hosts. Unlike the Mesos/Kubernetes case, this warrants writing a placement algorithm as well.
The rest of the container lifecycle is responsibility of the ContainerRouters. It instructs the ContainerManagerAgent to pause/unpause containers. Removal (after a certain idle time or after a failure) is requested by the ContainerRouters from the ContainerManager again. The ContainerManager can also decide to remove containers, when it needs to make space for other containers. If it comes to that conclusion, it will first ask the ContainerRouters owning this Container to give it back (remove it from its list effectively). Only then can the container be removed.
Not very fleshed out either, mostly a suggestion. We need to make sure the network traffic generated from the ContainerRouters to all the containers to lifecycle the containers is viable. Under bursts though, that should not matter because of the pause grace.
Conclusion
The above gives us a unified architecture for both the cases where a container orchestrator is present and for the cases where an orchestrator is absent. That should result in less friction while developing for different topologies and it should be a more natural fit for container orchestrator's than what we have today.
On top of that, it moves a lot of work off the critical path (loadbalancing, placement decisions, Kafka) and makes the Controllers a lot more scalable than today, which is crucial for this architecture.
The path to optimize gets a lot thinner which could result in easier trackable performance bottlenecks and less code to optimize for performance. In this architecture, all deployment cases look at the very same path when it comes to optimizing burst performance for instance.
Possible execution backend implementations
- Raw VM/Bare-metal based: The ContainerManager and ContainerManagerAgent work in conjunction to create containers. The ContainerManager will need to implement a scheduling algorithm to place the container intelligently.
- Kubernetes based: The ContainerManager talks to the Kubernetes API to create new Pods.
- Mesos based: The ContainerManager talks to the Mesos API to create new containers.
- Knative based: The whole execution layer above vanishes and is replaced by Knative serving, which implements all the routing, scaling etc. itself.