Analysis and Design for updating the Management Agent

As ACE-342 points out, we must be able to update the management agent itself. This page captures the requirements, analysis and design for such a feature.

Requirements

The update mechanism should be robust and include a rollback mechanism if installing or starting the update fails for whatever reason. We must never end up in a situation where we no longer have a management agent.
It must be easy to deploy a new version of the management agent, and it would make sense to use ACE for this, so you can just add a new artifact like you would for any other bundle.

Analysis

Before we start the analysis and try to come up with different design alternatives, we first take a look at a process that is related to updating the management agent: initially installing it. It is related because this initial install is our "first" version of the agent:

Its bundle symbolic name determines the actual implementation of the management agent. In theory, targets can have different implementations of the management agent, customised for such targets. Using the bundle symbolic name to identify such implementations makes sense.
The initial version plays a role in any rollback we need to do once we start updating. For a rollback, we need to know where to get the bundle from. Our best bet is to assume that the location that was used to install it can be used again. This is not always the case: if the bundle was installed from a custom input stream, the location tells us nothing about where the bundle actually came from. Even so, the location is our best option. An alternative could be to query the OBR for the BSN and version of the initial version.

Conceptually, we have several design alternatives that we can compare here, and the differences focus on how to check for updates and deliver them. The actual process of updating the management agent itself is covered after these alternatives and is the same for all of them.

Alternative A: Separate Update Process

Our starting point is the fact that bundles and deployment packages, from an updating point of view, are very similar. Both have versions, and both can be updated. The biggest difference is the actual update process: how the incoming stream is processed. That can be seen as an implementation detail. For deployment packages we already settled on a process that consists of the following steps:

Check for updates. Ask the server what versions it has and use that information to determine what to do.
Download an update. An optional step, because we can also install it directly from a remote location.
Install the update. Updates the management agent, rolling back to the previous version if the update fails.

Server side, for deployment packages, we use an endpoint that accepts:

deployment/targetID/versions to return all available versions of a deployment package (or target);
deployment/targetID/versions/version to return a specific deployment package version.

For updates of the management agent we could do something similar. We need to supply the agentID (the BSN of the agent). If we want to also be capable of approving updates for specific targets, we should probably also leave the targetID in there, so we could end up with an endpoint like this:

agent/targetID/agentID/versions
agent/targetID/agentID/versions/version
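
As a rough illustration of the two proposed paths, the sketch below assembles them from their parts. The class and method names are hypothetical and not part of ACE, and rejecting empty segments or segments containing a slash is an assumption, since this page does not say how identifiers are validated.

```java
/** Hypothetical helper, not part of ACE: builds the proposed agent endpoint paths. */
final class AgentEndpoints {
    private AgentEndpoints() {
    }

    /** Builds agent/targetID/agentID/versions, which lists all available agent versions. */
    static String versionsPath(String targetId, String agentId) {
        return "agent/" + segment(targetId) + "/" + segment(agentId) + "/versions";
    }

    /** Builds agent/targetID/agentID/versions/version, which refers to one specific agent version. */
    static String versionPath(String targetId, String agentId, String version) {
        return versionsPath(targetId, agentId) + "/" + segment(version);
    }

    // Assumption: a path segment must be non-empty and must not contain a slash,
    // otherwise the targetID and agentID could no longer be told apart.
    private static String segment(String value) {
        if (value == null || value.isEmpty() || value.indexOf('/') >= 0) {
            throw new IllegalArgumentException("Invalid path segment: " + value);
        }
        return value;
    }
}
```

Keeping the targetID in the path, even while it is ignored on the server, leaves room for a per-target approval process later without changing the endpoint.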

Note that at some point we considered making one endpoint handle both cases, but that would lead to something like:

endpoint/targetID/dp/versions(/version)
endpoint/targetID/ma/agentID/versions(/version)

In other words, they would not be uniform anyway, because for the management agent we need an extra "ID".

For deployment packages, we have an explicit approval process, which allows us to approve (in the client) an update before it is sent to a target. For the management agent update, we can optionally also implement an explicit approval process:

Without an explicit approval process, we simply ignore the targetID, and look in the OBR for all bundles where BSN=agentID and use that to build a list of versions (or fetch a specific version).
With an explicit approval process, we need to do more work in the client, because now we need to keep track of approved versions per target. This means adding a new DeploymentVersionRepository (or something similar) and building an approval process on top of that.
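
The variant without an approval process can be sketched as a simple filter over the repository contents. This is only an illustration: the OBR is modelled here as a list of {BSN, version} string pairs, which is an assumption and not the real OBR API, and versions are assumed to be plain numeric major.minor.micro strings (qualifiers are ignored).

```java
import java.util.Comparator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper, not part of ACE: lists agent versions found in a simplified OBR model. */
final class AgentVersions {
    private AgentVersions() {
    }

    /** Orders "major.minor.micro" strings numerically, so 1.10.0 sorts after 1.9.0. */
    static final Comparator<String> BY_VERSION = Comparator
        .comparingInt((String v) -> part(v, 0))
        .thenComparingInt(v -> part(v, 1))
        .thenComparingInt(v -> part(v, 2));

    // Returns the numeric part at the given index, or 0 when the part is absent.
    private static int part(String version, int index) {
        String[] parts = version.split("\\.");
        return index < parts.length ? Integer.parseInt(parts[index]) : 0;
    }

    /**
     * Collects the versions of all entries whose BSN equals the agentID.
     * The targetID is deliberately not a parameter: without approval it is ignored.
     *
     * @param obr simplified repository contents, each entry being {bsn, version}
     */
    static SortedSet<String> availableVersions(List<String[]> obr, String agentId) {
        SortedSet<String> versions = new TreeSet<>(BY_VERSION);
        for (String[] entry : obr) {
            if (entry[0].equals(agentId)) {
                versions.add(entry[1]);
            }
        }
        return versions;
    }
}
```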

Target side, we can come up with a similar control API (currently being redesigned as part of ACE-347). Note that this control API allows us to control how and when updates are installed.

Client side, all we need to do is to make sure our management agent bundle is in the OBR.

Alternative B: Custom Artifact Update Process

We can also treat a management agent as a new type of artifact and ship it as part of a normal deployment package. An artifact recognizer and helper could detect a bundle with a special header (to distinguish it from normal bundles). On the client, we have to know what agent was initially installed (its symbolic name has to match) and we then have to associate this management agent artifact with the target.

A resource processor would handle this bundle on the target. It would then park the bundle somewhere on the filesystem, because it cannot update the management agent during a deployment.

The management agent could then detect that there is a new bundle available on the local filesystem and trigger the update process. The target side control API can still be similar, but the check for updates would probably just check the local filesystem.

Alternative C: Custom Deployment Package

Another alternative would be to create a special target called "management agent" and associate the management agent bundle with it. This does not necessarily need special helpers or recognizers, as the main difference is that this target definition is used by all "real" targets to simply fetch a deployment package that contains the management agent.

The installation of such a deployment package could then be completely different in that the deployment package is simply "stripped off" to reveal just the bundle.

Alternative D: Use a preprocessor to add the management agent to the deployment package

Here we insert the management agent into the deployment package on the server, just before it goes to the target. On the target we intercept the stream, remove the agent again and then send the resulting stream to the management agent. Like with B and C we then still need to "park" the agent somewhere and then start the update process for it.

Updating the Management Agent

The starting point for this update process is an available URL referring to the "new" management agent and, in case we need to roll back, a URL referring to the "current" management agent. The update process itself should be done by a different bundle, as bundles cannot reliably update themselves. So at the start of this process we create a new, temporary bundle. We have to hand this bundle both URLs so it can try to perform the update.

Updating the management agent is nothing more than invoking an update on the bundle using the "new" URL. The update() method is guaranteed to either perform the update or, when that fails, leave the original bundle in place. However, assuming the management agent was active, we can also end up in a scenario where the bundle is updated but afterwards won't resolve or start. This is not considered part of the update method itself (step 6 in 6.1.4.36) and it might still leave us with a non-functional management agent. Therefore we need to take this into account and have the option to go back to the "current" version, using the supplied URL. If, for some reason, even that fails, we could try going back to the "initial" version (the location of the bundle). If even that does not work, we have to rely on some external "factory reset".
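
The fallback order described above ("new", then "current", then "initial", then factory reset) can be sketched as a loop over candidate URLs. The AgentUpdater interface is a hypothetical stand-in for the framework calls; in a real OSGi framework it would roughly correspond to calling update(InputStream) and start() on the agent's Bundle. Everything else here is an assumption made for illustration.

```java
import java.util.List;

/** Hypothetical abstraction: update the agent bundle from a URL and make sure it starts. */
interface AgentUpdater {
    void updateAndStart(String url) throws Exception;
}

/** Hypothetical helper, not part of ACE: walks the fallback chain for an agent update. */
final class FallbackUpdate {
    private FallbackUpdate() {
    }

    /**
     * Tries each candidate in order and stops at the first one that yields a working agent.
     *
     * @param candidates URLs in order of preference, e.g. "new", "current", "initial"
     * @return the URL that worked, or null if none did, in which case only an
     *         external "factory reset" can restore the agent
     */
    static String updateWithFallback(AgentUpdater updater, List<String> candidates) {
        for (String url : candidates) {
            try {
                updater.updateAndStart(url);
                return url;
            } catch (Exception e) {
                // Update, resolve or start failed: fall through to the next candidate.
            }
        }
        return null;
    }
}
```

Treating a failure to resolve or start the same way as a failed update is the point of the sketch: the guarantee of update() alone does not cover those cases.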

Finally, when the update is finished, the temporary bundle we had installed can be uninstalled again.

Conclusions

Comparing the positives and negatives of all alternatives gives us the following matrix:

Alternative	Positive	Negative
A	Same design as the update process for deployment packages. Simple management of agents, just upload them to the OBR.	Requires extra HTTP calls to check for updates (which can be solved by HTTP 1.1 pipelining).
B	We do not need new endpoints, since the transport of the agent is done as part of the deployment package.	Creating a whole new artifact helper, recognizer and resource processor seems overkill. Updates of just the management agent trigger an update of the target, which conceptually messes up the version number. We need to match up the agent BSN with the right agent artifact in ACE. Mismatch in semantics between artifact life cycle and management agent: you never want to delete it, and you never want to have more than one.
C	Like B, no new endpoints are needed.	We really misuse a target for something else. It has no audit log and different semantics. We need to match up the agent BSN with the right target name. Mismatch in semantics between artifact life cycle and management agent: you never want to delete it, and you never want to have more than one.
D	Transparent to the current process.	Post- and preprocessing deployment packages are relatively expensive extra steps. Hard to add an explicit approval process later. Deployment packages only get sent when there is some other update, so it's impossible to just deploy a management agent update.

The conclusion is that we should go for alternative A and not yet implement a full approval process for the management agent. It is the simplest and most elegant solution.

Design

Our design starts with the API that we offer to consumers that wish to control this process. Similar to the DeploymentHandler that is part of the AgentControl API we introduce a handler to update the agent:

public interface UpdateHandler {
  /** Returns the locally installed version of the agent. */
  Version getInstalledVersion();

  /** Returns the versions available on the server. */
  SortedSet<Version> getAvailableVersions() throws RetryAfterException, IOException;

  /** Returns an input stream for the update of the agent. */
  InputStream getInputStream(Version version) throws RetryAfterException, IOException;

  /** Returns a download handle to download the update of the agent. */
  DownloadHandle getDownloadHandle(Version version) throws RetryAfterException, IOException;

  /** Returns the size of the update of the agent. */
  long getSize(Version version) throws RetryAfterException, IOException;

  /** Installs the update of the agent. */
  void install(InputStream stream) throws IOException;
}

Most of this is fairly straightforward. Typically a consumer will start by comparing the installed version to the highest available one and then decide to either first download an update of the agent and then install it, or directly stream and install it. The actual update process of the management agent does deserve more design attention. In the analysis we already described the process under 'Updating the Management Agent'.
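
The version comparison a consumer typically starts with could look like the sketch below. It is written generically over any Comparable version type so that it stands on its own; the class and method names are hypothetical, and the set is assumed to use the natural ordering of its elements.

```java
import java.util.Optional;
import java.util.SortedSet;

/** Hypothetical helper, not part of ACE: decides whether a newer agent version is available. */
final class UpdateCheck {
    private UpdateCheck() {
    }

    /**
     * Returns the highest available version if it is newer than the installed one,
     * and an empty Optional otherwise (including when nothing is available).
     */
    static <V extends Comparable<V>> Optional<V> newerVersion(V installed, SortedSet<V> available) {
        if (available.isEmpty()) {
            return Optional.empty();
        }
        V highest = available.last();
        return highest.compareTo(installed) > 0 ? Optional.of(highest) : Optional.empty();
    }
}
```

With the UpdateHandler above, a consumer would pass in the results of getInstalledVersion() and getAvailableVersions(), and on a non-empty result either call install(getInputStream(version)) directly or go through getDownloadHandle(version) first.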

After invoking 'install' a temporary bundle is installed. It publishes an update service:

public interface UpdateService {
  /** Updates the agent from the current version to a new version. */
  void update(InputStream from, InputStream to);
}

When the service appears, its 'update' method is invoked by the management agent. The method returns immediately; the real work is done asynchronously.

The update service looks up the management agent bundle and performs an update using the 'to' stream. If that fails, or the new agent refuses to start, we proceed by trying to re-install the original 'from' stream. If that does not work either, we have to rely on some external "factory reset". The reason for passing InputStreams here instead of (for example) URLs is that opening the URLs should be done by a connection factory (to facilitate authentication, if configured) and we want to make sure that has already happened before we even attempt the update.
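
A minimal sketch of the asynchronous hand-off follows, under several assumptions: AgentBundle is a hypothetical stand-in for the framework's bundle object, the service is used only once (matching the temporary bundle), and awaitCompletion exists purely so that callers and tests can observe the outcome. It is not the actual ACE implementation.

```java
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the management agent's bundle in the framework. */
interface AgentBundle {
    void update(InputStream in) throws Exception;

    void start() throws Exception;
}

/** Sketch of an update service whose update method returns before the work is done. */
final class AsyncUpdateService {
    private final AgentBundle agent;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    AsyncUpdateService(AgentBundle agent) {
        this.agent = agent;
    }

    /** Schedules the update and returns immediately; may only be called once. */
    public void update(InputStream from, InputStream to) {
        executor.execute(() -> {
            if (!tryInstall(to)) {
                // The new agent failed to update or start: re-install the original.
                // If this fails as well, only an external "factory reset" remains.
                tryInstall(from);
            }
        });
        // One-shot service: accept no further work, but let the queued task finish.
        executor.shutdown();
    }

    // Updates and starts the agent; any exception counts as a failed attempt.
    private boolean tryInstall(InputStream in) {
        try {
            agent.update(in);
            agent.start();
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    /** Assumption for illustration: lets callers wait for the background work to end. */
    boolean awaitCompletion(long timeoutMillis) {
        try {
            return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because both streams are handed over already opened, the background task never has to deal with authentication while the agent itself is being replaced.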

Finally, when the update is finished, the temporary bundle we had installed can be uninstalled again.

Test scenarios

Update of the management agent works, the new agent cleanly installs.
Update of the management agent fails because the new agent does not resolve.
Update of the management agent fails because the new agent does not start.
Update of the management agent fails because of an IOException.
Update of the management agent fails because the framework is shut down during the update.
