The AgentManager interface is our abstraction layer that manages agent connections and message passing to agents. Its responsibilities are as follows:
Let's break down some nomenclature:
A ServerResource in CloudStack is a translation layer between a CloudStack operation and how that operation is performed on the physical resource it interacts with. Examples of ServerResource are the XenServer hypervisor, VMware hypervisor, KVM hypervisor, F5, SRX, NetScaler, etc. The requirement for a ServerResource is that it maps a Command from CloudStack into operations performed on the physical resource without any database work. A ServerResource must never access the database.
An Agent in CloudStack is a container for one or more instances of ServerResource. Its job is to serialize and deserialize the messages and to make connections to the management server. You'll often see ServerResource and Agent used interchangeably because Agents and ServerResource instances are basically one-to-one today. However, that may not be true in the future. Just remember that the Agent is responsible for serialization and connection while the ServerResource is responsible for execution. They are not the same.
Agents are broken down into different types.
Command and Answer are our pattern for message requests and responses. Each Command should have a corresponding Answer. I have seen code that skips the Answer, but that's wrong and should be corrected.
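To make the Command/Answer pairing and the "no database work" rule concrete, here is a minimal sketch. The classes below are simplified, hypothetical stand-ins written for this note, not the real types from com.cloud.agent.api; CheckVmCommand, CheckVmAnswer, and FakeHypervisorResource are illustrative names, and the in-memory map merely stands in for the physical resource.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins (assumptions for this sketch, not the real CloudStack classes).
abstract class Command {}

class Answer {
    final boolean result;   // true if the command succeeded
    final String details;   // failure details, or null on success
    Answer(boolean result, String details) {
        this.result = result;
        this.details = details;
    }
}

// Every Command gets a matching Answer type.
class CheckVmCommand extends Command {
    final String vmName;
    CheckVmCommand(String vmName) { this.vmName = vmName; }
}

class CheckVmAnswer extends Answer {
    final String powerState;
    CheckVmAnswer(boolean result, String details, String powerState) {
        super(result, details);
        this.powerState = powerState;
    }
}

// The resource only translates Commands into operations on the physical
// resource (faked here with a map). It never touches the database.
class FakeHypervisorResource {
    private final Map<String, String> vmPowerStates = new HashMap<>();

    FakeHypervisorResource() {
        vmPowerStates.put("i-2-10-VM", "Running");
    }

    Answer executeRequest(Command cmd) {
        if (cmd instanceof CheckVmCommand) {
            String name = ((CheckVmCommand) cmd).vmName;
            String state = vmPowerStates.get(name);
            if (state == null) {
                return new CheckVmAnswer(false, "VM not found: " + name, null);
            }
            return new CheckVmAnswer(true, null, state);
        }
        // Even an unsupported Command is answered rather than silently dropped.
        return new Answer(false, "Unsupported command");
    }
}
```

Note that the unsupported case still returns an Answer, so the caller always gets a response for every Command it sends.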
The CloudStack management server has two sources of load. Obviously, one source is the number of requests it gets via the web services API. That's outside the scope of this email, but we can talk about how that works in a separate email. The other source is the number of resources the management server cluster has to manage. Our objective is to make sure that we can simply add management servers to scale with the number of resources the cluster manages. The following ensures that.
- Agent Load Balancing: As management servers are started and stopped, agent load balancing rebalances the number of agents each management server handles without interrupting the message passing.
Messages are serialized to and deserialized from JSON by the Gson library. This code is encapsulated in Request.java and Response.java. The actual content of the message depends on the Command and Answer being sent. You should look in the com.cloud.agent.api package if you're interested in that. Note that the Commands and Answers are in cloud-api.jar but Request.java and Response.java are in cloud-core.jar. We expect everyone who writes a resource to depend only on cloud-api.jar. If you can't do that, then something is wrong with the design of the ServerResource.
There may be many reasons why an agent disconnected. It could be because the physical resource is down. It could be because the TCP connection to the physical resource was lost. It could be bugs within CloudStack code. AgentManagerImpl.java has a disconnect method that all disconnects should go through. The job of this method is to determine the appropriate action to take given what we know about the disconnect. VM HA is often triggered as part of this process.
AgentManager maintains an application-level ping with the agents. The ping interval defaults to one minute and the timeout defaults to 2.5x the ping interval; both can be configured. If the application ping times out, AgentManager launches an investigation process that tries to check whether the physical resource is still alive. It does this by talking to a set of Investigators. Each investigation can have one of three results: the resource is Up, the resource is Down, or "I don't know". Upon receiving Up or Down, AgentManager terminates the investigation and uses that state as the reason for the disconnect. Upon receiving "I don't know", AgentManager moves on to the next Investigator until it runs out of Investigators. It is crucial that an Investigator does not return a false positive or a false negative. For example, one particular Investigator pings the IP address of the physical resource. If it receives a ping response, it returns Up. However, if it doesn't receive a ping response, it should return "I don't know", because failure to respond to a ping does not mean the physical resource is Down.
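The timeout arithmetic and the investigation chain described above can be sketched roughly as follows. This is an illustration only: InvestigationResult, Investigator, PingInvestigator, and InvestigationSketch are hypothetical names for this note, not the actual CloudStack classes, and returning "I don't know" when every Investigator is exhausted is an assumption, since the text above does not say what happens in that case.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical three-valued result: Up, Down, or "I don't know".
enum InvestigationResult { UP, DOWN, UNKNOWN }

// Hypothetical interface; the real Investigator API may differ.
interface Investigator {
    InvestigationResult investigate(String hostAddress);
}

// A ping-style investigator: a reply proves the host is Up, but silence
// proves nothing, so it must answer UNKNOWN rather than DOWN.
class PingInvestigator implements Investigator {
    private final Predicate<String> pinger; // injected so the sketch needs no network

    PingInvestigator(Predicate<String> pinger) { this.pinger = pinger; }

    @Override
    public InvestigationResult investigate(String hostAddress) {
        return pinger.test(hostAddress) ? InvestigationResult.UP : InvestigationResult.UNKNOWN;
    }
}

class InvestigationSketch {
    // Timeout is a multiple of the ping interval (2.5x by default per the text).
    static long timeoutMillis(long pingIntervalMillis, double multiplier) {
        return (long) (pingIntervalMillis * multiplier);
    }

    // Ask each Investigator in turn; stop at the first definite answer.
    static InvestigationResult investigate(String hostAddress, List<Investigator> investigators) {
        for (Investigator investigator : investigators) {
            InvestigationResult result = investigator.investigate(hostAddress);
            if (result != InvestigationResult.UNKNOWN) {
                return result;
            }
        }
        // Assumption: with no definite answer, report UNKNOWN to the caller.
        return InvestigationResult.UNKNOWN;
    }
}
```

With a one-minute interval, timeoutMillis(60_000L, 2.5) gives 150000 ms, i.e. two and a half minutes without a ping before the investigation starts.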
The life cycle of a particular resource has the following states, defined in ResourceState.java:
Creating - Being Created
Enabled - Enabled and can perform all commands
Disabled - Disabled and should not accept commands requiring new resource
PrepareForMaintenance - preparing to go into maintenance mode
ErrorInMaintenance - error during preparation. Someone needs to act on this.
Maintenance - Maintenance mode so no commands at all
Error - Something is wrong with the resource.
Unmanaged - leave the resource alone and don't allow any commands to be sent to it.
These states represent the administrator's intent for that resource. They really should be broken down in more detail. For example, it probably should be more like the following, but it is what it is for now.
For agent states, we have the following in Status.java: