This is a functional specification of the Open vSwitch (OVS) controller within CloudStack
- v 0.1 Initial cut
- v 0.99 First complete draft (04/04/2012)
- Cloud operator downloads CloudStack release and installs CloudStack. She wishes to use the GRE tunnel method of isolation. She enables this flag in the configuration database and restarts CloudStack management servers. When she creates a zone with Advanced Networking and creates a physical network within the zone, she is presented with a choice to use GRE isolation
- Cloud operator wants to enable firewall and load balancing service on top of the GRE-isolated network. She adds a public VLAN and a public IP range to this network. She creates a network offering with these services. End-users now use this offering to create networks.
- Cloud operator wants to enable firewall services using Juniper SRX on top of the GRE-isolated network. Not supported
- Cloud operator wants to enable both VLANs and GRE tunnels as isolation methods. Not supported
The goal of this feature is to broadly support all of CloudStack's virtual networking functions while removing the limitations associated with VLANs. Among the limitations are:
- Scaling: a maximum of 4094 VLANs per datacenter is possible. This number is however a theoretical maximum. The actual number of VLANs that can be configured is often limited by the capabilities of the physical switches in the data center, as they need to maintain a distinct forwarding table for each VLAN.
- Configuration complexity: VLAN information has to be consistently provisioned on all networking hardware
- Broadcast containment: broadcasts within one VLAN cause needless flooding on links not using that VLAN
- Flexibility: since VLANs are terminated at layer 2, they do not allow defining virtual networks that span different L2 broadcast domains, unless VLANs are allowed to traverse the aggregation and core layers of the data center, which can cause traffic "tromboning".
- Occasional packet loss / out of order delivery can occur on active connections while GRE tunnels are being created. For TCP connections, this is simply handled with re-transmission, leading to a negligible and temporary network performance degradation.
- The 'enableXenServerNetwork' script sometimes fails, leading to a VM start failure detected by the CloudStack management server. The HostAllocator picks another host in this case. (bug
- Part of the GRE key for the network is displayed in the VLAN column (see related bug 14501)
Failures in setting up the bridge or configuring the GRE tunnels will not cause a failure of the VM startup process. The VM will be started anyway, even if networking might be compromised. When starting a subsequent VM, the tunnel manager will try again to create the tunnels which previously failed.
NOTE: This is the behaviour as currently implemented, but not yet committed. The alternative approach would be to fail VM startup if an error occurs while setting up either the OVS bridge, the tunnels, or the broadcast storm prevention rules. Alternately, a synchronization framework (like the one used by the SecurityGroupManager) can use eventual consistency to (re)create the tunnels.
Logging and Debugging
- For components running in the CS management server (OVS element, OVS tunnel manager, ServerResource), check vmops.log
- For components running on the hypervisor (ovstunnel plugin), log outputs by default to /var/log/ovstunnel.log. Logging on the hypervisor is now configurable, and can be tweaked by changing the appropriate configuration file.
Useful tips for debugging the OVS controller:
- Check for invocations of the prepare and release methods in com.cloud.network.element.OVSElement. They will invoke methods on the OVS tunnel manager for setting up and tearing down tunnels.
- The relevant commands classes start with the 'Ovs' prefix.
- The relevant xapi plugin is ovstunnel. However, most of the broadcast prevention code is in the ovs-vif-flows.py script, which is triggered by the hypervisor every time a VIF is plugged into or unplugged from an OVS bridge managed by the CS OVS controller. These bridges can be easily spotted by querying XenServer networks for other-config:is-ovs-tun-network.
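For instance, the tunnel networks mentioned above can be located from dom0 with the xe CLI. A minimal sketch (the filter key comes from the text above; the helper names are ours, not part of the plugin):

```python
import subprocess

def list_tunnel_networks_cmd():
    """Build the xe CLI invocation that lists XenServer networks
    flagged by the OVS controller via other-config:is-ovs-tun-network."""
    return ["xe", "network-list", "other-config:is-ovs-tun-network=True"]

def list_tunnel_networks():
    # Runs in dom0; requires the xe CLI on the path.
    return subprocess.check_output(list_tunnel_networks_cmd()).decode()
```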
The OVS tunnel manager is disabled by default, and should be explicitly enabled in the configuration.
To this end, the sdn.ovs.controller configuration flag should be set to true.
After enabling it, the management server should be restarted.
Also, a Vnet range should be configured. Vnet identifiers are used as GRE keys for tunnel networks. The network manager implementation has a check for validating the maximum vnet id. By default this maximum is 4096, unless GRE is explicitly specified as the isolation mode for the physical network on top of which the OVS controller operates.
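The vnet-id check described above can be sketched as follows. This is a hypothetical illustration, not the actual network manager code; the constant and function names are ours:

```python
# Default ceiling is 4096, unless the physical network explicitly uses
# GRE isolation, in which case the (signed) 32-bit key space is usable.
MAX_VLAN_VNET = 4096
MAX_GRE_VNET = 2**31 - 1  # Java signed int limit, see the design notes below

def max_vnet_id(isolation):
    return MAX_GRE_VNET if isolation == "GRE" else MAX_VLAN_VNET

def is_valid_vnet(vnet_id, isolation):
    # A vnet id must be positive and within the ceiling for the isolation mode.
    return 0 < vnet_id <= max_vnet_id(isolation)
```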
- XenServer 5.6FP1 or newer.
- Open vSwitch networking stack. Enable it with: echo openvswitch > /etc/xensource/network.conf
Performance and scalability considerations
The implemented full mesh topology ensures each VM can be reached with at most 1 hop across different hosts. This means that bottleneck issues which are common in star or ring topologies do not occur in this case.
We deliberately avoided using STP for loop avoidance. Instead, we prevent issues such as broadcast storms by ensuring that broadcasts arriving on ingress tunnels are not forwarded on egress tunnels.
This solution scales much better than a traditional VLAN approach, as the GRE key is a 32-bit field whereas the VLAN id is a 12-bit field. It also scales better than approaches based on Q-in-Q as there's no constraint associated with the physical topology of the data center network, such as Top-of-Rack/Core switches.
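The identifier-space comparison behind this claim is simple arithmetic:

```python
# VLAN id: 12-bit field, ids 0 and 4095 reserved -> 4094 usable networks.
# GRE key: 32-bit field -> over 4 billion distinct keys.
vlan_id_bits, gre_key_bits = 12, 32
usable_vlans = 2**vlan_id_bits - 2
gre_keys = 2**gre_key_bits
```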
The following factors however impact performance and scalability of virtual networks built using the GRE encapsulation technique:
- GRE overhead, due to encapsulation of the frame in a L3 payload with the GRE header. This overhead amounts to about 3% of the throughput when using standard frames, and to about 0.5% when using jumbo frames.
- Inability to leverage TSO (TCP segmentation offload). As L4-L7 traffic is encapsulated in L3 (GRE) envelopes, hardware TSO, which is provided by the vast majority of NICs, cannot be leveraged. This is a non-negligible scalability issue, as fragmentation is entirely performed in software. [Will add graphs and results from scalability analysis]. Adopting jumbo frames mitigates this problem.
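The ~3% and ~0.5% figures above can be sanity-checked with a back-of-envelope calculation, assuming an outer IPv4 header (20 bytes), a GRE header carrying a key (8 bytes), and the encapsulated inner Ethernet header (14 bytes):

```python
# Per-packet encapsulation overhead relative to a plain Ethernet frame.
inner_eth, outer_ip, gre_hdr = 14, 20, 8   # bytes
overhead = inner_eth + outer_ip + gre_hdr  # 42 bytes per packet
std = overhead / 1500 * 100    # ~2.8% with standard 1500-byte frames
jumbo = overhead / 9000 * 100  # ~0.5% with 9000-byte jumbo frames
```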
Preamble: I am not a security expert, nor has any security expert performed a security assessment of this feature.
In the current implementation, for each network a distinct Open vSwitch bridge is created on each host; also, each network is assigned a distinct GRE key. This means that accidental traffic snooping is avoided at the edge of the network; malicious traffic snooping would require attackers to compromise, in the XenServer case, the dom0 where the vSwitch instances are running. Malicious users cannot influence the way in which bridges are created and GRE keys assigned.
We assume the physical infrastructure to be under exclusive control of the admin of the data center. Cloudstack does not encrypt or perform any operation for protecting the traffic once it has left the host.
Architecture and Design description
Software design and architecture:
- We still use vNets. However, the vNet ID now represents a GRE key and no longer a VLAN ID. When a network is implemented, the OVS network guru allocates a vNet whose identifier will be used as the GRE key for the network. The vNet identifier is a Java integer (32 bits, like the GRE key; however, as Java does not support unsigned primitive types, we only support a theoretical maximum of 2^31-1 distinct GRE keys)
- Cloudstack configures Open vSwitch bridge instances and GRE tunnels as required: configuration occurs only when VMs are actually started on hosts.
At VM startup, during network preparation for a given NIC, the OVS network element (com.cloud.network.OvsNetworkElement) is invoked for preparing the OVS network. The OVS network element then uses the OVS tunnel manager (com.cloud.network.ovs.OvsTunnelManager) for driving bridge configuration and tunnel creation. This is achieved by dispatching commands to the hypervisor resource using Cloudstack's agent framework.
When a VM is stopped or migrated, the reverse process occurs. During the NIC release phase, the OVS network element is invoked to release the NIC from the OVS tunnel network; if the NIC being removed is the last one on a given host, existing tunnels are destroyed and the bridge is unconfigured.
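The decisions made during the prepare and release phases can be sketched as pure functions (a sketch only; the real logic lives in OvsTunnelManager and these helper names are ours):

```python
def tunnels_to_create(new_host, hosts_in_network):
    """When a network's first VM starts on new_host, tunnels must be set
    up towards every host already participating in the full mesh."""
    return [(new_host, h) for h in sorted(hosts_in_network) if h != new_host]

def tunnels_to_destroy(leaving_host, hosts_in_network):
    """When the last NIC of a network on leaving_host is released, its
    tunnels towards all remaining hosts are torn down."""
    return [(leaving_host, h) for h in sorted(hosts_in_network) if h != leaving_host]
```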
- The following commands are used by the OVS tunnel manager (all in com.cloud.api.network.ovs):
- OvsSetupBridgeCommand: Prepares an Open vSwitch bridge instance. It basically preconfigures the bridge before XenServer creates it when a VIF is actually plugged into the network. This enables us to create tunnels regardless of whether a NIC (VIF in XenServer terminology) has already been plugged into the bridge.
- OvsDestroyBridgeCommand: Removes an Open vSwitch bridge instance when the last VIF on a given host is removed.
- OvsCreateTunnelCommand: Creates a tunnel across two Open vSwitch bridge instances on different hosts using a given GRE key.
- OvsDestroyTunnelCommand: Destroys a tunnel across two hosts when the last VIF on one of the two hosts has been removed.
- OvsFetchInterfaceCommand: Retrieves IP information about the physical interface where the tunnel network is built. This data is used by the OVS tunnel manager to configure endpoints for GRE tunnels.
- Commands are then dispatched to the hypervisor resource (com.cloud.hypervisor.xen.resource). We currently support XenServer only. The resource manager interacts with the hypervisor using the xenAPI interface. The ovstunnel plugin provides the routines for setting up bridges and tunnels.
- The ovstunnel plugin provides the following functions:
- Setting up the OVS bridge. This involves creating the bridge, 'enabling' the network on a given host, and noting down that the OVS network has been configured on a given host in order to not perform the same sequence of operations more than once for each host.
- Creating GRE tunnels, and ensuring that ingress broadcast/multicast traffic is dropped by default in order to avoid it being propagated to egress tunnels thus potentially causing broadcast storms
- Destroying tunnels and bridges
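The plugin operations listed above ultimately boil down to ovs-vsctl invocations. A hedged sketch of how those commands could be built (the command shapes are standard ovs-vsctl usage; the helper names are ours, not the plugin's):

```python
def setup_bridge_cmd(bridge):
    # Idempotent bridge creation: --may-exist avoids errors on re-runs,
    # matching the "configure only once per host" behaviour described above.
    return ["ovs-vsctl", "--may-exist", "add-br", bridge]

def create_tunnel_cmd(bridge, port, remote_ip, gre_key):
    # Add a GRE port keyed with the network's vNet identifier.
    return ["ovs-vsctl", "add-port", bridge, port, "--",
            "set", "interface", port, "type=gre",
            "options:remote_ip=%s" % remote_ip,
            "options:key=%s" % gre_key]

def destroy_bridge_cmd(bridge):
    return ["ovs-vsctl", "--if-exists", "del-br", bridge]
```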
Network design and architecture:
- The OVS tunnel manager implements a full mesh topology across hosts in a given zone. The tunnel mesh actually connects only hosts where VMs are deployed for a given network. For instance, in a 2000-host zone, with a network whose VMs are deployed only on 5 different hosts, only 10 tunnels will be configured, with 4 tunnels coming out of each OVS bridge. In the worst case, with a VM on each of the 2000 hosts, the network will have 1,999,000 GRE tunnels, with 1,999 tunnels coming out of each OVS bridge. The number of GRE tunnels per bridge does not impact performance and scalability of the virtual network. The full mesh topology was preferred over other loop-free topologies, such as stars or rings, because it is the only topology which ensures the destination VM can be reached with only 1 GRE hop. Star topologies, for instance, require 2 hops in many cases, and have a serious bottleneck in the centre of the star.
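The tunnel counts quoted above follow directly from the full-mesh structure (one tunnel per unordered host pair):

```python
def mesh_tunnels(n_hosts):
    # Full mesh: n * (n - 1) / 2 unordered host pairs.
    return n_hosts * (n_hosts - 1) // 2

def tunnels_per_bridge(n_hosts):
    # Each bridge has one tunnel towards every other participating host.
    return n_hosts - 1
```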
- A distinct bridge is created for each virtual network. This automatically ensures logical traffic isolation within a single host; GRE keys are used for separating inter-host traffic belonging to different virtual networks.
- As the full-mesh topology is not loop-free we had to worry about broadcast storms, and evaluated three alternatives:
- STP: we preferred not to use Open vSwitch's builtin STP support. STP is not recommended in highly dynamic networks, such as the ones we build with the OVS tunnel manager.
- Separate loop-free topology for broadcast/multicast traffic: this could be achieved either by having a dedicated set of GRE tunnels defining a loop-free topology for broadcast/multicast traffic, or by identifying a loop-free subset of the fully-meshed topology. We rejected the second approach as it looked very similar to STP; the former approach, though interesting, has the downside of requiring fairly complex management on the OVS bridges for forwarding bcast/mcast traffic on a specific GRE tunnel; also, bcast/mcast traffic would still suffer from the above-mentioned issues regarding bottlenecks and multiple hops.
- Explicit suppression of broadcast/multicast propagation. This is the approach which has been implemented. It leverages the nature of the full mesh topology, which guarantees each possible destination VM can be reached within a single hop. For this reason the network has been configured to guarantee that ingress broadcasts on a GRE tunnel are never replicated on an egress tunnel. This has been enforced in the following way:
- Baseline behaviour is to suppress all broadcast/multicast ingress traffic on GRE tunnels.
- Ingress bcast/mcast traffic on GRE tunnels is then explicitly forwarded only on ports where VM instances are connected
- Finally, bcast/mcast traffic generated by VM instances is allowed on egress GRE tunnels
- NOTE: per-VIF rules are not enforced by Cloudstack directly, but through udev rules. This was preferred over xapi event listeners in order to reduce the amount of traffic between the management server and the hypervisor.
- All IPv6 traffic is currently suppressed in order to avoid broadcast storms from v6 protocols such as the neighbour discovery protocol.
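The three-step policy above maps naturally onto OpenFlow rules. An illustrative sketch in ovs-ofctl flow syntax (the multicast/broadcast bit match is standard OVS usage; port numbers, priorities and helper names are examples, not the actual values used by ovs-vif-flows.py):

```python
# Match any frame whose destination MAC has the multicast/broadcast bit set.
MCAST_MATCH = "dl_dst=01:00:00:00:00:00/01:00:00:00:00:00"

def suppress_ingress_bcast(tunnel_port):
    # Baseline: drop bcast/mcast arriving on a GRE tunnel port, so it can
    # never be flooded back out on egress tunnels (no broadcast storms).
    return "priority=100,in_port=%d,%s,actions=drop" % (tunnel_port, MCAST_MATCH)

def forward_to_vifs(tunnel_port, vif_ports):
    # Exception: deliver tunnel bcast/mcast to local VM ports only.
    actions = ",".join("output:%d" % p for p in vif_ports)
    return "priority=200,in_port=%d,%s,actions=%s" % (tunnel_port, MCAST_MATCH, actions)
```

Each string would be installed with ovs-ofctl add-flow on the network's bridge; VM-generated broadcasts hit neither rule and follow the bridge's normal forwarding, so they reach egress tunnels as required.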
No architectural pattern has been added. Bridge and tunnel configuration has been inserted into the network preparation/release mechanisms already provided by Cloudstack; command dispatching is still performed through Cloudstack's agent framework; commands are executed on hosts using xenapi plugins, a mechanism already widely adopted by Cloudstack.
Several alternatives were considered:
- Creating a sort of framework for plugging in alternative network managers, possibly provided by 3rd parties, adding a 'service provider' for basic network connectivity, and then having OVS tunnel networks as a network offering. This is interesting but rather orthogonal to the subject of this FS. We are separately working on it;
- Using programmatic interfaces exposed by Open vSwitch (OvsDB and Openflow) for manipulating virtual networks. This would have probably extended the timeline for the implementation of this feature, and therefore we decided to stay with xenapi plugins for this release. We are currently working with the XenServer engineering team on exposing the above mentioned interfaces in the hypervisor layer.
CREATE TABLE `cloud`.`ovs_tunnel_interface` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`ip` varchar(16) DEFAULT NULL,
`netmask` varchar(16) DEFAULT NULL,
`mac` varchar(18) DEFAULT NULL,
`host_id` bigint(20) DEFAULT NULL,
`label` varchar(45) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8;
INSERT INTO `cloud`.`ovs_tunnel_interface` (`ip`, `netmask`, `mac`, `host_id`, `label`) VALUES ('0', '0', '0', 0, 'lock');
CREATE TABLE `cloud`.`ovs_tunnel_network`(
`id` bigint unsigned NOT NULL UNIQUE AUTO_INCREMENT,
`from` bigint unsigned COMMENT 'from host id',
`to` bigint unsigned COMMENT 'to host id',
`network_id` bigint unsigned COMMENT 'network identifier',
`key` int unsigned COMMENT 'gre key',
`port_name` varchar(32) COMMENT 'in port on open vswitch',
`state` varchar(16) default 'FAILED' COMMENT 'result of tunnel creation',
PRIMARY KEY(`from`, `to`, `network_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `cloud`.`ovs_tunnel_network` (`from`, `to`, `network_id`, `key`, `port_name`, `state`) VALUES (0, 0, 0, 0, 'lock', 'SUCCESS');
Code is available in the salvatore-ovs-tunnel-mgr branch on git.cloud.com
NOTE: A prototype for this feature was originally developed by Chiradeep; the proposed implementation has been developed starting from that prototype. Not all code pertaining to that prototype has been removed, even though it is no longer in use. When looking at the branch on the git server, please disregard all classes pertaining to OVS not included in the list above.
Web services APIs
No changes introduced to the API
Exactly the same as the flow for starting a VM instance.