
This is a summary of possible improvements on the OVS tunnel manager.

Please note: Enhancements are not listed in order of priority or importance. 

  • Prevent transient storms caused by unicast floods
    To be completed according to findings from Darragh O'Reilly.
  • KVM support
    • The Open vSwitch tunnel manager currently supports XenServer only.
      It is important that OVS tunnel manager functionality is extended to KVM as well.
      On the KVM side, this will imply:
      1. Updating the Agent to implement the commands for setting up OVS bridges and creating GRE tunnels (see the sketch after this section).
      2. Reusing as much as possible of the scripts developed for XenServer for manipulating the OVS layout and flow tables. In theory a good amount of code should be reusable, even though it will probably need to be duplicated in different places of the Cloudstack source tree anyway.
      3. Investigating whether the udev rules used for setting up broadcast prevention can be reused, or whether an alternative mechanism is required.

On the Cloudstack management server, we will need to make sure that XenServer-specific code, such as the code for creating or finding a network, is either adapted to be hypervisor-agnostic, or refactored into a hypervisor-agnostic interface with hypervisor-specific drivers.
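
A minimal sketch of what the KVM-side tunnel setup could look like, assuming the agent shells out to the standard ovs-vsctl utility, much as the XenServer plugin scripts do today; the bridge name, port name and remote_ip/key values are illustrative placeholders, not the actual agent command interface:

    # Sketch of KVM agent helpers for setting up an OVS bridge and a GRE tunnel.
    import subprocess

    def setup_bridge(bridge):
        # Create the per-network OVS bridge if it does not exist yet.
        subprocess.check_call(["ovs-vsctl", "--may-exist", "add-br", bridge])

    def create_gre_tunnel(bridge, port, remote_ip, key):
        # Add a GRE port towards the remote host; the GRE key carries the
        # network identifier so traffic of different networks stays separate.
        subprocess.check_call([
            "ovs-vsctl", "--may-exist", "add-port", bridge, port,
            "--", "set", "interface", port, "type=gre",
            "options:remote_ip=%s" % remote_ip, "options:key=%s" % key])

    # Example (placeholder names): bridge for network 100, tunnel to 192.168.0.12
    # setup_bridge("OVSTunnel100")
    # create_gre_tunnel("OVSTunnel100", "t100-12", "192.168.0.12", 100)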

  • IPv6 support
    • IPv6 traffic is currently blocked across OVS overlay networks. As IPv6 support for guest networks is likely to become available in Apache Cloudstack in the near future, it is important to unblock IPv6 traffic.
      The current implementation blocks it as a precautionary measure. Before restoring it, we must make sure that, for every IPv6 protocol which sends multicast messages, the corresponding multicast addresses are properly included in the broadcast prevention rules.
  • Basic connectivity as-a-service
    • COMING SOON
  • Reduce broadcast traffic removing ARP and DHCP traffic from the physical network
    • The overlay network creates a virtual layer-2 broadcast domain spanning the hosts where the VMs of a given network are deployed. This can result in a fairly large virtual layer-2 broadcast domain.
      In order to improve the performance of the network, it is vital to minimize the impact of broadcasts on network throughput.
      To this end, Cloudstack's knowledge of the cloud topology can be used to reduce the amount of broadcast traffic in the following ways:
      • DHCP traffic - DHCP requests could be redirected to a local tap port connected to a dnsmasq process populated and updated by Cloudstack. DHCP traffic could then be squelched over the tunnels. Another benefit is that the initial IP configuration of the NICs could happen independently of the state of the tunnel mesh.
      • ARP traffic - A similar approach can be adopted for ARP broadcasts, which can be redirected to a tap port connected to a process that reads ARP requests and sends ARP replies using information from a cache maintained by Cloudstack itself.

While this clearly reduces the amount of broadcast traffic on the network, it increases the management burden on Cloudstack. It is vital that entries are added to and invalidated from this cache appropriately. While invalidation should always occur on events such as VM stop, VM pause, and VM migration, there are several strategies for populating the cache, for instance:

      1. Cloudstack could add entries to the cache as soon as the deployment plan for a VM is known.
      2. Upon a miss, the cache might send a request to the Cloudstack management server, which will provide the requested information (e.g. the MAC address corresponding to a given IP). A sketch of such a cache follows.
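
The following is a minimal sketch of the IP-to-MAC cache that could back the local ARP responder, covering both population strategies above. The responder process itself (reading ARP requests from the tap port and answering them) is not shown, and query_management_server is a hypothetical lookup call, not an existing Cloudstack API:

    # Sketch of the Cloudstack-maintained ARP cache used by the local responder.

    def query_management_server(ip):
        # Hypothetical placeholder for a lookup over the agent command channel;
        # returns None when the IP is unknown.
        return None

    class ArpCache(object):
        def __init__(self):
            self.ip_to_mac = {}

        def add(self, ip, mac):
            # Strategy 1: the management server pushes the entry as soon as
            # the deployment plan for the VM is known.
            self.ip_to_mac[ip] = mac

        def invalidate(self, ip):
            # Must be called on VM stop, pause and migration.
            self.ip_to_mac.pop(ip, None)

        def lookup(self, ip):
            # Strategy 2: on a miss, ask the management server for the mapping.
            mac = self.ip_to_mac.get(ip)
            if mac is None:
                mac = query_management_server(ip)
                if mac is not None:
                    self.ip_to_mac[ip] = mac
            return mac
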
  • Scalable state management for overlay networks
    Despite overcoming the limitations of VLANs, the overlay network approach still poses scalability challenges, albeit of a different nature.
    The most important issues are:
    1. Managing the state of the overlay networks, in terms of hosts, endpoints, active tunnels, as well as the rules for enforcing a correct behaviour of the network (e.g. broadcast storm prevention)
    2. Ensuring the overlay network scales even when the number of hosts it spans, and the number of VMs communicating over it, increase

In the current implementation the overall number of tunnels is a significant limit to the scalability of the overlay network, because information about each tunnel needs to be stored in the management server database. The number of tunnels per tenant is also a concern, as we need to update this part of the state every time a VM is started/migrated/resumed on a 'new' host, or the last VM on a given host is stopped/paused/moved/terminated. There is also large room for improvement in the way state updates are handled, as we currently create tunnels serially and wait for a response from the host for each tunnel.

The actual number of tunnels is given by SUM(i=1..m) n_i * (n_i - 1), where m is the number of networks and n_i is the number of hosts where VMs of the i-th network are deployed. We could model the total number of tunnels as a statistical distribution, but that would be very complex. The worst case is when the VMs of each network are spread across all hosts: for instance, with 7K hosts this gives about 50M tunnels per network. In the best case, when each network is confined to a single host, there are no tunnels at all.
The numbers can look quite scary; the current database structure is, after all, very simple. It is however useful, as it allows us to avoid creating a tunnel between two hosts if one is already in place. The size of the state currently grows linearly with the number of networks (tenants), and quadratically with the number of hosts where the VMs of a given network are deployed.
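
As a sanity check, the totals in the table below can be reproduced with a couple of lines of Python, assuming (as the examples do) that every network spans the same number of hosts:

    # Total number of GRE tunnels with one mesh per network:
    # sum over networks i of n_i * (n_i - 1).
    def tunnels_per_tenant_mesh(networks, hosts_per_network):
        return networks * hosts_per_network * (hosts_per_network - 1)

    print(tunnels_per_tenant_mesh(1000, 10))   # 90,000
    print(tunnels_per_tenant_mesh(50, 2000))   # 199,900,000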

Here are some examples:

Hosts | Networks | Hosts/Network | Tunnels     | Tunnels/host
500   | 1000     | 10            | 90,000      | 180
5000  | 10000    | 10            | 900,000     | 180
500   | 100      | 100           | 990,000     | 1,980
5000  | 50       | 2000          | 199,900,000 | 39,980

So things obviously get much worse as networks are spread over a larger number of hosts. This is typical of scenarios with fewer, larger tenants.
A tradeoff that might be considered is whether to have a single mesh for all networks, distinguishing each tenant's traffic within the mesh, or a distinct mesh for each tenant, as we do today.

  • Numeric examples:

    Hosts | Networks | Hosts/Network | Tunnels (per-tenant mesh) | Tunnels (single mesh)
    500   | 1000     | 10            | 90,000                    | 249,500
    5000  | 10000    | 10            | 900,000                   | 24,995,000
    500   | 100      | 100           | 990,000                   | 249,500
    5000  | 50       | 2000          | 199,900,000               | 24,995,000

    It is interesting that the single mesh works out better in scenarios with fewer but larger tenants. Distinguishing each tenant's traffic within the mesh is, however, a non-negligible problem: VLAN tags can be used, as long as the number of tenants within each mesh is below 4K.
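
For comparison, the single-mesh column in the table above is simply the full mesh over all hosts, regardless of how tenants are spread; a short sketch consistent with the figures above:

    # Total number of GRE tunnels with a single mesh spanning all hosts,
    # with tenants distinguished inside the mesh (e.g. by VLAN tag).
    def tunnels_single_mesh(hosts):
        return hosts * (hosts - 1)

    print(tunnels_single_mesh(500))    # 249,500
    print(tunnels_single_mesh(5000))   # 24,995,000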


In any case, a better way of storing the mesh state is certainly required. For instance, we might have an entry for each host, with a field describing which tunnels departing from that host are not working or are not yet confirmed to be working; a sketch of such a record follows. This alone would reduce the size of the state to manage by an order of magnitude.
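
Under the assumption stated above, such a per-host record could look like the following sketch (this is not an existing Cloudstack schema; names are illustrative): one record per host and network, tracking only the exceptions rather than one row per tunnel.

    # Sketch of a per-host, per-network state record listing only the
    # tunnels that are broken or not yet confirmed to be working.
    class HostTunnelState(object):
        def __init__(self, host_id, network_id):
            self.host_id = host_id
            self.network_id = network_id
            self.pending_peers = set()   # tunnels not yet confirmed working
            self.failed_peers = set()    # tunnels confirmed broken

        def mark_confirmed(self, peer_host_id):
            self.pending_peers.discard(peer_host_id)
            self.failed_peers.discard(peer_host_id)

        def mark_failed(self, peer_host_id):
            self.pending_peers.discard(peer_host_id)
            self.failed_peers.add(peer_host_id)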


The number of flow table rules might become another issue, but at this time we have no data points on their impact on throughput. With the current implementation, each vswitch bridge has 4 rules for each tunnel and 2 rules for each VIF (see the short computation after the table below).

Here are some example numbers:

Hosts/Network | VMs/Host | Total flow table entries
10            | 2        | 4*9 + 2*2 = 40
20            | 2        | 4*19 + 2*2 = 80
10            | 4        | 4*9 + 2*4 = 44
20            | 4        | 4*19 + 2*4 = 84
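
The table entries above follow directly from the per-tunnel and per-VIF rule counts; a trivial check in Python:

    # Flow table entries per bridge: 4 rules per tunnel (one tunnel towards
    # each of the other hosts in the network) plus 2 rules per local VIF.
    def flow_entries(hosts_per_network, vms_per_host):
        tunnels = hosts_per_network - 1
        return 4 * tunnels + 2 * vms_per_host

    print(flow_entries(10, 2))   # 40
    print(flow_entries(20, 4))   # 84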


  • Enable Security Groups
    • There are several issues with the current implementation of security groups over OVS networks, mainly concerning the inability of OVS to forward packets to netfilter for processing. Apart from that, it would be better from an architectural point of view to implement security groups within the OVS flow table.
      The most important issue here is that OVS does not allow port ranges to be specified in flow table entries (CIDRs can be used for network addresses). This means that rules which apply to port ranges need to be translated into multiple flow table entries.
      This has two consequences: i) abstract rules and their mapping to OVS flow entries must be managed; ii) a potentially very high number of flow table entries, which can cause a significant number of user-mode/kernel-mode context switches due to the limited size of the kernel-level flow table. At the moment we have neither a solution to the management problem nor an estimate of the performance impact of the increased number of rules in the flow table. To cut a long story short, we probably need a prototype implementation to determine whether an OVS-based approach to security groups is feasible.
      It is worth noting that the latest release of Open vSwitch introduces bitmask-like mechanisms for specifying the source and destination port attributes, which significantly reduces the total number of rules needed (a sketch of how a port range maps to masked matches follows).
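
As an illustration of how masked port matching keeps the rule count down, the sketch below expands a TCP/UDP port range into value/mask pairs using the standard prefix-cover technique; the exact flow syntax used to install such matches is not shown and would depend on the OVS version:

    # Cover the port range [lo, hi] with (value, mask) pairs on 16-bit ports,
    # so that a range maps to a handful of masked matches instead of one
    # flow entry per port.
    def range_to_masks(lo, hi):
        out = []
        while lo <= hi:
            # Largest power-of-two aligned block starting at 'lo' that fits.
            size = lo & -lo if lo else 1 << 16
            while lo + size - 1 > hi:
                size >>= 1
            out.append((lo, 0xffff & ~(size - 1)))
            lo += size
        return out

    # Example: ports 1000-1999 collapse into 7 masked matches
    # rather than 1000 individual flow entries.
    print(range_to_masks(1000, 1999))
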
  • Bypass the virtual router
    • In a nutshell, this enhancement is aimed at allowing inter-subnet routing without going through the virtual router. Once routing rules have been configured through the API, the OVS flow table can be programmed to send packets to the appropriate destination host, even if that host is in a different tunnel mesh (virtual subnet). A sketch of what such a flow could look like follows.
      For more information about inter-network routing, a feature which is currently being implemented on the VLAN backend, please have a look at the following specification from Alena: Inter-VLAN Routing functional spec
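
Purely as an illustration of the idea (not an implemented design), such a flow could match traffic addressed to the virtual router and destined for the remote subnet, rewrite the MAC addresses as a router would, and output straight onto the tunnel port towards the destination host; the bridge name, port number and MAC addresses below are placeholders:

    # Sketch: program an OVS flow that bypasses the virtual router for
    # traffic towards a remote subnet, using the ovs-ofctl CLI.
    import subprocess

    def add_routed_flow(bridge, dst_cidr, router_mac, dst_vm_mac, tunnel_port):
        flow = ("ip,nw_dst=%s,dl_dst=%s,"
                "actions=mod_dl_src:%s,mod_dl_dst:%s,dec_ttl,output:%d"
                % (dst_cidr, router_mac, router_mac, dst_vm_mac, tunnel_port))
        subprocess.check_call(["ovs-ofctl", "add-flow", bridge, flow])

    # add_routed_flow("OVSTunnel100", "10.1.2.0/24",
    #                 "02:00:00:00:00:01", "02:00:00:00:02:0a", 5)
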
  • Consider alternatives to GRE overlays
    • GRE does not scale particularly well as the traffic directed to the physical tunnel interface increases. Apart from the performance overhead due to the GRE header, which amounts to about 3% (see the quick check below), the real issue is that with GRE encapsulation it is not possible to leverage TCP segmentation offload. This means that GRE packets are segmented in software rather than in hardware. This affects the overall throughput, which decreases as the amount of traffic sent over the tunnel increases. Please see the attached documents (GRE-overhead-analysis.docx, GRE-overhead-analysis-no-TSO.docx) for more information.

In order to mitigate this issue, Jumbo Frames can be enabled. However, this should be considered carefully due to possible interoperability issues. Ideally, a different technique for implementing overlays might be considered. The latest Open vSwitch release supports STT (Stateless Transport Tunnelling, IETF proposal available here: http://tools.ietf.org/html/draft-davie-stt-01), which is a viable alternative to GRE. Among the other alternatives, VxLAN (IETF proposal: http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-00) is definitely worth considering. As of today VxLAN is not yet available for Open vSwitch, so the adoption of this protocol is subject to it being supported in OVS. An alternative would be building overlays using Cisco N1kV, which apparently will soon be available for KVM (it is currently being demoed at CLUS running on KVM with Openstack Quantum) and probably for XenServer as well. Even if OVS, being open source and multi-layer, remains our first choice, Cisco Nexus 1000v should be followed closely as well, as it might provide an easy way of building overlay networks across all the hypervisors supported by Apache Cloudstack.
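
A rough back-of-the-envelope check of the ~3% header overhead figure mentioned above, assuming a 1500-byte MTU, a 14-byte outer Ethernet header, a 20-byte outer IP header and an 8-byte GRE header (base header plus the key field carrying the network id):

    # Approximate per-packet overhead of GRE encapsulation at a 1500-byte MTU.
    MTU = 1500
    OUTER_ETH = 14
    OUTER_IP = 20
    GRE_HEADER = 8   # 4-byte base header + 4-byte key

    overhead = float(OUTER_ETH + OUTER_IP + GRE_HEADER) / MTU
    print("GRE encapsulation overhead: %.1f%%" % (overhead * 100))   # ~2.8%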

  • Implement controller model
    • The Cloudstack management server currently drives Open vSwitch like any other resource; to be more precise, it currently handles Open vSwitch as part of the "XenServer" resource, but as discussed previously in this document, it could also handle it as a separate resource.
      Open vSwitch is currently driven by a XenAPI plugin which uses CLI utilities to configure it. A better architectural solution would be for the management server to drive Open vSwitch through its native interfaces, namely the OpenFlow interface and the OVSDB interface. While the former can manipulate the flow table, for instance to configure the broadcast storm prevention rules, the latter governs the layout of the switch and could therefore be used to set up the tunnel mesh as well as the bridge configuration (see the sketch below).
      However, at the moment this is a feature that is not likely to be implemented in the short or medium term.
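
As a hint of what this could look like, the sketch below points an OVS instance at external OVSDB and OpenFlow endpoints using the standard ovs-vsctl utility; the bridge name and the controller/manager addresses are placeholders, and which component would actually terminate those connections is an open design question:

    # Sketch: register an OVSDB manager (switch layout, bridges, tunnel ports)
    # and an OpenFlow controller (flow table manipulation) for a bridge.
    import subprocess

    def attach_control_interfaces(bridge, ovsdb_target, openflow_target):
        subprocess.check_call(["ovs-vsctl", "set-manager", ovsdb_target])
        subprocess.check_call(["ovs-vsctl", "set-controller", bridge, openflow_target])

    # attach_control_interfaces("OVSTunnel100",
    #                           "tcp:192.168.0.2:6640",    # OVSDB
    #                           "tcp:192.168.0.2:6633")    # OpenFlow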