The purpose of this document is to describe the current redundant virtual router (RvR) mechanism and the design choices behind it.

Basic Mechanism

  1. RvR is built with keepalived and conntrackd running in the two virtual machines. Keepalived is responsible for implementing the VRRP protocol; conntrackd is responsible for syncing connection state (e.g. TCP) across the different instances of the VR, doing its best to make failover seamless.
  2. RvRs maintain their own state entirely, without intervention from the mgmt server. The basic mechanism is as follows:
    1. The VR in MASTER state sends out a broadcast message (VRRP advertisement) every second, announcing the existence of the MASTER router. Conntrackd also broadcasts the latest connection state.
    2. The VR in BACKUP state listens for these broadcasts and updates its connection state accordingly.
    3. Once the VR in BACKUP state has not received a VRRP advertisement for 3 seconds, it assumes the MASTER VR is dead and switches to MASTER state.
  3. When a VR switches to MASTER, it (a sketch of this transition follows the list):
    1. Enables all the public interfaces.
    2. Sends gratuitous ARP to the public gateway to update its ARP cache.
    3. Starts the password, dnsmasq and VPN services.
    4. Updates the conntrackd state to "primary".
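
  The failover behaviour above can be summarised with a minimal Python sketch. This is illustrative only: the real logic lives inside keepalived and the VR's transition scripts, and every helper name below (enable_public_interfaces, send_gratuitous_arp, start_services, set_conntrackd_primary) is a hypothetical placeholder.

    import time

    ADVERT_INTERVAL = 1       # MASTER sends a VRRP advertisement every second
    MASTER_DOWN_TIMEOUT = 3   # BACKUP assumes MASTER is dead after 3 seconds of silence

    # Placeholder helpers standing in for the VR's real transition scripts.
    def enable_public_interfaces():
        print("enabling public interfaces")

    def send_gratuitous_arp():
        print("sending gratuitous ARP to the public gateway")

    def start_services(names):
        print("starting services:", ", ".join(names))

    def set_conntrackd_primary():
        print("switching conntrackd to primary mode")

    def become_master():
        # Steps 3.1 - 3.4 from the list above.
        enable_public_interfaces()
        send_gratuitous_arp()
        start_services(["password", "dnsmasq", "vpn"])
        set_conntrackd_primary()

    def backup_loop(wait_for_advertisement):
        """wait_for_advertisement(timeout) should return True when a VRRP
        advertisement from the MASTER arrives within `timeout` seconds."""
        last_seen = time.monotonic()
        while True:
            if wait_for_advertisement(ADVERT_INTERVAL):
                last_seen = time.monotonic()              # MASTER is alive
            elif time.monotonic() - last_seen > MASTER_DOWN_TIMEOUT:
                become_master()                           # 3 seconds of silence: take over
                return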

Target

  • We're trying to cover the following scenarios:
    1. One VR is down
    2. One VR loses network connectivity (permanently or temporarily)
    3. The network storage used by one VR is corrupted
    4. The host running one VR is down (permanently or temporarily)
    5. Control network down, e.g. lost connection to the host running the VR (permanently or temporarily)
    6. Guest network down (permanently or temporarily)
  • The most challenging parts are the temporary failures, since VRRP was originally designed for physical hardware, which most likely cannot come back online with stale data once it breaks.

CloudStack Design

Priority of keepalived process

  1. Each keepalived process comes up with a priority (1 ~ 255). If two different VRs both claim to be MASTER on the network, the VR with the lower priority yields (see the comparison sketch below).
  2. To prevent a dual-MASTER state, CloudStack assigns each running VR a different priority value.
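
  For illustration, the election rule reduces to a simple comparison. This is a toy sketch, not CloudStack code, and the router names and priority values are made up:

    def elect_master(priorities):
        """Given {router_name: priority}, return the router that keeps the
        MASTER role: VRRP resolves a dual-MASTER situation in favour of the
        higher priority."""
        return max(priorities, key=priorities.get)

    # CloudStack assigns each router of the pair a distinct priority,
    # so this comparison can never tie. Example values are invented:
    print(elect_master({"r-100-VM": 100, "r-101-VM": 99}))   # -> r-100-VM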

Rule Programming

  • If CS wants to program a new rule to a VR (e.g. firewall rules) but finds that the host running the MASTER VR is down and it can only program the BACKUP VR, it would (see the sketch after this list):
    1. Program the BACKUP router with the latest rule.
    2. Bump the BACKUP VR's priority above the MASTER's priority, so that the newly programmed VR becomes MASTER. This prevents the old MASTER VR from coming back with stale data.
    3. Set the old MASTER VR's stop_pending flag to true, so that whenever the mgmt server resumes its connection to the VR (or to the host the VR is running on), it stops that VR ASAP and expects the admin to fix the issue.
    4. Note: an additional problem here is that the priority can only be bumped once. So when both MASTER and BACKUP have had their priority bumped, we automatically reboot the BACKUP router and program it with a new priority value (which is not bumped), restoring its ability to deal with the situation above. But if the VRs are rebooted many times (40 times, in fact), the priority probably runs into the boundary of its value range (1 ~ 255). At that point we ask the admin to restart the network with cleanup=true, because we assume that under normal conditions a router would not be rebooted so often.
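
  The bookkeeping behind this bump-and-reassign scheme can be sketched as follows. It is a toy model: the exact priority steps CloudStack uses are not shown, only the two rules that a priority may be bumped once and that every reassignment consumes head-room in the 1 ~ 255 range.

    MAX_PRIORITY = 255   # VRRP priorities must stay within 1-255

    class RouterPriority:
        """Toy model of the priority bookkeeping described above."""

        def __init__(self, base):
            self.value = base
            self.bumped = False

        def bump_above(self, peer_priority):
            # Raise this router above its peer so it wins the MASTER election.
            # A priority may only be bumped once.
            if self.bumped:
                raise RuntimeError("already bumped once; reboot and reassign instead")
            self.value = peer_priority + 1
            self.bumped = True

        def reassign_after_reboot(self, new_base):
            # After the automatic reboot the router gets a fresh, un-bumped value.
            if new_base > MAX_PRIORITY:
                # No head-room left: ask the admin to restart the network
                # with cleanup=true.
                raise RuntimeError("priority range exhausted; restart network with cleanup=true")
            self.value = new_base
            self.bumped = False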

Keepalived process failure detection

  • All the scripts and logs live in a ramdisk created specifically for keepalived. This mitigates the case where the network goes down, since the VR's storage is usually on the network (NFS or others); as long as the VR is up, the log and the timestamp file in the ramdisk remain available.
  • When keepalived is running, it writes to a timestamp file every 10 seconds, and a cron job calls check_heartbeat.sh to check whether keepalived is dead. Using "ps" to list processes is not reliable: sometimes, after storage comes back online, keepalived is no longer working but is still listed in the process list, trapped in some abnormal state. check_heartbeat.sh is supposed to run every 60 seconds (the cron job may be delayed) and checks whether the timestamp has advanced by at least 30 seconds since the last check (it should have advanced by 60 seconds or more). If it has not, we determine that the keepalived process is dead, kill it, shut down the services, and transfer the router to FAULT state. A sketch of the check follows.
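
  The check can be rendered in Python roughly as below. The file path is an assumption made for this sketch; the real check is done by check_heartbeat.sh inside the keepalived ramdisk, which also performs the shutdown itself.

    # Rough, illustrative rendering of the heartbeat check.
    HEARTBEAT_FILE = "/ramdisk/keepalived.ts"   # hypothetical timestamp file path
    MIN_ADVANCE = 30                            # expected advance between checks (seconds)

    def read_heartbeat():
        with open(HEARTBEAT_FILE) as f:
            return float(f.read().strip())

    def keepalived_alive(previous_heartbeat):
        """True if keepalived kept updating its timestamp since the last check.
        'ps' alone is not enough: a hung keepalived still shows up in the
        process list but stops touching the timestamp file."""
        return read_heartbeat() - previous_heartbeat >= MIN_ADVANCE

    def check_once(previous_heartbeat):
        if not keepalived_alive(previous_heartbeat):
            # Presume keepalived is dead: kill it, stop the services and
            # move the router into FAULT state (done by the real script).
            print("keepalived heartbeat stalled: transitioning to FAULT")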

Status update

  • The mgmt server polls the RvR's state every 30 seconds by default. The polling is designed as a producer/consumer model; by default, 10 threads are dedicated to this job. All of these parameters can be modified in the global configuration. A sketch of the model follows.
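
  The producer/consumer model can be sketched like this. It is illustrative only: the real implementation lives in the mgmt server, and fetch_router_state() is a hypothetical stand-in for the actual state check.

    import queue
    import threading
    import time

    POLL_INTERVAL = 30    # seconds between polling rounds (configurable)
    WORKER_COUNT = 10     # dedicated consumer threads (configurable)

    work = queue.Queue()

    def fetch_router_state(router_id):
        # Hypothetical placeholder for asking a VR which state it is in.
        return "MASTER" if router_id % 2 == 0 else "BACKUP"

    def producer(router_ids):
        # Periodically enqueue every redundant router for a state check.
        while True:
            for router_id in router_ids:
                work.put(router_id)
            time.sleep(POLL_INTERVAL)

    def consumer():
        # Worker thread: take a router off the queue and record its state.
        while True:
            router_id = work.get()
            state = fetch_router_state(router_id)
            print(f"router {router_id} reports {state}")
            work.task_done()

    def start(router_ids):
        for _ in range(WORKER_COUNT):
            threading.Thread(target=consumer, daemon=True).start()
        threading.Thread(target=producer, args=(router_ids,), daemon=True).start()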

Redundant router recovery

  • If one of the redundant routers is completely broken (e.g. due to corrupted storage after a network outage), it can be destroyed and recreated (see the sketch after this list):
    1. Stop the router (if the mgmt server is unable to connect to the host, use "force=true"; otherwise use "force=false").
    2. Destroy the router.
    3. Restart the network with cleanup=false (the other router must be present). CS would create another VR on (likely) another host.
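
  The sequence can be sketched against the CloudStack API as below. The call_api() helper is hypothetical, and the command names (stopRouter, destroyRouter, restartNetwork) and their parameters should be verified against the API reference of the CloudStack version in use.

    def call_api(command, **params):
        # Placeholder for an authenticated CloudStack API request.
        print(command, params)

    def recover_broken_router(router_id, network_id, host_reachable):
        # 1. Stop the router; force the stop only when the host is unreachable.
        call_api("stopRouter", id=router_id, forced=not host_reachable)
        # 2. Destroy the broken router.
        call_api("destroyRouter", id=router_id)
        # 3. Restart the network without cleanup so the surviving router stays
        #    and a fresh peer is created on (likely) another host.
        call_api("restartNetwork", id=network_id, cleanup=False)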

Deployment for RvR

  • The mgmt server tries to deploy the two VRs on physical resources that are as far apart as possible. It first tries a different pod, then a different cluster, different storage, and a different host; only when none of these conditions can be met does it deploy both of them on the same host. A sketch of this preference order follows.
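
  The preference order can be sketched as follows. The host model here (dicts with pod / cluster / storage / name keys) is a simplification invented for the sketch; the real deployment planner in the mgmt server considers far more constraints.

    def pick_host_for_second_vr(first_vr_host, candidate_hosts):
        """Prefer the candidate that differs from the first VR's host at the
        widest level, falling back step by step until even the same host is
        acceptable."""
        preferences = [
            lambda h: h["pod"] != first_vr_host["pod"],          # different pod
            lambda h: h["cluster"] != first_vr_host["cluster"],  # different cluster
            lambda h: h["storage"] != first_vr_host["storage"],  # different storage
            lambda h: h["name"] != first_vr_host["name"],        # different host
            lambda h: True,                                      # same host as a last resort
        ]
        for prefer in preferences:
            for host in candidate_hosts:
                if prefer(host):
                    return host
        return None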