Introduction

Purpose

CloudStack (CS) currently supports volume snapshots, an EC2-like feature aimed at public clouds.

It addresses problems such as 'what if my volume is lost or corrupted, or my primary storage suffers an unrecoverable failure?' In other words, it is closer to a backup solution. Backup and restore take a considerable amount of time, especially for the large volumes that customers tend to favor.

There is a growing need for VM snapshots, like those offered by Xenserver and VMware ESXi.

VM snapshots address requirements such as 'I want to save everything right now so that I can revert to it later, and both operations should complete within seconds'. They are mainly used in private clouds.

References

Document History

Glossary

  • VM snapshot: a snapshot of an entire VM, including its volumes, memory and CPU state. It resides on primary storage and is mainly used for reverting.
  • Volume snapshot: a backup of a volume that resides on secondary storage. It is mainly used for restoring.

Use cases

  • Create snapshot for a specified VM
  • Revert VM to a specified snapshot
  • Delete a specified snapshot
  • List snapshots for a specified VM
  • Support creating 'VM' snapshots (“preserve the state and data of a VM at a specific point in time”) of both powered-on and powered-off VMs
    • Provide choices for a) whether the memory state is captured and b) whether the file system is quiesced, if the VM is powered on
  • Remove a snapshot and delete any associated storage
  • Remove all snapshots of a VM
  • Revert to a snapshot
  • Admin can place a limit on the number of stored snapshots per user
  • Users can create snapshots manually or by setting up automatic recurring snapshot policies
    • Snapshots can be created on an hourly, daily, weekly, or monthly interval
    • One snapshot policy can be set up per VM
  • With each snapshot schedule, users can also specify the number of scheduled snapshots to be retained
    • Older snapshots that exceed the retention limit are automatically deleted
    • This user-defined limit must be equal to or lower than the global limit set by the CloudStack administrator
    • The limit applies only to snapshots taken as part of an automatic recurring snapshot policy. Additional manual snapshots can be created and retained

Feature Specifications

VM Snapshot creation

  • VM snapshots form a tree structure: each VM snapshot has at most one parent snapshot.
  • The current snapshot is the most recent snapshot relative to the current state of the VM (a VM may have snapshots but no current snapshot if snapshots have been deleted in the meantime)
  • There are two types of snapshots: disk, which snapshots all disks of the specified VM; and disk and memory, which snapshots the CPU/memory state in addition to the disks.
  • Disk snapshots are supported when the specified VM is in the running or stopped state
  • Disk and memory snapshots are supported when the specified VM is in the running state
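
The tree and the notion of a current snapshot can be illustrated with a small in-memory model. This is only a sketch: the class and method names are hypothetical, and it assumes that a revert makes the target snapshot current, so that the next snapshot starts a new branch under it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VMSnapshot:
    """Hypothetical in-memory stand-in for one VM snapshot record."""
    id: int
    snapshot_type: str            # "Disk" or "DiskAndMemory"
    parent: Optional[int] = None  # None marks the root of the tree
    current: bool = False


class SnapshotTree:
    def __init__(self) -> None:
        self.snapshots: Dict[int, VMSnapshot] = {}

    def current(self) -> Optional[VMSnapshot]:
        for snap in self.snapshots.values():
            if snap.current:
                return snap
        return None

    def add(self, snap_id: int, snapshot_type: str) -> VMSnapshot:
        # The new snapshot hangs off the current one and becomes current itself.
        prev = self.current()
        snap = VMSnapshot(snap_id, snapshot_type,
                          parent=prev.id if prev else None, current=True)
        if prev:
            prev.current = False
        self.snapshots[snap_id] = snap
        return snap

    def revert_to(self, snap_id: int) -> None:
        # Assumption: after a revert the target becomes the current snapshot.
        if snap_id not in self.snapshots:
            raise KeyError(snap_id)
        for snap in self.snapshots.values():
            snap.current = (snap.id == snap_id)

    def children(self, snap_id: int) -> List[int]:
        return sorted(s.id for s in self.snapshots.values() if s.parent == snap_id)


tree = SnapshotTree()
tree.add(1, "Disk")            # A, root
tree.add(2, "DiskAndMemory")   # B, child of A
tree.revert_to(1)
tree.add(3, "Disk")            # C, second child of A
```

Taking a snapshot, reverting to its parent and taking another one is what turns the chain into a tree with two branches under A.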

VM Snapshot limitations

  • Detaching or attaching a VM volume is not allowed while VM snapshots exist, because any change to the disk layout breaks the semantics of VM-based snapshots
  • A VM's memory snapshots are automatically discarded if the VM's service offering is upgraded
  • Volume snapshot operations and VM snapshot operations cannot be performed concurrently
  • For one VM, only one VM snapshot operation is allowed at a time (no concurrent operations)
  • Customers should only use CS to take snapshots. CS maintains the snapshot tree in its database; out-of-band snapshots are not tracked or synced to CS
  • Limit per account is not supported
  • Recurring snapshots are not supported
  • See the following table for the snapshot types each hypervisor supports

 

Hypervisor                     Disk-only snapshot (VM running)   Disk-only snapshot (VM stopped)   Memory-disk snapshot
Xenserver Free Edition         Yes                               Yes                               No
Xenserver Enterprise Edition   Yes                               Yes                               Yes
KVM                            No                                Yes                               Yes
VMware                         Yes                               Yes                               Yes
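
The support table can also be read as a lookup. The sketch below encodes it as a dictionary; the key names and the `is_supported` helper are hypothetical, and the rule that a disk and memory snapshot needs a running VM is taken from the creation section above.

```python
# Rows copied from the table; the dictionary keys are illustrative names only.
SUPPORT = {
    "Xenserver Free Edition": {"disk_running": True, "disk_stopped": True, "disk_and_memory": False},
    "Xenserver Enterprise Edition": {"disk_running": True, "disk_stopped": True, "disk_and_memory": True},
    "KVM": {"disk_running": False, "disk_stopped": True, "disk_and_memory": True},
    "VMware": {"disk_running": True, "disk_stopped": True, "disk_and_memory": True},
}


def is_supported(hypervisor: str, with_memory: bool, vm_running: bool) -> bool:
    """Return True if the requested snapshot type is listed as supported."""
    row = SUPPORT[hypervisor]
    if with_memory:
        # Disk and memory snapshots are only defined for a running VM.
        return vm_running and row["disk_and_memory"]
    return row["disk_running"] if vm_running else row["disk_stopped"]
```

For example, `is_supported("KVM", with_memory=False, vm_running=True)` is False, which matches the "No" in the KVM row.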

VM Snapshot deletion

  • Deleting a snapshot should not have any impact on its subsequent snapshots
  • Snapshots are destroyed when the VM is destroyed

VM Snapshot revert

  • Reverting a running or stopped VM to a disk+memory snapshot leaves the VM in the running state
  • Reverting a running or stopped VM to a disk snapshot leaves the VM in the stopped state

VM Snapshot List

  • Snapshots can be listed with commonly used parameters such as vmId, account, domainId and state
  • Support query by keyword (unimplemented)

Performance consideration

  • Both create and revert should complete within seconds
  • As the number of snapshots for one VM grows, performance may degrade. Users should keep the length of the VM snapshot chain under control.

Event

  • Generate VM_SNAPSHOT related events

Capacity

  • VM snapshots reside on primary storage and occupy extra space; this should be reflected in the used-capacity statistics
  • For Xenserver, a snapshot consists of a list of VDIs: snapshot leaf node VDIs, the parent of the snapshot leaf node (base copy VDI), suspended image VDIs, and active VDIs (current volume). The total used capacity is the sum of the physical sizes of all VDI nodes except the active VDI.
  • For VMware, a snapshot consists of a list of vmdk files and a list of vmsn files (memory image). The total used capacity is the sum of the sizes of these files, except for the current active vmdk.
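
Both rules come down to the same arithmetic: add up every node and leave out the active one. A minimal sketch, with a hypothetical `StorageNode` type and made-up sizes:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StorageNode:
    """Hypothetical description of one VDI (Xenserver) or file (VMware)."""
    name: str
    physical_size: int    # bytes
    active: bool = False  # True for the current volume / active vmdk


def snapshot_used_capacity(nodes: Iterable[StorageNode]) -> int:
    """Sum the physical size of every node except the active one."""
    return sum(n.physical_size for n in nodes if not n.active)


# Illustrative numbers only.
nodes = [
    StorageNode("base copy VDI", 100),
    StorageNode("snapshot leaf VDI", 10),
    StorageNode("suspended image VDI", 40),
    StorageNode("active VDI", 25, active=True),
]
used = snapshot_used_capacity(nodes)
```

With these numbers `used` is 150: the 25 bytes of the active VDI are not counted.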

Global config/Limit

  • Add a global configuration for maximum number of VM snapshots a VM can support
  • Domain/account limit for VM snapshot (TBD)

Usage

  • When a VM snapshot is created or deleted, the used capacity of the snapshot is recalculated and an event is published for each of the VM's volumes; the size of the memory image is added to the ROOT volume.
    Because volumes can reside on different kinds of storage pools, VM snapshot usage is tracked per volume. A usage record includes: volume_id, zone_id, account_id, domain_id, vm_id, size, disk_offering, startdate and enddate
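
The per-volume accounting could look roughly like the following. The `Volume` type, the `"ROOT"`/`"DATADISK"` labels and the function name are assumptions made for illustration; only the rule that the memory image is charged to the ROOT volume comes from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Volume:
    id: int
    type: str           # "ROOT" or "DATADISK" (illustrative labels)
    snapshot_size: int  # bytes used by VM snapshots of this volume


def usage_sizes(volumes: List[Volume], memory_image_size: int) -> Dict[int, int]:
    """Return the size to report per volume_id in the usage events."""
    sizes: Dict[int, int] = {}
    for vol in volumes:
        size = vol.snapshot_size
        if vol.type == "ROOT":
            # The memory image has no volume of its own, so it is
            # accounted for on the ROOT volume.
            size += memory_image_size
        sizes[vol.id] = size
    return sizes


sizes = usage_sizes([Volume(10, "ROOT", 5), Volume(11, "DATADISK", 3)], memory_image_size=2)
```

Here the ROOT volume (id 10) is reported with size 7 and the data disk (id 11) with size 3.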

Restriction on VM with vmsnapshots

  • attaching/detaching a volume to/from this VM is not allowed
  • attaching/detaching this VM to/from a network is not allowed
  • volume resize for this VM is not allowed
  • changing the offering is not allowed
  • volume snapshot is not allowed; this is not a major use case, and we may revisit it later
  • volume migration for this VM is not allowed
  • VM storage migration for this VM is not allowed
  • VM scale for this VM is not allowed
  • VM reset for this VM is not allowed
  • Restriction on reverting to a VM snapshot: currently, a revert does not go through the planner, so the VM state change and capacity are not tracked. A revert that would change the VM state is therefore not allowed; we will revisit this later.
    1. a running VM can't revert to a VM snapshot without memory (this would change the VM state from running to stopped)
    2. a stopped VM can't revert to a VM snapshot with memory (this would change the VM state from stopped to running)
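
The revert restriction can be expressed as a small check: work out the state the VM would end up in, and allow the revert only if that equals the state it is already in. The state strings and function names below are assumptions; the snapshot type names follow the `vm_snapshot_type` enum.

```python
def state_after_revert(snapshot_type: str) -> str:
    """A snapshot with memory resumes a running VM; a disk-only one leaves it stopped."""
    if snapshot_type == "DiskAndMemory":
        return "Running"
    if snapshot_type == "Disk":
        return "Stopped"
    raise ValueError(f"unknown snapshot type: {snapshot_type}")


def revert_allowed(vm_state: str, snapshot_type: str) -> bool:
    """Under the current restriction, a revert must not change the VM state."""
    return state_after_revert(snapshot_type) == vm_state
```

So `revert_allowed("Running", "Disk")` and `revert_allowed("Stopped", "DiskAndMemory")` are both False, matching the two numbered cases.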

Architecture and Design description

API

API                  Parameters                                             Response
createVMSnapshot     vmId (required)                                        vmSnapshot
deleteVMSnapshot     vmSnapshotId (required)                                jobid
listVMSnapshot       id, domainid, state, accountId, vmId (all optional)    vmSnapshot[]
revertToVMSnapshot   vmSnapshotId (required)                                VM
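
As a rough illustration of how a client might assemble one of these calls, the sketch below builds the query string and checks the required parameters from the table. It uses the parameter names exactly as written in this spec, which may differ from the names in the released API, and it leaves out authentication and request signing entirely. `build_query` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Required parameters per command, as listed in the table above.
REQUIRED = {
    "createVMSnapshot": ["vmId"],
    "deleteVMSnapshot": ["vmSnapshotId"],
    "listVMSnapshot": [],
    "revertToVMSnapshot": ["vmSnapshotId"],
}


def build_query(command: str, **params: str) -> str:
    """Validate required parameters and return an unsigned query string."""
    if command not in REQUIRED:
        raise ValueError(f"unknown command: {command}")
    missing = [name for name in REQUIRED[command] if name not in params]
    if missing:
        raise ValueError(f"{command} is missing required parameters: {missing}")
    return urlencode({"command": command, **params})


query = build_query("createVMSnapshot", vmId="42")
```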

UI Change

  • Add a snapshot action and [view snapshots] to the VM detail page

  • Snapshots list

  • VM snapshot detail

Database Schema

New table: vm_snapshots (only important columns are listed here)

Column             Comment
id                 primary key, auto-increment
uuid               unique key
name               unique internal name generated by the system, like i-2-58-TEST_VS_20121118140427
display_name       snapshot name provided by the user when creating the VM snapshot
description        a short description provided by the user when creating the VM snapshot
account_id         owner
domain_id
vm_snapshot_type   enum {Disk, DiskAndMemory}
state              VM snapshot state
parent             parent VM snapshot id
current            whether this VM snapshot is the current one
vm_id              VM id
updated
created
removed

HighLevel WorkFlow

VMSnapshot state machine

createVMSnapshot:

Common workflow

  1. check permissions, concurrency, existence, and so on
  2. allocate a VM snapshot entry in the DB
  3. transition the VM and VM snapshot states to snapshotting/creating
  4. prepare the TO object and CreateVMSnapshotCommand
  5. send the command to the agent
  6. update the DB (for example the current/parent fields or the volume table), depending on CreateVMSnapshotAnswer and the TO object
  7. transition the VM and VM snapshot states

Xenserver

  1. check if this VM snapshot already exists; if yes, return succeeded
  2. check if there is an existing snapshot task for this VM snapshot; if yes, this is a re-entrant call from fullSync, so skip creation and wait for that task
  3. find the target VM, or build a worker VM on the fly if it does not exist
  4. depending on the snapshot type, call the corresponding Xenserver APIs
  5. Xenserver does not change volume paths after taking a VM snapshot, so there is no need to pack volumeTO into the answer object

KVM

  1. check if this VM snapshot already exists; if yes, return succeeded
  2. find the target VM, or build a worker VM on the fly if it does not exist
  3. based on the VMSnapshotTO object in the command, re-define the parent snapshots' metadata chain on the fly
  4. call the libvirt API to take the snapshot

VMware

  1. check if this VM snapshot already exists; if yes, return succeeded
  2. check if there is an existing vm.snapshot task for this VM snapshot; if yes, wait for it and skip snapshot creation
  3. call the VMware SDK to take the snapshot
  4. because volume paths change after taking a snapshot, return the new volume paths in the answer

revertToVMSnapshot:

Common workflow

  1. check permissions, concurrency, existence
  2. call advanceStart or advanceStop first if the revert will change the VM's state; for example, when reverting a stopped VM to a DiskAndMemory snapshot, start the VM first and then revert it
  3. transition the VM and VM snapshot states to reverting
  4. prepare the TO objects and send the command
  5. update the DB with information from the Answer object
  6. transition the VM and VM snapshot states

Xenserver

  1. build worker VM if target VM does not exist
  2. call revert plugin
  3. update volumeTO

KVM

  1. find the target VM, or define a worker VM on the fly if it does not exist
  2. based on the VMSnapshotTO object in the command, re-define the parent snapshots' metadata chain on the fly
  3. call the libvirt API to revert

VMware

  1. check if there is an existing revert task for this VM; if yes, wait for it
  2. call the VMware SDK to revert
  3. update volumeTO

deleteVMSnapshot:

Unlike VM expunging, VM snapshot deletion is designed as a synchronous operation; there is no daemon thread that scans for and expunges snapshots.

The implementation is fairly straightforward:

  1. transition the VM snapshot to the expunging state
  2. prepare the TO object and send the command
  3. update the snapshot tree
  4. mark the snapshot as removed
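
Step 3 is not detailed in this spec. One way to keep later snapshots unaffected, shown here purely as an assumption, is to re-attach the children of the deleted snapshot to its parent. The tree is modelled as a plain mapping from snapshot id to parent id; handling of the `current` flag is left out.

```python
from typing import Dict, Optional


def delete_snapshot(parents: Dict[int, Optional[int]], snap_id: int) -> None:
    """Remove snap_id and re-attach its children to its own parent."""
    grandparent = parents.pop(snap_id)
    for child, parent in parents.items():
        if parent == snap_id:
            # Only values are rewritten here, so iterating is safe.
            parents[child] = grandparent


# 1 is the root, 2 is its child, 3 and 4 are children of 2.
parents = {1: None, 2: 1, 3: 2, 4: 2}
delete_snapshot(parents, 2)
```

After the call, snapshots 3 and 4 both have snapshot 1 as their parent, so the tree stays connected.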

VMSnapshotSync:

  1. Add VM snapshot sync to fullSync and fullHostSync.
  2. It checks whether any VM snapshots are in transient states.
  3. A transient state found during host connection usually means a management server restart or outage, or a hypervisor cluster going down. Because the management server cannot know whether those tasks succeeded, it re-sends the command in question.
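
The sync pass can be pictured as a filter over the snapshot records followed by a lookup of the command to replay. In this sketch the state names follow the wording used in this document and may not match the real enum, `"Ready"` stands in for any stable state, and only `CreateVMSnapshotCommand` is named in the spec; the other two command names are hypothetical.

```python
from typing import Dict, List, Tuple

# Transient state -> command to re-send (names partly assumed, see above).
COMMAND_FOR_TRANSIENT_STATE: Dict[str, str] = {
    "Creating": "CreateVMSnapshotCommand",
    "Reverting": "RevertToVMSnapshotCommand",  # hypothetical name
    "Expunging": "DeleteVMSnapshotCommand",    # hypothetical name
}


def commands_to_resend(snapshots: List[dict]) -> List[Tuple[int, str]]:
    """Return (snapshot id, command) pairs for snapshots stuck in a transient state."""
    resend = []
    for snap in snapshots:
        command = COMMAND_FOR_TRANSIENT_STATE.get(snap["state"])
        if command is not None:
            resend.append((snap["id"], command))
    return resend


found = commands_to_resend([
    {"id": 1, "state": "Ready"},
    {"id": 2, "state": "Creating"},
    {"id": 3, "state": "Expunging"},
])
```

Because a command may be sent twice, the hypervisor-side handlers need to be re-entrant, which is what the "already exists" and "existing task" checks in the create workflows provide.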

Enable/disable per hypervisor:

Enable or disable the feature through hypervisor_capabilities:

Add a new column `vm_snapshot_enabled` to table `hypervisor_capabilities`, and change the related VO/Dao

Set vm_snapshot_enabled = 1 for VMware/Xenserver

Check hypervisor_capabilities in createVMSnapshot
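
A minimal sketch of that check follows. The capability rows are modelled as a dictionary keyed by hypervisor type only, which is a simplification; treating any hypervisor without the flag as disabled is an assumption, and both function names are hypothetical.

```python
# Stand-in for rows of hypervisor_capabilities.
HYPERVISOR_CAPABILITIES = {
    "VMware": {"vm_snapshot_enabled": 1},
    "Xenserver": {"vm_snapshot_enabled": 1},
    "KVM": {"vm_snapshot_enabled": 0},  # assumption: left at the default
}


def vm_snapshot_enabled(hypervisor_type: str) -> bool:
    row = HYPERVISOR_CAPABILITIES.get(hypervisor_type, {})
    return row.get("vm_snapshot_enabled", 0) == 1


def create_vm_snapshot(vm_id: int, hypervisor_type: str) -> str:
    # Reject the request up front, before any DB entry is allocated.
    if not vm_snapshot_enabled(hypervisor_type):
        raise ValueError(f"VM snapshot is not enabled for {hypervisor_type}")
    return f"snapshot requested for VM {vm_id}"
```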

Testing

The following basic test scenarios are suggested (the list is not exhaustive):

Create one VM snapshot with snapshotMemory (on, off) for (VMware, Xenserver, KVM) when the VM is (running, stopped)

Revert to the previous snapshot when the VM is (running, stopped)

Create multiple VM snapshots with snapshotMemory (on, off, mixed) for (VMware, Xenserver, KVM) when the VM is (running, stopped); the snapshots should form a tree hierarchy, such as:

    A
  /   \
 B     C

Revert to any snapshot in the tree when the VM is (running, stopped)

Delete (current, any, all) VM snapshots for (VMware, Xenserver, KVM)

Attach/detach a volume to/from a VM when this VM has VM snapshots

Upgrade the VM's serviceOffering when the VM has snapshots with snapshotMemory (on, off)

Take a volume snapshot when the associated VM has VM snapshots

2 Comments

  1. VMSnapshotSync needs details on how to manage out-of-band changes made on the vCenter side: not only the state of a particular snapshot, but also possible changes to the snapshot tree. From this perspective, letting CloudStack work as a transparent proxy to vCenter would make the solution more integrated with vCenter. Would we consider that?

    1. Yes, for VMware we can retrieve the tree from vCenter and sync the information to the database, but for Xenserver/KVM there is no way to track out-of-band changes, so we have to require customers to use only CloudStack to take snapshots.

      Currently, VMSnapshotSync only handles VM snapshots in transient states (snapshotting, reverting, expunging, and so on). This sync is started in fullSync, because snapshots usually remain in a transient state when the management server restarts or the cluster goes down.