...

Data copy can take hours to complete, depending on the volume of data on each instance and on network bandwidth limitations. If the source instance is down and the new instance takes multiple hours to come up, the destination instance will lag far behind. The time gap between the source instance going down and the destination instance coming up can be minimized by initiating a first round of data copy before bringing the source Cassandra instance down. The source Cassandra process can continue writing data and adding, deleting, or changing files (due to compaction, repairs, writes, etc.) during the first data copy. Once the initial data copy is complete, the source can be brought down and another data copy (a final one) can be initiated so that the destination instance has the same data as the source. This final data copy will be quick compared to the initial one, as a good amount of data already exists at the destination. The procedure now looks as follows:

...

  1. First, identify the destination instance. Operators may change their configuration, such as the topology, to bring the destination instance into the cluster. How these configuration changes are made and applied is not in the scope of this document.
  2. Keep the Cassandra process running as usual at the source. Do not start the Cassandra process on the destination, but do start Sidecar on the destination host/instance. Submit a data copy request to the destination Sidecar to pull data from the source using Sidecar.
    1. During data copy, the destination Sidecar first pulls the list of files present at the source at that point in time.
    2. Then, the destination Sidecar checks the list of files present locally against the list of files at the source.
      1. If a file is present at the source but not at the destination, then that file will be added to the download list.
      2. If a file is present at the destination but does not match the source file (by size or timestamp), then the local file is deleted and added to the list of files to download. This also avoids filling up the disk with unnecessary files. If the file matches, it is excluded from the download list.
      3. The shortlisted files are then downloaded.
    3. While the data is being copied, the source continues to write and update data, adding more SSTables or deleting existing ones due to compaction, etc. It is fine to be in this state at this stage.
    4. After downloading the list of files, the destination may not have some files or may have some files that are no longer present at the source. It may not be possible to ensure that 100% of the data at the destination matches the source as the source continues to run and makes changes. We can relax 100% matching of the data to some lower threshold to consider the data copy a successful one. This success threshold can be specified as part of the request payload.
    5. It could be possible that, after downloading the files once, the threshold was not met. This process can be repeated multiple times to meet the threshold. The number of iterations can be specified as part of the request payload. If the threshold is met in an iteration, then the data copy operation ends with a success status.
    6. If the threshold is not met even after multiple iterations, then the data copy task fails. The operators can still continue the migration if they want, because the data will be copied again after bringing down the source.
  3. Now, bring down the Cassandra process at the source, keeping the Sidecar up and running. Once the source Cassandra process goes down, we can expect that no further changes will be made to the files. How the instances are brought down is not in the scope of this document.
  4. Initiate the final data copy at the destination with a 100% success threshold.
    1. Destination pulls a list of files, deleting unnecessary files, skipping matching files, and downloading the required files. Afterwards, it compares the files with the source and checks if there is a 100% match.
    2. The shorter the final copy, the smaller the time gap between the source instance going down and the destination coming up.
  5. The operator can verify that the files at the destination match the source. In the first iteration of this feature, an API is introduced to calculate a digest over the list of file names and their lengths to identify any mismatches. It does not validate the file contents at the binary level, but such a feature can be added at a later point in time.
  6. Now, bring up the Cassandra instance at the destination. At this moment, the destination will have the same data as its source and will be equivalent to an instance coming up with a new IP address. The Sidecar at the source can now be brought down.
  7. How to bring Cassandra/Sidecar instances up or down, and how to make or apply configuration changes, are outside the scope of this document.
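
The file comparison and the threshold-driven iterations in steps 2 and 4 can be sketched roughly as follows. This is only an illustration, not the Sidecar implementation: the names `FileInfo`, `plan_sync`, `match_ratio`, and `copy_until_threshold` are hypothetical, the match ratio is assumed to be the fraction of source files present at the destination with the same size and timestamp, and deleting destination files that no longer exist at the source is assumed from the "deleting unnecessary files" wording in step 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical file descriptor: size in bytes and last-modified timestamp."""
    size: int
    mtime: int


def plan_sync(source, local):
    """Compare two {relative_path: FileInfo} maps.

    Returns (to_download, to_delete) as sorted lists of paths.
    """
    to_download, to_delete = [], []
    for path, info in source.items():
        if path not in local:
            to_download.append(path)   # present at source, missing locally
        elif local[path] != info:
            to_delete.append(path)     # size or timestamp mismatch: delete,
            to_download.append(path)   # then download again
        # matching files are skipped
    for path in local:
        if path not in source:
            to_delete.append(path)     # assumption: extra local files are removed
    return sorted(to_download), sorted(to_delete)


def match_ratio(source, local):
    """Assumed metric: fraction of source files matched at the destination."""
    if not source:
        return 1.0
    matched = sum(1 for path, info in source.items() if local.get(path) == info)
    return matched / len(source)


def copy_until_threshold(list_source, list_local, apply_plan,
                         threshold, max_iterations):
    """Repeat the copy until the threshold is met or iterations run out.

    list_source/list_local return the current file maps; apply_plan performs
    the deletions and downloads. All three are caller-supplied callables.
    """
    for _ in range(max_iterations):
        to_download, to_delete = plan_sync(list_source(), list_local())
        apply_plan(to_download, to_delete)
        # Re-list both sides, since the source may have changed meanwhile.
        if match_ratio(list_source(), list_local()) >= threshold:
            return True
    return False
```

Under these assumptions, the initial copy would call `copy_until_threshold` with a threshold below 1.0 and several iterations, while the final copy after the source Cassandra process is stopped would use a threshold of 1.0.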
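
The verification in step 5 computes a digest over file names and lengths rather than file contents. A minimal sketch of that idea is shown below. The hash algorithm (SHA-256), the entry encoding, and the function names `list_files` and `file_list_digest` are assumptions made for illustration; the document does not specify how the actual API computes its digest.

```python
import hashlib
import os


def list_files(root):
    """Map each file under root to its length, keyed by path relative to root."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getsize(full)
    return files


def file_list_digest(files):
    """Digest over sorted (path, length) entries; file contents are never read."""
    h = hashlib.sha256()
    for path, length in sorted(files.items()):
        h.update(path.encode("utf-8"))
        h.update(b"\0")                        # separator between name and length
        h.update(str(length).encode("ascii"))
        h.update(b"\n")                        # separator between entries
    return h.hexdigest()
```

If the source and the destination produce the same digest, the two file lists agree on names and lengths. Because contents are not hashed, two files of equal name and length but different bytes would not be detected, which matches the limitation stated in step 5.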

...

Tests covering the end-to-end flow will be added in Cassandra Sidecar.

Rejected Alternatives

...