SIP-12 proposes a collection of changes to Solr's existing backup/restore functionality.  Among these is support for a new backup format that allows backups to be done incrementally.  This page details the specific file layout being proposed.  For other aspects of the incremental backup proposal, or of the SIP more broadly, see the main proposal page here.

High-Level Layout

Backups under the proposed incremental format would create a file tree like the one shown below at a given "location" (i.e. location parameter value) in the backup repository.  Each "location" can store an arbitrary number of backups, organized at the highest level by the "backup name" (usually but not always the name of the collection being backed up).


Overall File Layout
/backup_location
    /techproducts
        /backup_0.properties
        /backup_1.properties
        /shard_backup_metadata
            /md_shard1_id_0.json
            /md_shard2_id_0.json
            /md_shard1_id_1.json
            /md_shard2_id_1.json
        /index
            /0DD2971A-53D6-4224-A49B-8AC90D158F97
            /1AA2CF56-BFA0-40D5-8B9B-5CAD47B07396
            ...
        /zk_backup_0
            /conf
                <configset files>
            /state.json
            ...
        /zk_backup_1
            ...

The file listing above shows a single backup "location" containing two incremental backups ("0" and "1").  Several different classes of files can be seen:

  • Collection Backup Metadata Files (e.g. backup_0.properties) - top-level metadata about a collection backup.  Contains pointers to the related ZooKeeper files and per-shard metadata files.  This file is written last during a backup, ensuring that backups are only advertised once all underlying files are already in place.
  • Per-Shard Backup Metadata Files (e.g. shard_backup_metadata/md_shard1_id_0.json) - contain pointers to all index files for a specific shard, along with metadata used for validation purposes (checksum, etc.).  These files live in the containing "shard_backup_metadata" folder.
  • Individual Index Files (e.g. index/0DD2971A-53D6-4224-A49B-8AC90D158F97) - backed up Lucene index files.  All index files from all shards live together in a single containing "index" folder.  UUIDs are used for file names in the backup repository, to avoid naming conflicts that would otherwise arise in multi-shard collections, etc.  (The UUID-to-original-filename mapping is recorded in per-shard metadata files; a sketch of this naming step follows this list.)
  • ZooKeeper Config Backup (e.g. zk_backup_0/...) - ZK config affecting the collection being backed up.  Includes a clone of the configset, the DocCollection/collection-state, and any Collection Properties for the collection in question.  No attempt is made to be "incremental" in the ZK data backed up - it is refetched and stored for each backup.
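
The bullet on individual index files above glosses over how a local file becomes a UUID-named repository file.  The hedged Java sketch below shows one way that step could look: a checksum and the file size are computed while reading the local file, and a fresh UUID becomes its repository name.  CRC32 is used here only as a stand-in for whatever checksum is actually recorded, and the class and method names are illustrative assumptions, not Solr's backup code.

Index File Naming Sketch (illustrative)
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.zip.CRC32;

// Hedged sketch only: give a local Lucene index file a repository-unique UUID
// name and compute the checksum/size that the per-shard metadata records.
public class IndexFileNamingSketch {

  public record UploadCandidate(String uuidName, String originalName, long checksum, long size) {}

  public static UploadCandidate describe(Path localIndexFile) throws IOException {
    CRC32 crc = new CRC32();          // stand-in checksum; the real implementation may differ
    long size = 0;
    try (InputStream in = Files.newInputStream(localIndexFile)) {
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        crc.update(buffer, 0, read);
        size += read;
      }
    }
    String uuidName = UUID.randomUUID().toString().toUpperCase();   // e.g. "0DD2971A-53D6-..."
    return new UploadCandidate(uuidName, localIndexFile.getFileName().toString(), crc.getValue(), size);
  }
}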


Example Metadata Files

Most of the files in the listing above are either Lucene index files or ZooKeeper data backups.  The only new files are the collection and shard-level metadata files for each backup.  These are worth a closer look.

Collection-Level Metadata File

The "collection-level" metadata file for each backup includes a variety of information about the backup, as shown below.

Collection-Level Metadata File
backupName:<...>
collection:<name of the collection>
collectionAlias:<..>
collection.configName:<..>
startTime:<time of the backup creation>
index.version:<LUCENE_8_2_0>
shard1.md:<metadata file for shard1>
shard2.md:<metadata file for shard2>
numberOfIndexFiles:<>
indexSize:<>

This file holds a mix of metadata (collection, startTime, index.version, etc.) and navigational pointers used to look up information for each shard (shard1.md, shard2.md).  When creating a backup this file is written last, allowing backup code to rely on its presence as an indicator that the backup completed successfully.
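
To make the navigational role of this file concrete, here is a minimal Java sketch that reads a backup_N.properties file with java.util.Properties and collects the per-shard metadata pointers (the "<shard>.md" keys).  The class name and the assumption that every shard pointer key ends in ".md" are illustrative, not a description of Solr's actual implementation.

Collection-Level Metadata Parsing Sketch (illustrative)
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hedged sketch only: load a collection-level backup_N.properties file and
// return the per-shard metadata file names it points at.
public class BackupPropertiesSketch {

  public static List<String> shardMetadataPointers(Path propertiesFile) throws IOException {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(propertiesFile)) {
      props.load(in);
    }
    List<String> pointers = new ArrayList<>();
    for (String key : props.stringPropertyNames()) {
      if (key.endsWith(".md")) {            // e.g. "shard1.md", "shard2.md"
        pointers.add(props.getProperty(key));
      }
    }
    return pointers;
  }
}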

Shard-Level Metadata File

An example "shard-level" metadata JSON file is shown below.  It holds information about each index file required to restore the shard in question.  This includes all index files just uploaded, as well as any index files uploaded by previous backups that are still used by the shard.  Metadata is stored for each Lucene index file, including its original filename, the unique name given to the file for storage in the backup repository, and checksum and size information.

The unique filenames are required to avoid name conflicts between identically named files from different shards.  (Fun fact: name-conflict scenarios also crop up with single-shard collections following leadership changes, decisions to delete the entire index, etc.)

The checksum and size information stored with each file allows Solr to tell which files have changed since the last backup, and to skip re-uploading those that have not.

Shard-Level Metadata File
  "0DD2971A-53D6-4224-A49B-8AC90D158F97" : {
    "fileName" : "segments_10"
    "checksum" : "1238971231e6239"
    "size" : 101013
  },
  "1AA2CF56-BFA0-40D5-8B9B-5CAD47B07396" : {
    ...
  },
  ...
}
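
The change detection described above can be pictured with the hedged sketch below: a local index file is only uploaded again when the previous backup's shard metadata has no entry with the same original name, checksum, and size.  The record names and map shape are assumptions made for this example, not Solr's actual types.

Change Detection Sketch (illustrative)
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch only: decide which local index files still need uploading by
// comparing them against the entries recorded in the previous shard metadata.
public class ChangeDetectionSketch {

  public record LocalFile(String name, String checksum, long size) {}
  public record PreviousEntry(String checksum, long size) {}   // keyed by original file name

  public static List<LocalFile> filesNeedingUpload(List<LocalFile> localFiles,
                                                   Map<String, PreviousEntry> previousByName) {
    List<LocalFile> toUpload = new ArrayList<>();
    for (LocalFile local : localFiles) {
      PreviousEntry prev = previousByName.get(local.name());
      boolean unchanged = prev != null
          && prev.checksum().equals(local.checksum())
          && prev.size() == local.size();
      if (!unchanged) {
        toUpload.add(local);   // new or changed file: upload it under a fresh UUID
      }
    }
    return toUpload;
  }
}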

Walkthrough Scenario: Repeated Backups on a Changing Collection

As a way to show how this file format allows incremental backups to be done, let's walk through two backups of the 'techproducts' collection.  This walkthrough focuses narrowly on the files read and written by Solr during the backup process.  It omits details such as communication between Solr nodes, overseer messages, etc., for brevity and because these pieces are mostly unaffected by this SIP.

Initial Backup

An excited first-time Solr user has just stood up a 'techproducts' collection on their SolrCloud cluster. They want to take a snapshot before tweaking some settings.

  1. Solr receives a request to back up the single-shard "techproducts" collection.
  2. Solr looks at the chosen repository, location, and collection/backup name to find the most recent backup available. (Unfortunately this does require a repository "list" operation on "/backup_location/techproducts" to identify the most recent backup_N.properties file. The cloud storage offered by many cloud providers is "eventually consistent", so these list operations are avoided wherever possible.) The returned file listing informs Solr that there are currently no backups for the specified collection at the specified location, so the current backup will be "0". (A sketch of this step follows this list.)
  3. Solr gathers the index files on the shard-leader. It gives each a UUID and uploads each file to /techproducts/index/<UUID>, computing a checksum and remembering the size as each file is uploaded.
  4. Solr uses the information computed during file-upload to create a shard-level metadata file, with pointers to each Lucene index file. This file is uploaded to the repository as /techproducts/shard_backup_metadata/md_shard1_id_0.json.
  5. With all index data uploaded, Solr creates the "zk_backup_0" directory under the collection's backup location, fetches all necessary data from ZK, and stores it there.
  6. With all other backup information persisted to the repository, Solr persists the collection-level metadata file "backup_0.properties" to advertise that a completed backup is now available.
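
Step (2) above, deciding that the new backup should be "0", boils down to listing the existing backup_N.properties files and taking the next id.  The hedged sketch below shows that logic against a placeholder listing interface; the Repository interface here is invented for illustration and is not Solr's BackupRepository API.

Next-Backup-Id Sketch (illustrative)
import java.util.List;

// Hedged sketch only: derive the next backup id from a listing of
// backup_N.properties files at the collection's backup location.
public class NextBackupIdSketch {

  interface Repository {                       // placeholder, not Solr's BackupRepository
    List<String> listFiles(String path);
  }

  static int nextBackupId(Repository repo, String collectionBackupPath) {
    int maxSeen = -1;
    for (String name : repo.listFiles(collectionBackupPath)) {
      if (name.startsWith("backup_") && name.endsWith(".properties")) {
        String id = name.substring("backup_".length(), name.length() - ".properties".length());
        maxSeen = Math.max(maxSeen, Integer.parseInt(id));
      }
    }
    return maxSeen + 1;   // no prior backups present => id 0
  }
}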

Second Backup

The Solr user further grooms their techproducts catalog. ("What are those currency docs doing in there anyways?") They're happy with the results and want to back up again.

  1. Solr receives a request to back up the single-shard "techproducts" collection.
  2. Solr looks at the chosen repository, location, and collection to find the most recent backup available. (As before, this requires a "list" operation.) The file listing informs Solr that there is an existing backup "0", making the current backup "1" accordingly.
  3. Solr reads backup_0.properties. In this file, Solr finds the pointer to the shard-metadata file for techproducts' only shard: '/techproducts/shard_backup_metadata/md_shard1_id_0.json'. Solr fetches this file as well.
  4. Solr gathers the index files on the shard leader. For each, it checks whether the file has already been uploaded according to the records in md_shard1_id_0.json. If the shard-metadata file has an entry for a given local file, and the recorded checksum and file-size match those exhibited by the local file, the local file is skipped. Otherwise the file is uploaded as in step (3) from "Initial Backup".
  5. Solr builds the md_shard1_id_1.json file from the data computed for the just-uploaded files, plus the entries carried over from the previous backup for files that matched. This file is uploaded as /techproducts/shard_backup_metadata/md_shard1_id_1.json. (A sketch of this step follows this list.)
  6. ZK data and the collection-level metadata file are created and stored as in the concluding steps of the "Initial Backup".
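
The hedged sketch below illustrates step (5): the new shard metadata is simply the union of entries carried over for unchanged files and entries for the files uploaded during this backup, both keyed by their UUID names in the repository.  The record and map shapes are illustrative assumptions rather than Solr's actual data structures.

Shard Metadata Merge Sketch (illustrative)
import java.util.HashMap;
import java.util.Map;

// Hedged sketch only: build the contents of the new shard metadata file
// (md_shard1_id_1) by combining carried-over entries with entries for
// files uploaded during this backup.
public class ShardMetadataMergeSketch {

  public record Entry(String fileName, String checksum, long size) {}

  public static Map<String, Entry> buildNewShardMetadata(
      Map<String, Entry> carriedOverByUuid,   // unchanged files, keyed by their existing UUID names
      Map<String, Entry> uploadedByUuid) {    // files uploaded in this backup, keyed by their new UUID names
    Map<String, Entry> merged = new HashMap<>(carriedOverByUuid);
    merged.putAll(uploadedByUuid);
    return merged;
  }
}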