DRAFT (this answer is to be reviewed)
Question
Deleted documents have an overhead in CouchDB because a tombstone document exists for each deleted document. One consequence of tombstone documents is that compaction gets slower over time.
Three options for purging tombstone documents from a CouchDB database are:
- Create a new database for every N time period (and delete that database when the period expires)
- Filtered replication
- Do nothing
How can I choose which option is the most suitable?
Answer
Each approach is described below. Note that you may need to use a combination of the first two approaches in your application. Alternatively, you may find through testing that your tombstone documents don't add significant overhead and can simply be left as they are.
Create a new database for every N time period
When to use this approach?
This approach works best when you know the expiry date of a document at the time when the document is first saved.
How does it work?
Each document to be saved that has a known expiry date will be stored in a database that will get dropped when its expiry date has been reached.
When the document is being saved, if the database doesn't already exist then a new database must be created.
The rationale for this approach is that dropping a database is an inexpensive operation and does not leave tombstone documents on disk.
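The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (`dbNameForExpiry`, `ensureDatabase`, `saveWithExpiry`, `dropExpiredDatabase`), the `events` prefix, and the choice of one database per calendar month are all assumptions made for illustration. It uses the standard CouchDB HTTP endpoints `PUT /{db}`, `POST /{db}` and `DELETE /{db}`, and assumes a runtime with a global `fetch` (for example Node 18 or later). Authentication is omitted.

```javascript
// Hypothetical helper: map a document's expiry date to a per-month database name.
// CouchDB database names must start with a lowercase letter.
function dbNameForExpiry(expiry, prefix = "events") {
  const year = expiry.getUTCFullYear();
  const month = String(expiry.getUTCMonth() + 1).padStart(2, "0");
  return `${prefix}_${year}_${month}`;
}

// Create the database if it does not exist yet.
// PUT /{db} answers 201 or 202 on creation and 412 if the database already exists.
async function ensureDatabase(baseUrl, dbName, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/${dbName}`, { method: "PUT" });
  if (![201, 202, 412].includes(res.status)) {
    throw new Error(`Could not create ${dbName}: HTTP ${res.status}`);
  }
}

// Save a document into the database that matches its expiry period.
async function saveWithExpiry(baseUrl, doc, expiry, fetchImpl = fetch) {
  const dbName = dbNameForExpiry(expiry);
  await ensureDatabase(baseUrl, dbName, fetchImpl);
  const res = await fetchImpl(`${baseUrl}/${dbName}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc),
  });
  if (!res.ok) {
    throw new Error(`Could not save document: HTTP ${res.status}`);
  }
  return res.json();
}

// Drop a whole period once it has expired. No tombstones are left behind.
async function dropExpiredDatabase(baseUrl, dbName, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/${dbName}`, { method: "DELETE" });
  if (!res.ok && res.status !== 404) {
    throw new Error(`Could not drop ${dbName}: HTTP ${res.status}`);
  }
}
```

With this naming scheme a document lives until the end of the month in which it expires, so the period length N determines how precisely expiry is honoured.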
Gotchas
It is not possible to query across databases in Cloudant/CouchDB. Cross-database queries will need to be performed in the application itself. This will be an issue if the cross-database queries require aggregating lots of data.
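To illustrate what aggregating in the application can look like, here is a minimal sketch that merges the grouped output of the same view queried on several per-period databases. The function name `mergeReducedRows` is hypothetical, and the sketch assumes the view uses an additive numeric reduce such as `_sum` or `_count`. Other reduce functions would need their own merge logic.

```javascript
// rowsPerDatabase: one `rows` array per database, each in the shape returned by
// GET /{db}/_design/{ddoc}/_view/{view}?group=true, i.e. [{ key, value }, ...].
// Assumes `value` is a number produced by an additive reduce (_sum or _count).
function mergeReducedRows(rowsPerDatabase) {
  const totals = new Map();
  for (const rows of rowsPerDatabase) {
    for (const { key, value } of rows) {
      // View keys may be arrays or objects, so compare them by their JSON form.
      const k = JSON.stringify(key);
      totals.set(k, (totals.get(k) || 0) + value);
    }
  }
  return [...totals].map(([k, value]) => ({ key: JSON.parse(k), value }));
}
```

Every row from every database has to travel to the application before it can be merged, which is why this becomes costly as the number of databases or keys grows.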
Filtered replication
When to use this approach?
This approach works best when you don't know the expiry date of a document at the time when the document is first saved, or if you would have to perform cross database queries that would involve moving lots of data to the application so that it can be aggregated.
How does it work?
This approach relies on creating a new database at an opportune time (NOTE 1) and replicating all documents to it except for the tombstone documents. A validate_doc_update (VDU) function is installed in the new (target) database so that deleted documents with no existing entry in the target are rejected. When replication is complete (or acceptably up to date if using continuous replication), switch your application to use the new database and delete the old one. There is currently no way to rename databases, but you could use a virtual host which points to the "current" database.
An example of such a VDU function is below (Source: http://markmail.org/message/vti566mjxmb5g2d7)
```javascript
function (newDoc, oldDoc, userCtx) {
  // any update to an existing doc is OK
  if (oldDoc) {
    return;
  }
  // reject tombstones for docs we don't know about
  if (newDoc["_deleted"]) {
    throw({forbidden : "We're rejecting tombstones for unknown docs"});
  }
}
```
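Putting the pieces together, the procedure could be scripted along the following lines. This is a sketch under stated assumptions rather than a tested migration tool: the design document name `_design/no_tombstones` and the helper names are hypothetical, source and target are passed to `POST /_replicate` as full URLs, credentials are omitted, and a global `fetch` is assumed. The VDU is installed before replication starts so that no tombstone can reach the target first.

```javascript
// The VDU shown above, kept as source text because design documents
// store their functions as strings.
const rejectUnknownTombstones = `function (newDoc, oldDoc, userCtx) {
  if (oldDoc) { return; }
  if (newDoc["_deleted"]) {
    throw({forbidden : "We're rejecting tombstones for unknown docs"});
  }
}`;

// Hypothetical design document that carries the VDU.
function buildDesignDoc() {
  return {
    _id: "_design/no_tombstones",
    validate_doc_update: rejectUnknownTombstones,
  };
}

// Body for POST /_replicate.
function buildReplicationRequest(sourceUrl, targetUrl, continuous = false) {
  return { source: sourceUrl, target: targetUrl, continuous };
}

async function replicateWithoutTombstones(baseUrl, oldDb, newDb, fetchImpl = fetch) {
  const headers = { "Content-Type": "application/json" };

  // 1. Create the new, empty target database.
  let res = await fetchImpl(`${baseUrl}/${newDb}`, { method: "PUT" });
  if (!res.ok) throw new Error(`create target failed: HTTP ${res.status}`);

  // 2. Install the VDU in the target before any document arrives.
  res = await fetchImpl(`${baseUrl}/${newDb}/_design/no_tombstones`, {
    method: "PUT",
    headers,
    body: JSON.stringify(buildDesignDoc()),
  });
  if (!res.ok) throw new Error(`install VDU failed: HTTP ${res.status}`);

  // 3. Replicate from the old database into the new one.
  res = await fetchImpl(`${baseUrl}/_replicate`, {
    method: "POST",
    headers,
    body: JSON.stringify(
      buildReplicationRequest(`${baseUrl}/${oldDb}`, `${baseUrl}/${newDb}`)
    ),
  });
  if (!res.ok) throw new Error(`replication failed: HTTP ${res.status}`);
  return res.json();
}
```

Switching the application over and deleting the old database are left as separate, manual steps here, since they should only happen once the replication result has been checked.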
Gotchas
TODO what are the gotchas?
NOTE 1: TODO what is an opportune time? Probably before the database being replicated runs into compaction problems if it was left alone?