Overview

One of the novel features of Apache™ NiFi™ is its notion of Data Provenance, also known as Data Lineage or Chain of Custody. The term Data Provenance is used most often, because it can be thought of as a superset of Data Lineage. While lineage encapsulates the idea of how data was derived, Data Provenance also encapsulates the state of that data at each stage along the way.
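To make the distinction concrete, here is a minimal Python sketch. It is not NiFi code: the `Event` record, its fields, and the `lineage` and `provenance` helpers are hypothetical names invented for illustration, and the event type strings are used only as sample labels. Lineage answers "which steps produced this data?", while provenance answers the same question and also reports the state of the data at each step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical provenance record (an assumption, not NiFi's event class)."""
    event_id: int
    event_type: str                    # sample label such as "RECEIVE" or "SEND"
    parent_event_id: Optional[int]     # event this one was derived from, if any
    attributes: Dict[str, str] = field(default_factory=dict)  # state at this stage


def _chain(events: Dict[int, Event], event_id: int) -> List[Event]:
    """Walk parent links from the given event back to the origin, oldest first."""
    chain: List[Event] = []
    current = events.get(event_id)
    while current is not None:
        chain.append(current)
        parent_id = current.parent_event_id
        current = events.get(parent_id) if parent_id is not None else None
    chain.reverse()
    return chain


def lineage(events: Dict[int, Event], event_id: int) -> List[str]:
    """Lineage: only how the data was derived (the sequence of steps)."""
    return [e.event_type for e in _chain(events, event_id)]


def provenance(events: Dict[int, Event], event_id: int) -> List[Tuple[str, Dict[str, str]]]:
    """Provenance: the same steps, plus the state of the data at each one."""
    return [(e.event_type, dict(e.attributes)) for e in _chain(events, event_id)]


# Sample data, invented for this example.
sample = {
    1: Event(1, "RECEIVE", None, {"filename": "a.txt"}),
    2: Event(2, "ATTRIBUTES_MODIFIED", 1, {"filename": "b.txt"}),
    3: Event(3, "SEND", 2, {"filename": "b.txt"}),
}

print(lineage(sample, 3))
print(provenance(sample, 3))
```

The two functions walk the same chain of events; the only difference is that `provenance` keeps the attribute snapshot recorded at each stage, which is the sense in which provenance is a superset of lineage.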

The default implementation of the Data Provenance Repository is the PersistentProvenanceRepository class. There is not yet any formal documentation of how it works, only notes that were jotted down while explaining it to someone else. These should eventually be formalized into a paper or at least a blog post. For now, the notes below describe how it works.

Design of Persistent Provenance Repository

Design Goals

Updating Repository

Recovering After Restart

Retrieving Events Sequentially

Expire Data