Goals
Provide the ability to replicate FlowFile content and attributes so that, if a node in a NiFi cluster is lost, another node or nodes can quickly and automatically resume processing of that data.
Background and strategic fit
It is becoming increasingly important for many organizations that data processing be both timely and highly available, even in the face of failures. The current design and implementation of the Content and FlowFile Repositories is such that if a NiFi node is lost, its data will not be processed until that node is brought back online. While this is acceptable for many use cases, there are others for which it is not.
While a RAID configuration can mitigate the risk of data loss by providing high-performance local replication, we have heard from several organizations that, in order to make use of NiFi, they need the system to automatically fail over to a new node if a node is lost.
Assumptions
Requirements
# | Title | User Story | Importance | Notes |
---|---|---|---|---|
1 | Replicate FlowFile Content | If a particular NiFi node is lost (due to machine failure, etc.), the data must be made available in such a way that other nodes in the cluster can retrieve it and process it themselves | Must Have | |
2 | Replicate FlowFile Attributes | If a particular NiFi node is lost (due to machine failure, etc.), the FlowFile information (attributes, current queue identifier, metadata, etc.) must be made available in such a way that other nodes in the cluster can retrieve the information and process the FlowFiles themselves | Must Have | See the replication sketch following this table |
3 | Ability to Failover | At any time, all nodes in a NiFi cluster must be able to begin processing data that was previously "owned" by another node. This must be orchestrated in such a way that the data is processed only once and that, if the original "owner" returns to the cluster, it does not re-process the data itself | Must Have | |
4 | Automatic Failover | In addition to the ability of a node to begin processing data from another node, this failover must happen quickly and in an automated fashion; it must not require human intervention. | Must Have | |
5 | Replicate "Swap Files" | If a NiFi queue reaches a certain (configurable) capacity, NiFi will "swap out" FlowFiles, serializing the FlowFile information to disk and removing it from the Java heap so that the JVM does not run out of memory. These swap files must also be replicated so that a node that resumes the work of a failed node can recover those FlowFiles without running out of heap space. | Must Have | |
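As a rough illustration of requirements 1 and 2 (not a proposed implementation), the sketch below shows one way a FlowFile's attributes and a reference to its content could be serialized into a compact record and pushed to replica nodes. The `FlowFileSnapshot` and `ReplicaTransport` types are hypothetical stand-ins for illustration only; they are not NiFi APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: serialize a FlowFile's attributes and content-claim
 * reference into a compact record that could be shipped to replica nodes.
 */
public class FlowFileRecordReplicator {

    /** Minimal stand-in for the FlowFile metadata that would be replicated. */
    public record FlowFileSnapshot(long id, String queueIdentifier,
                                   String contentClaim, Map<String, String> attributes) { }

    /** Hypothetical transport to a peer node (e.g. a socket or HTTP client). */
    public interface ReplicaTransport {
        void send(byte[] record) throws IOException;
    }

    private final List<ReplicaTransport> replicas;

    public FlowFileRecordReplicator(List<ReplicaTransport> replicas) {
        this.replicas = replicas;
    }

    /** Serialize the snapshot and push it to every configured replica. */
    public void replicate(FlowFileSnapshot snapshot) throws IOException {
        final byte[] record = serialize(snapshot);
        for (ReplicaTransport replica : replicas) {
            replica.send(record);  // a real design would acknowledge and retry sends
        }
    }

    private byte[] serialize(FlowFileSnapshot snapshot) throws IOException {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(snapshot.id());
            writeString(out, snapshot.queueIdentifier());
            writeString(out, snapshot.contentClaim());
            out.writeInt(snapshot.attributes().size());
            for (Map.Entry<String, String> attr : snapshot.attributes().entrySet()) {
                writeString(out, attr.getKey());
                writeString(out, attr.getValue());
            }
        }
        return bytes.toByteArray();
    }

    private void writeString(DataOutputStream out, String value) throws IOException {
        final byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }
}
```

In a real design, each send would need to be acknowledged (or written to a quorum) before the originating node considers the FlowFile durable, and the serialized form would need to be versioned so that nodes running different releases can still read each other's records.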
User interaction and design
Replicating FlowFile Content
...
There is also a separate Feature Proposal for State Management. If that is implemented first, it will greatly simplify the implementation of the ZooKeeper communication needed here.
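As a rough illustration of the kind of coordination involved (an assumption-laden sketch, not the proposed design), a node could attempt to claim recovery of a failed node's data by creating an ephemeral ZooKeeper node; the claim disappears automatically if the claiming node's own session is lost, allowing yet another node to take over. The `/nifi/failover-claims` path layout is hypothetical.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch of claiming ownership of a failed node's data via an
 * ephemeral ZooKeeper node. The znode path layout is assumed for illustration
 * (parent znodes are assumed to already exist).
 */
public class FailoverClaim {

    private final ZooKeeper zooKeeper;

    public FailoverClaim(String connectString) throws Exception {
        // A no-op Watcher is sufficient for this sketch.
        this.zooKeeper = new ZooKeeper(connectString, 30_000, (WatchedEvent event) -> { });
    }

    /**
     * Attempt to claim the data previously owned by the failed node.
     * Returns true if this node won the claim, false if another node did.
     * Because the claim is ephemeral, it is removed automatically if this
     * node's session dies, so another node can then take over.
     */
    public boolean tryClaim(String failedNodeId, String myNodeId) throws Exception {
        final String claimPath = "/nifi/failover-claims/" + failedNodeId;  // assumed path layout
        try {
            zooKeeper.create(claimPath,
                    myNodeId.getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);
            return true;   // this node owns recovery of the failed node's data
        } catch (KeeperException.NodeExistsException alreadyClaimed) {
            return false;  // another node claimed it first
        }
    }
}
```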
Questions
Below is a list of questions to be addressed as a result of this requirements document:
Question | Outcome |
---|---|