
New contributors to NiFi often have the same question: Where can I start?  When you look at a project like NiFi, one can imagine that there are quite a few moving parts, so understanding how all of these parts fit together can in itself become pretty daunting.


This documentation is intended to be a very high level description of the components that make up NiFi and how those components fit together. It is not intended to go into great detail about any of these components. There are other documents that exist (and more to come) that delve a bit deeper into the design of some of these individual components. The goal of this document is to help developers who are new to NiFi come up to speed on some of the terminology and understand how the different components that make up the platform interact with one another.


FlowFile

We will begin the discussion with the FlowFile. This is the abstraction that NiFi provides around a single piece of data. A FlowFile may represent structured data, such as a JSON or XML message, or may represent unstructured data, such as an image. A FlowFile is made up of two parts: content and attributes. The content is simply a stream of bytes, which can represent any type of data. The content itself is not stored within the FlowFile, however. Rather, the FlowFile simply stores a reference to the content, which lives within the Content Repository. Content is accessed only by means of the Process Session, which is capable of communicating with the Content Repository itself. The attributes are key-value pairs that are associated with the data. These attributes provide context along with the data, which allows the data to be efficiently routed and reasoned about without parsing the content.
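The two-part structure described above can be sketched in a few lines of Java. This is an illustrative toy, not NiFi's actual FlowFile API: the class name and the string-valued content claim are assumptions made for the example, but they show the key idea that attributes travel with the FlowFile while the content is only a reference into the Content Repository.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the FlowFile abstraction (not NiFi's actual API).
final class SimpleFlowFile {
    private final Map<String, String> attributes; // key-value context for routing
    private final String contentClaim;            // reference into the Content Repository

    SimpleFlowFile(Map<String, String> attributes, String contentClaim) {
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
        this.contentClaim = contentClaim;
    }

    Map<String, String> getAttributes() { return attributes; }
    String getContentClaim() { return contentClaim; }
}
```

Note that the sketch holds only a claim identifier, never the bytes themselves: routing decisions can be made from the attributes alone, without ever reading the content.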


Processor

This is the most commonly used component in NiFi and tends to be the easiest place for newcomers to jump in. A Processor is a component that is responsible for bringing data into the system, pushing data out to other systems, or performing some sort of enrichment, extraction, transformation, splitting, merging, or routing logic for a particular piece of data. Common Design Patterns for Processors are discussed in the Developer Guide.


The Processor is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new major release of NiFi.

Processor Node

The Processor Node is essentially a wrapper around a Processor and maintains state about the Processor itself. The Processor Node is responsible for maintaining, among other things, state about a Processor's positioning on the graph, the configured properties and settings of the Processor, its scheduled state, and the annotations that are used to describe the Processor.


By abstracting these things away from the Processor itself, we are able to ensure that the Processor is unable to change things that it should not, such as the configured values for properties, as allowing a Processor to change this information can lead to confusion. Additionally, it allows us to simplify the code required to create a Processor, as this state information is automatically managed by the framework.
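The separation described above can be illustrated with a small, hypothetical sketch (none of these class or method names come from NiFi itself): the node owns the mutable configuration, while the Processor is only ever handed a read-only view, so it physically cannot change its own configured values.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the wrapper idea: configuration is owned
// by the node; the Processor only receives an unmodifiable view.
final class SketchProcessorNode {
    private final Map<String, String> properties = new HashMap<>();

    // Mutation is available only to the framework, via the node.
    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    // The view handed to the Processor rejects all modification attempts.
    Map<String, String> viewForProcessor() {
        return Collections.unmodifiableMap(properties);
    }
}
```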

Reporting Task

A Reporting Task is a NiFi extension point that is capable of reporting and analyzing NiFi's internal metrics in order to provide the information to external resources or report status information as bulletins that appear directly in the NiFi User Interface. Unlike a Processor, a Reporting Task does not have access to individual FlowFiles. Rather, a Reporting Task has access to all Provenance Events, bulletins, and the metrics shown for components on the graph, such as FlowFiles In, Bytes Read, and Bytes Written.


The Reporting Task is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new major release of NiFi.

Controller Service

The Controller Service is a mechanism that allows state or resources to be shared across multiple components in the flow. The SSLContextService, for instance, allows a user to configure SSL information only once and then configure any number of resources to use that configuration. Other Controller Services are used to share configuration. For example, if a very large dataset needs to be loaded, it will generally make sense to use a Controller Service to load the dataset. This allows multiple Processors to make use of this dataset without having to load the dataset multiple times.

The Controller Service is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new major release of NiFi.

Process Session

The Process Session (often referred to simply as a "session") provides Processors access to FlowFiles and provides transactional behavior across the tasks that are performed by a Processor. The session provides get() methods for obtaining access to FlowFiles that are queued up for a Processor, methods to read from and write to the contents of a FlowFile, to add and remove FlowFiles from the flow, to add and remove attributes from a FlowFile, and to route a FlowFile to a particular relationship. Additionally, the session provides access to the ProvenanceReporter that is used by Processors to emit Provenance Events.

Once a Processor is finished performing its task, the Processor has the ability to either commit or roll back the session. If a Processor rolls back the session, the FlowFiles that were accessed during that session will all be reverted to their previous states. Any FlowFile that was added to the flow will be destroyed. Any FlowFile that was removed from the flow will be re-queued in the same queue that it was pulled from. Any FlowFile that was modified will have both its contents and attributes reverted to their previous values, and the FlowFiles will all be re-queued into the FlowFile Queue that they were pulled from. Additionally, any Provenance Events will be discarded.

If a Processor instead chooses to commit the session, the session is responsible for updating the FlowFile Repository and Provenance Repository with the relevant information. The session will then add the FlowFiles to the Processor's outbound queues (cloning as necessary, if the FlowFile was transferred to a relationship for which multiple connections have been established).
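The commit and rollback behavior described in the two paragraphs above can be illustrated with a toy session. This is a deliberately simplified sketch, not the real ProcessSession API: FlowFiles are just strings, and repositories and Provenance Events are omitted. What it preserves is the transactional contract: everything pulled during a session is either transferred to the outbound queue on commit or returned, in its original order, to the source queue on rollback.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration of session transactionality (not NiFi's ProcessSession).
final class ToySession {
    private final Deque<String> source;      // queue the session pulls from
    private final Deque<String> destination; // outbound queue
    private final Deque<String> inFlight = new ArrayDeque<>();

    ToySession(Deque<String> source, Deque<String> destination) {
        this.source = source;
        this.destination = destination;
    }

    String get() {
        String flowFile = source.poll();
        if (flowFile != null) inFlight.push(flowFile);
        return flowFile;
    }

    void commit() { // transfer everything pulled this session, in order
        while (!inFlight.isEmpty()) destination.add(inFlight.removeLast());
    }

    void rollback() { // re-queue pulled FlowFiles in their original order
        while (!inFlight.isEmpty()) source.addFirst(inFlight.pop());
    }
}
```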

Process Context

The Process Context provides a bridge between a Processor and its associated Processor Node. It provides information about the Processor's current configuration, as well as the ability to "yield," or signal to the framework that it is unable to perform any work for a short period of time so the framework should not waste resources scheduling the Processor to run. The Process Context also provides mechanisms for accessing the Controller Services that are available, so that Processors are able to take advantage of shared logic or shared resources.

FlowFile Repository

The FlowFile Repository is responsible for storing the FlowFiles' attributes and state, such as creation time and which FlowFile Queue the FlowFile belongs in. The default implementation is the WriteAheadFlowFileRepository, which persists the information to a write-ahead log that is periodically "checkpointed". This allows extremely high transaction rates, as the files that it writes to are "append-only," so the OutputStreams are able to be kept open. Periodically, the repository will checkpoint, meaning that it will begin writing to new write-ahead logs, write out the state of all FlowFiles at that point in time, and delete the old write-ahead logs. This prevents the write-ahead logs from growing indefinitely. For more information on the design and implementation of this repository, see the NiFi Write-Ahead Log Implementation page.
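The checkpointing scheme described above can be sketched in miniature. This in-memory toy stands in for the real on-disk implementation, and every name in it is an assumption made for illustration: updates are appended to a log, and a checkpoint folds the log into a snapshot so old log entries can be discarded while recovery remains possible at any point.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of write-ahead checkpointing (illustrative only).
final class ToyWriteAheadLog {
    private final Map<String, String> checkpointedState = new HashMap<>(); // snapshot
    private final List<String[]> log = new ArrayList<>();                  // appended deltas

    // An update is recorded as an append-only log entry: {flowFileId, queueId}.
    void update(String flowFileId, String queueId) {
        log.add(new String[] {flowFileId, queueId});
    }

    // Checkpoint: fold the log into the snapshot, then discard old entries.
    void checkpoint() {
        for (String[] entry : log) checkpointedState.put(entry[0], entry[1]);
        log.clear();
    }

    // Recovery: start from the snapshot and replay any newer log entries.
    Map<String, String> recover() {
        Map<String, String> state = new HashMap<>(checkpointedState);
        for (String[] entry : log) state.put(entry[0], entry[1]);
        return state;
    }

    int logSize() { return log.size(); }
}
```

Because writes are append-only, the log file (here a list) never needs to be rewritten between checkpoints, which is the property that enables the high transaction rates mentioned above.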


Note: While the FlowFile Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

Content Repository

The Content Repository is responsible for storing the content of FlowFiles and providing mechanisms for reading the contents of a FlowFile. This abstraction allows the contents of FlowFiles to be stored independently and efficiently based on the underlying storage mechanism. The default implementation is the FileSystemRepository, which persists all data to the underlying file system.

Note: While the Content Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.


Provenance Repository

The Provenance Repository is responsible for storing, retrieving, and querying all Data Provenance Events. Each time that a FlowFile is received, routed, cloned, forked, modified, sent, or dropped, a Provenance Event is generated that details this information. The event contains information about what the Event Type was, which FlowFile(s) were involved, the FlowFile's attributes at the time of the event, details about the event, and a pointer to the Content of the FlowFile before and after the event occurred (which allows a user to understand how that particular event modified the FlowFile).

The Provenance Repository allows this information to be stored about each FlowFile as it traverses through the system and provides a mechanism for assembling a "Lineage view" of a FlowFile, so that a graphical representation can be shown of exactly how the FlowFile was handled. In order to determine which lineages to view, the repository exposes a mechanism whereby a user is able to search the events and associated FlowFile attributes.

The default implementation is the PersistentProvenanceRepository. This repository stores all data immediately to a disk-backed write-ahead log and periodically "rolls over" the data, indexing and compressing it. The search capabilities are provided by an embedded Lucene engine. For more information on how this repository is designed and implemented, see the Persistent Provenance Repository Design page.

Note: While the Provenance Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

Process Scheduler

In order for a Processor or a Reporting Task to be invoked, it needs to be scheduled to do so. This responsibility belongs to the Process Scheduler. In addition to scheduling Processors and Reporting Tasks, the scheduler is also responsible for scheduling framework tasks to run at periodic intervals and maintaining the schedule state of each component, as well as the current number of active threads. The Process Scheduler is able to inspect a particular component in order to determine which Scheduling Strategy to use (Cron Driven, Timer Driven, or Event Driven), as well as the scheduling frequency.


FlowFile Queue

Though it sounds simple enough, the FlowFile Queue is responsible for implementing quite a bit of logic. In addition to queuing the FlowFiles for another component to pull from, the FlowFile Queue must also be able to prioritize the data following the user's prioritization rules. The queue keeps state about the number of FlowFiles as well as the data size of those FlowFiles and must keep state about the number of "in-flight" FlowFiles - those that have been pulled from the queue but have not yet been removed from the system or transferred to another queue.


When an instance of NiFi has a very large number of active FlowFiles, the attributes associated with those FlowFiles can be quite a burden on the JVM's heap. To alleviate this problem, the framework may choose to "swap out" some of the FlowFiles, writing the attributes to disk and removing them from the JVM's heap when a queue grows very large and later "swap in" those FlowFiles. During this process, the FlowFile Queue is also responsible for keeping track of the number of FlowFiles and size of the FlowFiles' contents so that accurate numbers can be reported to users.


Finally, the FlowFile Queue is also responsible for maintaining state about backpressure and FlowFile Expiration. Backpressure is the mechanism by which a user is able to configure the flow to temporarily stop scheduling a given component to run when its output queue is too full. By doing this, we are able to cause the flow to stop accepting incoming data for a short period, or to route data differently. This provides us the ability to prevent resource exhaustion. In a clustered environment, this also allows a particular node that is falling behind for one reason or another to avoid ingesting data so that other nodes in the cluster that are more capable can handle the workload.

FlowFile Expiration is the mechanism by which data is eventually purged from the system because it is no longer of value. It can be thought of as a flow's pressure release valve. This is used, for instance, in an environment where there is not enough bandwidth to send all of the data to its desired destination. In such a case, the node will eventually run out of resources. In order to avoid this, the user can configure a queue to expire data that reaches a certain age. For example, the user can indicate that data that is one hour old should automatically be purged. This capability is then coupled with the ability to prioritize the data so that the most important data is always sent first and the less important data eventually expires.
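The backpressure and expiration behaviors described above can be combined into one toy queue. This is an illustrative sketch, not NiFi's actual FlowFile Queue implementation; the object threshold and maximum-age parameters are hypothetical stand-ins for the real configuration. The queue reports itself "full" past the threshold (so the upstream component would no longer be scheduled) and purges aged-off entries rather than handing them downstream.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of backpressure and expiration in a FlowFile queue.
final class ToyFlowFileQueue {
    private record Entry(String id, long enqueuedAtMillis) {}

    private final Deque<Entry> queue = new ArrayDeque<>();
    private final int backpressureObjectThreshold; // hypothetical config value
    private final long maxAgeMillis;               // hypothetical config value

    ToyFlowFileQueue(int backpressureObjectThreshold, long maxAgeMillis) {
        this.backpressureObjectThreshold = backpressureObjectThreshold;
        this.maxAgeMillis = maxAgeMillis;
    }

    // When true, the framework would stop scheduling the upstream component.
    boolean isFull() {
        return queue.size() >= backpressureObjectThreshold;
    }

    void enqueue(String id, long nowMillis) {
        queue.add(new Entry(id, nowMillis));
    }

    // Expire aged-off data instead of handing it downstream.
    String poll(long nowMillis) {
        Entry entry;
        while ((entry = queue.poll()) != null) {
            if (nowMillis - entry.enqueuedAtMillis() <= maxAgeMillis) return entry.id();
        }
        return null;
    }
}
```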

FlowFile Prioritizer

A core tenet of NiFi is that of data prioritization. The user should have the ability to prioritize the data in whatever order makes sense for a particular dataflow at a particular point in time. This is especially important for time-sensitive data and becomes even more important when processing data in an environment where the rate of data acquisition exceeds the rate at which the data can be egressed. In such an environment, it is important to be able to egress the data in such a way that the most important data is sent first, leaving the less important data to be sent when the bandwidth allows for it, or eventually be aged off.


This tenet is realized through the use of FlowFile Prioritizers. The FlowFile Queue is responsible for ensuring the data is ordered in the way that the user has chosen. This is accomplished by applying FlowFile Prioritizers to the queue. A FlowFile Prioritizer has access to all FlowFile information but does not have access to the data that it points to. The Prioritizer can then compare two FlowFiles in order to determine which should be made available first.
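Because a Prioritizer only ever compares two FlowFiles, it behaves much like a Java Comparator over FlowFile metadata. The sketch below is a hypothetical example (the "priority" attribute is an assumption made for illustration, not a NiFi convention) that orders FlowFiles by an attribute value without ever touching the content they point to.

```java
import java.util.Comparator;
import java.util.Map;

// Hypothetical prioritizer: compares two FlowFiles (represented here as
// attribute maps) by a numeric "priority" attribute, higher values first.
// It never reads the FlowFile content, only the metadata.
final class AttributePrioritizer implements Comparator<Map<String, String>> {
    @Override
    public int compare(Map<String, String> a, Map<String, String> b) {
        int pa = Integer.parseInt(a.getOrDefault("priority", "0"));
        int pb = Integer.parseInt(b.getOrDefault("priority", "0"));
        return Integer.compare(pb, pa); // negative when a should go first
    }
}
```

A queue applying this prioritizer would always offer the FlowFile with the highest "priority" attribute to the next component first.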

The Prioritizer is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new major release of NiFi.

Flow Controller

In order for NiFi's User Interface to display the wealth of information that it renders, it must have some place to gather that information. The Flow Controller can be thought of as the bridge between the Web Tier and the back end components of a NiFi node. It is largely responsible for the initialization of the flow, the creation of components in the flow, and the coordination between those components. In a more general sense, the Flow Controller is responsible for maintaining system-wide state. This includes the state of the flow itself; the Flow Controller is also responsible for coordinating a node's connection to and participation in a NiFi cluster.

Cluster Manager

Whereas the Flow Controller is responsible for maintaining system-wide state about a particular node, the Cluster Manager is responsible for maintaining system-wide state about an entire cluster. This includes information such as which nodes are in the cluster and the states of those nodes. Additionally, the Cluster Manager is responsible for acting as the bridge between the Web Tier and the cluster state. This includes replicating requests from the user to all nodes in the cluster and merging their responses.

An instance of NiFi will create either a Cluster Manager or a Flow Controller but never both. Which component is created is driven by the instance's configuration file.


Authority Provider

NiFi can be configured to run in secure mode, using SSL to access the web endpoints, or to run in non-secure mode, where all endpoints are accessed anonymously. If running in secure mode, the Authority Provider is responsible for determining which users have access to perform which operations. It accomplishes this by providing a list of Authorities, or Roles, that a given user is allowed to possess. Note, it does not indicate which Roles a user actually has, only which Roles a user is allowed to have. A NiFi user who has the ADMIN Role is then able to grant each of those Roles to the given user. In this way, a central authority service can be used to dictate which Roles a given user is allowed to have, but just because a user is allowed to be granted the ADMIN role, for example, does not mean that the user should have the ADMIN role for a particular instance.

Each endpoint that is defined must also define which Roles are authorized to access the endpoint. A user is then able to access an endpoint if and only if the user has one of the specified Roles.
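That access rule is simple enough to state directly in code. The sketch below is a hypothetical illustration, not NiFi's Authority Provider API: a user may access an endpoint if and only if the user's Roles intersect the Roles the endpoint authorizes.

```java
import java.util.Set;

// Hypothetical illustration of the endpoint access rule described above:
// access is granted iff the user holds at least one of the endpoint's Roles.
final class EndpointAccess {
    static boolean canAccess(Set<String> userRoles, Set<String> endpointRoles) {
        for (String role : userRoles) {
            if (endpointRoles.contains(role)) return true;
        }
        return false;
    }
}
```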


The Authority Provider is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new major release of NiFi.


NiFi's User Interface shows only information that is available via the NiFi RESTful API. This is accomplished by accessing the different endpoints that are defined in the *-Resource classes. For example, the ProcessorResource class is responsible for defining the endpoints that are used to interact with the different Processors in the flow, including the addition and removal of Processors. The ProvenanceResource class is responsible for defining the endpoints that are used to interact with the Provenance Repository, such as executing queries.


All of the Resource components are found in the nifi-web-api module, in the org.apache.nifi.web.api package. 


In order for an organization to be able to depend on NiFi to handle their dataflows in an automated fashion, the organization needs to be able to rely on NiFi to handle unexpected conditions and to be able to start up and shut down in an easy-to-use, consistent manner. The Bootstrap module is responsible for achieving these goals. The Bootstrap allows the user to easily configure how the NiFi JVM will be started by relying on the configuration provided in its configuration file. This allows the application to easily be started with Remote Debugging enabled, with specified minimum and maximum heap sizes, and with other important JVM options.

Once the NiFi application has been started, the bootstrap is then responsible for monitoring NiFi and restarting the service as needed in order to ensure that the application continues to provide reliable dataflow.


In a containerized environment like NiFi, it is important to allow different extension points to have arbitrary dependencies without those dependencies affecting other, unrelated extension points. In Java, the mechanism for doing this is the ClassLoader. NiFi defines its own implementation of the ClassLoader, the NarClassLoader. Each NiFi Archive (NAR) has its own NarClassLoader that is responsible for loading the classes defined in that NAR. Additionally, it is responsible for providing native library isolation, so that a Processor, for example, can depend on native libraries that are not shared with other components. This is accomplished by placing those native libraries in the NAR's native directory.

For more information about NiFi Archives and how they are used, see the NiFi Archives (NARs) section in the Developer Guide.