Note: This is a living document and will evolve as the community guides the development of the effort. The initial content is based heavily on the Java agent; the scope will later grow to include other agents.
Given the commonalities between the MiNiFi effort and that of NiFi, a similar structure will be provided for the Java version, as a Maven project, comprised of:
Guiding Points of Design
- Installation, provisioning, and establishment of agents
- Upgrading agents, inclusive of both functionality in terms of flow/processing, and the agent binary and associated libraries
- Tolerance for asynchronous and optimistic upgrades
- Realization of agent capabilities
- Possible taxonomy of agents and functionalities that may be driven by available software and hardware
- Mediating different agent versions
- Observation of status
- Provenance generation and transmission to some endpoint store
- Considerations for supporting replay and buffering of data
- Prioritization of Data with Back Pressure and Pressure Release
- Supported Operating Systems and the requirements levied:
- Linux, and
- Networking and communication protocols
- Traversal of non-direct network hops and relay functionality
- Manager provides a consistent user experience for agent groups as it does for processors
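One way to picture the "Prioritization of Data with Back Pressure and Pressure Release" point above is a bounded, priority-ordered buffer. The following is a minimal sketch under stated assumptions, not the agent's actual queueing design: the names `PrioritizedBuffer`, `Item`, and `releasePressure` are hypothetical, "back pressure" is modeled as refusing new items when the buffer is full, and "pressure release" is modeled as evicting the lowest-priority item to make room for a more important one.

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.PriorityQueue;

/** Hypothetical bounded buffer; names are illustrative and not MiNiFi API. */
final class PrioritizedBuffer {
    /** A queued item; higher priority values are handed out first. */
    record Item(String payload, int priority) {}

    private static final Comparator<Item> BY_PRIORITY =
            Comparator.comparingInt(Item::priority);

    private final int capacity;
    private final boolean releasePressure;
    // Reversed ordering so the head of the queue is the highest-priority item.
    private final PriorityQueue<Item> queue = new PriorityQueue<>(BY_PRIORITY.reversed());

    PrioritizedBuffer(int capacity, boolean releasePressure) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.releasePressure = releasePressure;
    }

    /** Returns false when the item was refused; the producer should slow down or retry. */
    synchronized boolean offer(Item item) {
        if (queue.size() < capacity) {
            queue.add(item);
            return true;
        }
        if (!releasePressure) {
            return false; // back pressure: the buffer is full and nothing is dropped
        }
        // Pressure release: evict the lowest-priority item, but only for a more important one.
        Item lowest = queue.stream().min(BY_PRIORITY).orElseThrow();
        if (lowest.priority() >= item.priority()) {
            return false;
        }
        queue.remove(lowest);
        queue.add(item);
        return true;
    }

    /** Removes and returns the highest-priority item, if any. */
    synchronized Optional<Item> poll() {
        return Optional.ofNullable(queue.poll());
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With `releasePressure` set to `false`, a full buffer only pushes back on the producer. With `true`, it trades the least important buffered data for newer, more important data, which may suit agents with constrained storage.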
Guidelines for MiNiFi C++
The following guidelines may be useful for MiNiFi C++ development and design. They stem from the environments that are potential targets, from typical C++ development practices, and from design goals that lend themselves to maximizing development efforts.
- Minimize Memory footprint and management of memory within modules.
- Limit data and access patterns that are deemed risky
- Limit failure with testing at all stages; however, when failure occurs we should aim for recoverability.
- We should follow the open/closed principle to avoid churn and support changes as they occur.
- Try to be a good cog – Problems will occur on devices and thus we can’t trust anything, even a malloc
- Executable that is command line driven
- Installable as a service
- Establishes a two process mechanism similar to that of existing NiFi:
- Bootstrap Process: controls the instantiation and execution of the flow process and aids in receiving configuration changes (products of design and deploy approach)
- Flow Process: handles the actual collection and transmission of data
- Makes use of a configured state to drive the process of starting a flow; this should be extensible to allow various implementations of inputs
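The two-process arrangement described above can be sketched roughly as follows. This is an illustration under assumptions rather than the actual bootstrap implementation: the class name `Bootstrap`, the idea of passing the configuration file as the single program argument, and the ten-second shutdown grace period are all hypothetical. Only standard `ProcessBuilder` and `Process` calls are used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical bootstrap that starts, stops, and restarts a child flow process. */
final class Bootstrap {
    private final String flowMainClass; // assumed entry point of the flow process
    private final Path configFile;      // assumed configured state handed to the flow
    private Process flowProcess;

    Bootstrap(String flowMainClass, Path configFile) {
        this.flowMainClass = flowMainClass;
        this.configFile = configFile;
    }

    /** Builds the command line for the child JVM that runs the flow. */
    static List<String> flowCommand(String javaHome, String classpath,
                                    String mainClass, Path config) {
        String javaBin = Path.of(javaHome, "bin", "java").toString();
        return List.of(javaBin, "-cp", classpath, mainClass, config.toString());
    }

    /** Launches the flow process using the current JVM's home and classpath. */
    synchronized void start() throws IOException {
        List<String> command = flowCommand(
                System.getProperty("java.home"),
                System.getProperty("java.class.path"),
                flowMainClass,
                configFile);
        flowProcess = new ProcessBuilder(command).inheritIO().start();
    }

    /** Asks the flow process to exit, forcing termination after a grace period. */
    synchronized void stop() throws InterruptedException {
        if (flowProcess != null && flowProcess.isAlive()) {
            flowProcess.destroy();
            if (!flowProcess.waitFor(10, TimeUnit.SECONDS)) {
                flowProcess.destroyForcibly();
            }
        }
    }

    /** Applies a configuration change by restarting the flow process. */
    synchronized void restart() throws IOException, InterruptedException {
        stop();
        start();
    }
}
```

Restarting the child on every change is the simplest policy. A real bootstrap might instead validate the new configuration first and fall back to the previous one if the flow fails to start.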
Agents will have a defined taxonomy and capabilities associated with them. These properties will help an agent communicate what is possible and will aid flow designers in creating flows for various agent classes. Said capabilities will be communicated to a manager so that it understands what is possible with various agents. Capabilities and capacities may change over time, and this information will be continually registered with associated systems.
Longer term, agents should be able to convey their capabilities as a result of items such as environment, version of software, networking, and hardware for establishing configuration of flow and collected data from a manager perspective.
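As a rough illustration of an agent conveying its capabilities, the sketch below assembles a small report from the software version, the runtime environment, and the hardware. The class name `CapabilityReport`, the key names, and the capability strings are assumptions made for this example; the document does not define a report format or how it is transmitted to a manager.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical capability report; keys and format are illustrative only. */
final class CapabilityReport {
    private CapabilityReport() {}

    /**
     * Describes the agent using its declared capabilities plus details
     * read from the running JVM (operating system, Java version, CPU count).
     */
    static Map<String, String> describe(String agentVersion, Set<String> capabilities) {
        Map<String, String> report = new TreeMap<>();
        report.put("agent.version", agentVersion);
        report.put("os.name", System.getProperty("os.name"));
        report.put("os.arch", System.getProperty("os.arch"));
        report.put("java.version", System.getProperty("java.version"));
        report.put("cpu.count",
                Integer.toString(Runtime.getRuntime().availableProcessors()));
        // Sorted so that the same capability set always yields the same string.
        report.put("capabilities", String.join(",", new TreeSet<>(capabilities)));
        return report;
    }
}
```

Because capabilities and capacities may change over time, an agent could regenerate such a report periodically and re-register it instead of sending it once at startup.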
Configuration - Bootstrap Agent Executable
Primarily handles the bootstrapping of the process and the configuration of the JVM which is monitoring and controlling the flow process. This will receive configuration changes and affect the associated flow process to provide these updates.
- ConfigurationChangeListener - Provides the handling of updates to the agent from an external source
- In the simplest case, this would be evaluating changes to a configuration file
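For the simplest case mentioned above, a file-based check might look like the following sketch. The single-method shape of `ConfigurationChangeListener` shown here is an assumption, since this document does not give the interface's signature, and `FileChangeChecker` is a hypothetical helper. Changes are detected by comparing a SHA-256 digest of the file contents, so rewriting the file with identical content does not trigger an update.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Assumed callback shape; the real interface may differ. */
interface ConfigurationChangeListener {
    void onChange(byte[] newConfiguration);
}

/** Hypothetical helper that notifies a listener when a configuration file changes. */
final class FileChangeChecker {
    private final Path configFile;
    private final ConfigurationChangeListener listener;
    private byte[] lastDigest;

    FileChangeChecker(Path configFile, ConfigurationChangeListener listener)
            throws IOException {
        this.configFile = configFile;
        this.listener = listener;
        // Remember the starting content so only later edits count as changes.
        this.lastDigest = digest(Files.readAllBytes(configFile));
    }

    /** Returns true when the content changed and the listener was notified. */
    boolean check() throws IOException {
        byte[] content = Files.readAllBytes(configFile);
        byte[] current = digest(content);
        if (Arrays.equals(current, lastDigest)) {
            return false;
        }
        lastDigest = current;
        listener.onChange(content);
        return true;
    }

    private static byte[] digest(byte[] content) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would invoke `check()` on a schedule, for example from a `ScheduledExecutorService`. Other sources of change, such as a remote endpoint, could feed the same listener.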
Configuration - Processing Flow
- Design and Deploy driven where the associated flow is provided via the bootstrap process
The FlowFile format has been the core serialization format of NiFi. It provides a structure that eases the movement of files through a given flow and exploits pass-by-reference semantics in routing operations. Of interest is how information is handled within the core FlowFile format as metadata is transmitted from the agent to a receiving node/system. This may be out of band or as an augmentation to the FlowFile format.
Provides a means for introducing data into the system and currently maps data to existing processors in the system. Given the desire to make use of existing libraries and functionality when developing the initial agent offering, the focus will be on the core use cases, mapping to existing processors. These comprise:
- Files (Tail, Get)
- Logs (Listen Syslog, UDP)
Egress is viewed as high-level terminology for getting data from an agent to an associated system. The complexity and needs of this functionality may vary across environments and may require complex networking schemes.
Communication and Protocols
For the existing proof of concept and for establishing an agent to make larger architectural decisions, the Java agent can make use of the existing Site to Site protocol and functionality to communicate with an endpoint system.
A lightweight process, capable of being constructed for acquiring information from one or more host systems and providing this information to another system for consumption. This process provides provenance, a directed graph of processing, and extensibility to map to various data formats, schemas, and protocols.
Functionality that a given agent is able to perform. In some contexts, this may be communicating with specific devices, handling a certain nature or complexity of data, compute power, or serving specific roles in the data ingress/egress process from generation to consumer.
An aggregation of one or more capabilities that allows specific agents to carry out a given processing graph. For example, a high-level view of a "File Forwarder" class would require the capability to both interact with the file system to get files and additionally have one or more egress methods to return information to a desired consumer.
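The "File Forwarder" example can be expressed as a small check of whether an agent's capabilities satisfy a class. This is a sketch under assumptions: the record name `AgentClass`, its fields, and the capability strings are hypothetical. A class is treated as satisfied when the agent has every required ingress capability and at least one of the acceptable egress methods.

```java
import java.util.Set;

/** Hypothetical agent class: required ingress capabilities plus acceptable egress methods. */
record AgentClass(String name, Set<String> requiredIngress, Set<String> egressOptions) {

    /** True when the agent can ingest as required and has at least one egress method. */
    boolean isSatisfiedBy(Set<String> agentCapabilities) {
        boolean canIngest = agentCapabilities.containsAll(requiredIngress);
        boolean canEgress = egressOptions.stream().anyMatch(agentCapabilities::contains);
        return canIngest && canEgress;
    }
}
```

For instance, `new AgentClass("File Forwarder", Set.of("file-system"), Set.of("site-to-site", "http"))` would accept an agent reporting `file-system` and `http`, and reject one that reports only `file-system`.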
A generic term for providing information from an agent to one or more consumers. In simplest form, this is a direct line through networking to send data to a desired target. In more complex environments, this may require an n-hop network relying on several other agents to relay the data throughout the network traversal.