Goals
- Provide a mechanism for Processor developers to create Processors that are event-driven in such a way that the framework is able to more efficiently trigger the Processor to run
Background and strategic fit
Currently, Processors are scheduled to be triggered based on a timer or CRON schedule. There is a mechanism for scheduling processors to run via an Event-Driven schedule. However, this scheduling strategy has been 'experimental' since it was developed. It is known to be slower and less efficient than the Timer-Driven strategy. This is due largely to the complex data structure that is used to keep track of how many 'tasks' can/need to be performed for each Processor, how many threads can be used, etc.
This proposal offers a new design, which is to create a new first-class component: EventProcessor. This would differ from the existing Processor in that rather than having the following method signature:
void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory) throws ProcessException
the new EventProcessor would have a slightly different method signature:
void onTrigger(ProcessContext context, ProcessSession session, FlowFile flowFile) throws ProcessException
This offers a slight benefit to Processor developers in that there is no need to call ProcessSession.get() and check if the FlowFile is null. Instead, the Processor would simply be handed the FlowFile and the ProcessSession to use. However, this would allow the framework to do some things much more efficiently than it currently is capable of doing.
When using the TImer-Driven scheduling strategy with a run schedule of 0 seconds, the framework has to make a decision about how often to call the onTrigger method. While there is data to process and there is no backpressure applied, it calls onTrigger every 30 microseconds by default. However, if back pressure is applied or there is no incoming data, it will wait 10 milliseconds (by default). This is bad for two reasons. First, it introduces 'artificial delay' in these cases. It should be able trigger the Processor as often and as quickly necessary but not more than that. Secondly, it results in a lot of resource usage to 'check if there is work to do'. Using an Event-based approach can allow us far better performance.
Looking more toward the future, this also opens up more possibilities for things like "session migration" or "branch prediction" where the framework could determine that since FlowFile A was transferred to Relationship 1, if the next Processor is Side-Effect Free, the framework could allow the next processor to become processing the FlowFile even before the session is committed. Additionally, once a session is committed, under certain conditions, we can pass the session along from one Processor to another (migrating the session). This allows multiple Processors to perform their action without committing the session.
Assumptions
Requirements
# | Title | User Story | Importance | Notes |
---|---|---|---|---|
1 | Introduce new EventProcessor interface | MUST | ||
2 | Refactor existing AbstractSessionFactoryProcessor so that all of the common methods are implemented at a higher level | MUST | ||
3 | Introduce an AbstractEventProcessor class | MUST | ||
4 | Update UI to remove the Scheduling Strategy for Processors of this type | MUST | ||
5 | Update Developer Guide to document these changes | MUST | ||
6 | Refactor existing Processors to use this pattern where applicable | SHOULD |
User interaction and design
Difference to the user interaction should be minimal. The UI should no longer show a Scheduling Strategy for processors of this type.
This would not allow Processors such as MergeContent to use this interface, as MergeContent takes in a batch of FlowFiles at a time and also also makes use of the @TriggerWhenEmpty annotation. However, most other non-source Processors could make use of this pattern.
We should think through the notion of allowing such a Processor to register to receive non-FlowFile events. For instance, a Processor could receive a Provenance Event instead of a FlowFile, or a notification that a directory changed. We would require a discreet set of events that the Framework allows Processors to register for. We could also potentially allow Processors to trigger an event. This would greatly ease the development of Processors that implement patterns such as hold/release. In this case, instead of being handed a FlowFile in onTrigger, we would change it to an Event. One type of Event would be a FlowFileEvent, from which the developer could call getFlowFile().
Questions
Below is a list of questions to be addressed as a result of this requirements document:
Question | Outcome |
---|---|
Not Doing