Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Target release 
Epic 
Document statusDRAFT
Document owner

Mark Payne

Designer 
Developers 
QA 

 

Goals

  • Provide a mechanism for Processor developers to create Processors that are event-driven in such a way that the framework is able to more efficiently trigger the Processor to run

Background and strategic fit

Currently, Processors are scheduled to be triggered based on a timer or CRON schedule. There is a mechanism for scheduling processors to run via an Event-Driven schedule. However, this scheduling strategy has been 'experimental' since it was developed. It is known to be slower and less efficient than the Timer-Driven strategy. This is due largely to the complex data structure that is used to keep track of how many 'tasks' can/need to be performed for each Processor, how many threads can be used, etc.

 

This proposal offers a new design, which is to create a new first-class component: EventProcessor. This would differ from the existing Processor in that rather than having the following method signature:

void onTrigger(ProcessSessionFactoryProcessContext sessionFactorycontext, ProcessContextProcessSessionFactory contextsessionFactory) throws ProcessException

the new EventProcessor would have a slightly different method signature:

void onTrigger(ProcessContext context, ProcessSession session, FlowFile flowFile, ProcessContext context) throws ProcessException

 

This offers a slight benefit to Processor developers in that there is no need to call ProcessSession.get() and check if the FlowFile is null. Instead, the Processor would simply be handed the FlowFile and the ProcessSession to use. However, this would allow the framework to do some things much more efficiently than it currently is capable of doing.

When using the TImer-Driven scheduling strategy with a run schedule of 0 seconds, the framework has to make a decision about how often to call the onTrigger method. While there is data to process and there is no backpressure applied, it calls onTrigger every 30 microseconds by default. However, if back pressure is applied or there is no incoming data, it will wait 10 milliseconds (by default). This is bad for two reasons. First, it introduces 'artificial delay' in these cases. It should be able trigger the Processor as often and as quickly necessary but not more than that. Secondly, it results in a lot of resource usage to 'check if there is work to do'. Using an Event-based approach can allow us far better performance.

Looking more toward the future, this also opens up more possibilities for things like "session migration" or "branch prediction" where the framework could determine that since FlowFile A was transferred to Relationship 1, if the next Processor is Side-Effect Free, the framework could allow the next processor to become processing the FlowFile even before the session is committed. Additionally, once a session is committed, under certain conditions, we can pass the session along from one Processor to another (migrating the session). This allows multiple Processors to perform their action without committing the session. 

 

Assumptions

Requirements

#
Title
User Story
Importance
Notes
1Introduce new EventProcessor interface MUST 
2

Refactor existing AbstractSessionFactoryProcessor so that all of the common methods are implemented at a higher level

 MUST 
3Introduce an AbstractEventProcessor class MUST 
4Update UI to remove the Scheduling Strategy for Processors of this type MUST 
5Update Developer Guide to document these changes MUST 
6Refactor existing Processors to use this pattern where applicable SHOULD 

User interaction and design

Difference to the user interaction should be minimal. The UI should no longer show a Scheduling Strategy for processors of this type.

This would not allow Processors such as MergeContent to use this interface, as MergeContent takes in a batch of FlowFiles at a time and also also makes use of the @TriggerWhenEmpty annotation. However, most other non-source Processors could make use of this pattern.

We should think through the notion of allowing such a Processor to register to receive non-FlowFile events. For instance, a Processor could receive a Provenance Event instead of a FlowFile, or a notification that a directory changed. We would require a discreet set of events that the Framework allows Processors to register for. We could also potentially allow Processors to trigger an event. This would greatly ease the development of Processors that implement patterns such as hold/release. In this case, instead of being handed a FlowFile in onTrigger, we would change it to an Event. One type of Event would be a FlowFileEvent, from which the developer could call getFlowFile().

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question
Outcome
  

Not Doing

...