Introduction

This document outlines current approach used by NiFi to document components such as Processors, ControllerServices etc. as well as proposes improvements. Its purpose is to facilitate the discussion within NiFi community with hopes of finding a more sustainable and clean solution that would benefit both the developers and users alike.

Problem

Current mechanism to document NiFi components such as  Processors, ControllerServices etc., is API-based. A typical pattern that a developer of a component follows is to define documentable elements (PropertyDescriptor(s), Relationship(s) etc) as static final variables of a component and then provide implementation of instance methods to access their values. Such methods are:

Set<Relationship> getRelationships();
List<PropertyDescriptor> getSupportedPropertyDescriptors().

There are couple of issue with such mechanism:

  1. Usage of instance methods to essentially access static fields. Not a big issue and for some may not be an issue at all, but worth pointing out as it may be confusing to others.
  2. Repetitive code in getRelationships() and getSupportedPropertyDescriptors()
  3. Because instance methods are used to access documentable elements, an instance of the component must be created. This fact presents a number of issues:
    1. Security. As NiFi evolves it is conceivable to assume that at some point one of the security requirements could be that an instance of a component may only be created by an authorized principal (e.g., run as)
    2. Questionable code in default constructor. Unrelated (un-used) and poorly implemented component may render the entire system unavailable by having "questionable code" inside its constructor. And while "questionable code" can also be present in static initializer and cause the same issues during class loading, it's a less common pattern and falls in the realm of general coding practices (outside of NiFi).
    3. Throwaway instances. In fact a running NiFi without a flow will create and throw away 2 instances of each discovered component, regardless of how many will actually be used. And if component is already in the flow 5 instance of it will be created (see  NIFI-1318 - Getting issue details... STATUS )
    4. Leaky abstractions (of sort):
      1. developer now must be aware that they should not implement default constructor or any constructor for that matter.
      2. if they do they must be aware of any potential side-effects of creating multiple instances of a component, since component is naturally singleton in the process space. 
  4. Reliance on the ServiceLoader to discover components.
    1. Current extension model of NiFI assumes local availability of extensions (e.g., some local dir such as 'work') where they can be discovered. ServiceLoader helps with such discovery. But one of the side-effects of the ServiceLoader is a subsequent creation of a component's instance, regardless if the actual instance needs to be created. 

    2. As NiFi moves toward the concept of an ExtensionRegistry the assumptions of having components available locally is no longer valid, since implementation of ExtensionRegistry may not be based on locally available components. 

  5. Generation and re-generation of all documents during NiFi startup
    1. While current model works for a current state/size of NiFi, it is unsustainable for both NiFi distribution and NiFi documentation when one deals with thousands of components, versions etc. 
    2. In the world of multiple instances of NiFi each instance essentially has to generate its own documentation which is identical to that of another instance. 

Possible improvements

  1. To embrace convention over configuration approach which if addressed properly would greatly simplify user(dev) experience and would help to address the following:
    1. Confusion with using instance methods to access static fields. User would no longer have to implement those methods and they can be deprecated.
    2. Repetitive code in getRelationships() and getSupportedPropertyDescriptors().
      1. Analyzing many processors it becomes clear that the code in these methods is very repetitive as seen in example below. A very typical pattern is to assemble Collection in init() method or default constructor, leaving getRelationships() and getSupportedPropertyDescriptors() to simply return the instance variable representing such collections: 

        @Override
        protected void init(final ProcessorInitializationContext context) {
            final List<PropertyDescriptor> descriptors = new ArrayList<>();
            descriptors.add(FILE_SIZE);
            descriptors.add(BATCH_SIZE);
            descriptors.add(DATA_FORMAT);
            descriptors.add(UNIQUE_FLOWFILES);
            this.descriptors = Collections.unmodifiableList(descriptors);
            final Set<Relationship> relationships = new HashSet<>();
            relationships.add(SUCCESS);
            this.relationships = Collections.unmodifiableSet(relationships);
        }
        @Override
        protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
            return descriptors;
        }
        
        @Override
        public Set<Relationship> getRelationships() {
            return relationships;
        }
      2. Repetitive code could be easily handled by Reflection-based mechanisms already used by NiFi and many other projects and products. An example of one such way could be seen here (NIFI-1384 PR). Further more, for enhanced control over which element are meant to be documented, a new @Documentable annotation could be introduced. This would remove a burden from the user to implement these methods making component development much simpler.
      3. The aforementioned methods could be deprecated and eventually removed.
  2. While we can recommend best practices, developers could feel free to implement their component as they wish. For example, implement default constructor with framework assurance that if they do something "questionable" in it, it will have no impact on the overall system until such component is introduced in the process space, at which point it's an expression of intent to use.

  3. Have documentation generated as part of the build process. There are couple of benefits to that:
    1. NiFi instance would not have to generate/re-generate documentation every time it starts
    2. Documentation would essentially be re-used by multiple instances since it would exist in the NAR.
    3. Within the concept of ExtensionRegistry NiFi user will be able to access documentation before a component is introduced to the process space.
      1. such documentation would be available outside of NiFi instance and/or UI (e.g., conventional browser)
      2. there would be no pollution and stagnation of documentation artifacts that are not used in the flow. In other words access pattern to documentation would be "get-from-local-if-available-otherwise-call-remote".

 

  • No labels

10 Comments

  1. I'm a +1 for the idea of generating docs at build time. Though I think this was looked at a while back and became problematic because the maven plugin that would generate the docs must depend on nifi-api, and the nifi-api module is separated from the maven plugins repo. Not insurmountable but might get a bit tricky.

    I'm a -1 on the idea of using reflection to take the place of methods that I think belong as part of the API. Changing to using reflection here I feel results in an unclean mechanism that will be confusing for developers not used to the framework and even for those who are used to the framework. There are also times when we want to define Property Descriptors that may or may not be returned the list of Property Descriptors. This is especially true when creating an Abstract class and then having multiple concrete implementations of that class - some of the concrete implementations may choose to use one of the Property Descriptors while others may not. We may also have Property Descriptors that we don't want to declare as static. But my biggest concern here, though, is that looking at the Processor it is not clear how the properties that are defined by a processor make their way to the user. With a method like getSupportedPropertyDescriptors() it is very clear that I am returning this List of Property Descriptors. If reflection is used, it becomes a bit too magical.

    Re: #2 - I don't personally feel a new to provide assurance that if a developer does something "questionable" in the default constructor that it will have no impact on the overall system. Developers are provided lifecycle management facilities already. Developers should adhere to those. Since there's no way to avoid preventing "questionable" code to be run n the static initialization, I don't think we want to go out of our way to prevent it from happening in the constructor.

  2. to #1, I don't personally care too much about the implementation, but I would support some mechanism to remove to boiler plate code you mentioned, and standardize how property lists and relationship lists are constructed.  Some classes do it in init(), some do it in the constructor.

     

    to #2 no comment here.

     

    to #3 I am a +1 to generating docs at build time, as this should also make it MUCH easier to keep the documentation on our website in sync with each build(because the documentation artifacts are ready right after a build).  As Mark Payne said, there of the dependency issues that make this a bit problematic.  I took a stab at a proof-of-concept of this in (https://github.com/apache/nifi/commits/NIFI-725-master) and it didn't prove to be as difficult as I originally imagined.  Feel free to look at / use that code if you want to go down this approach.

    Going back through my emails and notes, there is a ticket discussing this specific part ( NIFI-725 - Getting issue details... STATUS ), again feel free to use that as a guide, or don't.  I haven't had time to work on this in months, but I recall having something mostly working when I left off.  So I would back you picking this up or starting over, either way I still see there being benefit it someone doing this.

     

     

  3. For the noted issues:

    Item #1: Could you more fully articulate the concern?  It was noted that some won't see a problem at all but that some might.  For those some that might what is the issue they will see and how does that affect their ability to understand or utilize the API or its ability to be efficient at runtime?  In the case of 'getRelationships' that is a method of the Processor interface and it is not a front for static values.  Relationships could change depending on the properties of the processor.  In the case of 'getSupportedPropertyDescriptors' that is from the AbstractConfigurableComponent and is designed to be strictly adherent to 'open/closed' principle.

    Item #2: Agreed

    Item #3/Security: It is hard to understand how this relates to NiFi in any form discussed thus far and such a model as suggested sounds like a pretty substantive departure from the current articulated goals for NiFi.  Should we need different components operating in different process spaces (ie run-as) then we're talking a rather different concept altogether.  I suggest we remove this unless we can more directly link the concern to NiFi as we know it or have proposed it.

    Item #3/Questionable code in...: The idea of questionable code is present in the entirety of the system.  The bottom line is we are delegating control to extension-code and when that happens everything is questionable.  Why is having it in the component constructor worse than anywhere else?

    Item #3/Throwaway instances: Does this create an actual issue or is it simply that we could do it with N-M instances and it would be better?

    Item #3/Leaky abstraction: The framework publishes/describes the supported method of component initialization.  Specifically we adhere to the ServiceLoader API.  We are clearly advising on component initialization and lifecycle.  What is being leaked?  The abstraction is a NiFi processor/component with well defined initialization and lifecycle management.

    Item #4/Reliance on ServiceLoader:  Another way of phrasing this is 'Leveraging a well defined mechanism of Java known as ServiceLoader'.  When written that way it sounds less like a liability and more like leveraging a standard feature.  I do not see how the default no args instance it creates results in any problem.  I also don't see how one conceived implementation for the registry that is challenged by the current approach warrants throwing the approach out.  Why not change the implementation idea for the registry to take advantage of this already existing model?

    Item #5/Generation on Startup:  That this doesn't scale doesn't require it being thrown out.  The documents could be generated during the build, at startup, on-demand, etc..  The current documentation model doesn't require it to occur at startup.  That was just the implementation chosen after some discussion on these very concepts.  The current approach scales to a very large number of components but one might choose to make documentation generation occur during the build phase instead of startup for instance.

     

    Improvement #1: Moving from getRelationships and getSupportedPropertyDescriptors to instead using an annotation driving method of finding relationships and descriptors addresses things which are truly static as suggested previously.  For relationships which get created at runtime how would those be articulated?  The same question would apply for dynamic properties but given that we already have facilities to support dynamic properties perhaps this one is easier to understand.  If then the answer for relationships would look like getSupportedDyanamicRelationships() and we just used annotation driven methods for finding the base properties and relationships I can see how that helps remove two methods that are common and the associated code.  However, the juice/squeeze ratio feels off for something that would effect every existing component someone might have built outside of the apache community.

    Improvement #2: There is a really slippery slope here.  Again so much of the stability of the overall systems depends on someone not writing 'questionable' code.  It isnt' clear to me how the proposal solves that in any substantive way as to warrant leaving the ServiceLoader API.

    Improvement #3: This can be accomplished without any of the proposed changes.  Simply moving the time that documentation is generated up in the build lifecycle solves this.  In fact, it seems like there are really two different proposals being made here.  The first is changing the API to remove boilerplate code specifically for getRelationships and getSupportedPropertyDescriptors.  The second is moving the time at which documentation is built.  I don't think they need to be considered to be related to one another.  The ExtensionRegistry is a very important concept and it isn't clear that our current documentation model harms that in any way.

     

  4. Thanks a lot to all who contributed with comments. Seems like there are lots of opinions, so I want to concentrate on the most contentious once at the moment as I am sure the rest will fall in once we get some consensus ;)
    So, let's talk about Object instances and ServiceLoaders.

    We need to look at it from the global perspective which includes ExtensionRegistry which based on implementation may be advertising components that are not available locally (e.g., Maven-based registry).
    ServiceLoader depends on availability of the components on the classpath prior to it being called. Asking users to download and install these components and then restart NiFi all to support the inner workings of the ServiceLoader seems a bit heavy and as described in ExtensionRegistry I was kind of hoping that user experience will not change when ExtensionRegistry is introduced (other then more info and conveniences) and framework will manage the download and caching of the components (a very straight forward task if we keep our options open). The main thing I want to make sure is that this conversation is not centered around how to save a ServiceLoader, but around features and best way to implement them and if "the best" way is still a ServiceLoader then so be it, but at the moment I am not yet convinced that it is and would like to keep an open head. 

    Even with documentation concerns aside, ServiceLoader will create an instance of every component available in the classpath; 100, 1000, 1000000 etc. even when you only new a few. And we can certainly debate the "so what" argument and the classpath scanning and calls to init() and throwaway instances, but as recently as last week we had a question from the user who could not start NiFi due to  some rogue/corrupted JAR from which NiFi was creating an instance of completely unrelated to his use case component and dying with NPE? I just think its wrong but most importantly confusing since you may be looking at the stack trace that makes no sense in the scope of what you are currently doing.

    From the cleanliness perspective, creating an object instance is an expression of intent to use it for what it was intended to be used. NiFi even has a convention of embedding the intent's name into the component's name (i.e., PutKafka, GetHttp). On top of that creation of an object instance implies creation of an object state - initial state or the state left from the previous run.
    These two distinctions are important for two reasons:

    1. A new component that may or may not be introduced into the system has no state (not even initial) yet its static documentation still needs to be available. Creating instance just for that seems like an overkill and we actually create 2-5 instances of each (thankfully that we can fix). But as mentioned earlier with Extension Registry such component and its dependencies may not even be available locally to create an instance.
    2. A component that is already present in the process space does have a state (e.g., configuration state) that may provide additional documentation relevant only to its particular instance. So two components of the same type may in fact have different instance-based documentation while its static parts are the same.

    Even if cleanliness and correctness is not accepted as enough of an argument, by not embracing this distinction we are simply locking ourselves out of a very valuable feature.

    Anyway, I know I've not responded to all points and as I mentioned that is intentional since I really want to burry these two cats (wink)

    1. Let's assume for continuation of the discussion of the ExtensionRegistry that documentation is already made available during build-time so we can eliminate that and the instance creation for it from this discussion.

      Let's consider the user experience we want to provide for the extension registry:

      • In a given NiFi cluster the user should be able to search for and reference processors/components just like they do today
      • Except they could now have the potential of electing to use a different version of a given component they use and could have multiple versions loaded at once and need to see the proper docs for each, etc..
      • When they choose to use a processor/component that isn't already available on their classpath we'd load it and away they go

      Let's consider a possible implementation of that:

      • We hotload the appropriate NAR for a given component and version
      • We hot-unload the appropriate NAR for a given component/version assuming no components in that nar are being referenced any longer.  Or we just mark it is a nar that will not reload on startup next time.  In fact, we never load nars on startup that don't have valid/current references to any components and we clear them from the cache (or only after being unused for a period of time, whatever)
      • When a user selects a component they don't have a local cache of we pull the appropriate nar and cache it locally

      This all seems pretty doable and I must be forgetting some critical detail because it even seems like it will be 'easy' to do.

      Are we in agreement up to this point?

      As for the cleanliness, correctness, and lock-out of valuable feature comment I do not understand what that is referencing.  Is that about ServiceLoader?  Could you share an example of a processor as-is and then after your proposed change?  And to be clear, I am not defending ServiceLoader.  If there is an option which achieves its goals as needed for us in NiFi which is easier and cleaner then certainly I'm good with that.

  5. Getting closer. . . much closer

    "Let's consider the user experience we want to provide for the extension registry:" - agreed on all 3 bullet points. The only thing I would like to add (and this and previous comments probably have to be linked to the ExtensionRegistry page) is that component's documentation must be available through NiFi UI and outside of NiFi UI. This statement alone sort of mandates that documentation must exist in a published NAR, hence generated at build time. And it appears that there is no contention on this point, we just need to decide on the mechanism. 

    "Let's consider a possible implementation of that:" - not sure about the first two points so will skip to the third one, since I totally agree with it. But going back to instance creation and it may speak directly to the first two bullets is that only components that are part of your flow has to be loaded. Components that are not in your flow yet in your cache have no business to be loaded. Think of a flow as a directive that tells NiFi what needs to be loaded. And I really want to stop talking  about "so what if its loaded" anymore, rather then - what is the purpose of loading something that won't be used? Yes, we started by saying that loading was required to produce documentation, but if documentation is going to be pre-produced (as we all seem to agree) that whole discussion is no longer valid. 

    "This all seems pretty doable" - yes, and relatively simple.

    "Are we in agreement up to this point?" - if my math is good I think we are at 80% (wink)

    "As for the cleanliness, correctness. . ." - No it has nothing to do with ServiceLoader. All I was trying is to draw the difference between component instance and component definition, and that we may consider a case where instance may have additional documentation that is only available through NiFi UI since it's he only system that maintains component instance. But I am willing to yield on this one if no one sees benefits in that. My second point was that we should at least acknowledge the  difference between component instance and component definition, so if we do decide at some point in the future to introduce instance specific documentation option we are not locked out of doing it.

  6. Ok so looks like 'generate documentation at build time' is the primary outcome.

    No API changes should be needed to accomplish that.

    If API changes (removing methods from the API or moving away from ServiceLoader) are warranted for other reasons let's discuss that back on the mailing list or in a separate feature/improvement.

    Thanks

     

    1. I absolutely agree - very well said!

  7. Without changing the API or replacing ServiceLoader, we can still deal with the throwaway instances in the meantime. I can see the need for at most two instances of each processor:

    1) The one created by the ServiceLoader at startup for the purposes of doc and other static methods

    2) The one created when the processor is dragged onto the canvas (actually, one for each instance on the canvas of course)

    If we are creating new instances for the purposes of calling a static method (or for any other reason that doesn't require a fully configured and unique instance of the processor), we could keep a "template map" of the original instances created by the ServiceLoader on startup. Any code not needing its own instance can get the "template" instance from the map.

  8. Matt, those are good points for the "meantime", but your comment made me think again about how wrong it sounds to create an instance for the purpose of executing static methods even though for our case its not exactly that. It's more of a pattern that is used all over NiFi where properties are defined as static variables, yet access to them is exposed via instance methods.

    Anyway, with the Extension Repositories (aka Extension Registry) for Dynamically-loaded Extensions, the ServiceLoader model will naturally go away anyway, since we will have a totally different mechanism to discover components and their documentation that are not currently in the classpath. Basically bringing hot-deploy option to NiFi where a component is discovered from the remote registry and brought into the running instance of NiFi. And with that even those components that have already been brought in would still need to be loaded on demand only since it would be pointless to maintain two different discovery mechanism (i.e., local vs remote)