Chris Mattmann and I had a discussion earlier over instant messenger about whether to write a customized ProductCrawler for CAS-Crawler or to use its other extension points, such as crawler actions or precondition comparators.
The conversation weighs the advantages and appropriateness of each approach. Hopefully you'll find it insightful for your own needs!
9:48:26 PM rverma: Hey Chris
9:51:36 PM matt3467: yo!
9:51:50 PM rverma: hey
9:51:56 PM rverma: so was playing around with crawler stuff
9:52:47 PM rverma: and it turns out that to use preconditions (preCondId) with StdProductCrawler, I'm going to have to extend the StdProductCrawler class anyhow, because it doesn't implement a required method (setPreCondId)
9:53:21 PM matt3467: yap
9:53:23 PM matt3467: don't use
9:53:25 PM matt3467: StdProductCrawler
9:53:31 PM matt3467: use MetExtractorProductCrawler
9:53:35 PM matt3467: and then use the MetReaderExtractor
9:53:40 PM matt3467: well it's just called MetReader
9:53:43 PM matt3467: that will achieve
9:53:43 PM rverma: yeah that supports it I saw
9:53:44 PM matt3467: the same capability
9:53:48 PM matt3467: as StdProductCrawler
9:53:50 PM matt3467: and give you
9:53:51 PM matt3467: the preCondIds
9:53:52 PM matt3467: tie in
9:53:54 PM matt3467: make sense?
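A quick aside to make Chris's suggestion concrete: below is a minimal sketch of that wiring in plain Java. I'm assuming the setter names mirror the crawler's Spring bean properties (metExtractor and preCondIds), and the precondition id "CheckDatasetOk" is a hypothetical bean you'd define in your precondition config; in practice this wiring usually lives in the crawler's Spring XML rather than in code.

```java
import java.util.Arrays;

import org.apache.oodt.cas.crawl.MetExtractorProductCrawler;
import org.apache.oodt.cas.metadata.extractors.MetReaderExtractor;

public class CrawlerSetupSketch {
    public static void main(String[] args) throws Exception {
        // MetExtractorProductCrawler driven by MetReaderExtractor reads
        // pre-existing .met files, matching StdProductCrawler's behavior...
        MetExtractorProductCrawler crawler = new MetExtractorProductCrawler();
        crawler.setMetExtractor(MetReaderExtractor.class.getName());
        // ...while also honoring preconditions; "CheckDatasetOk" is a
        // hypothetical comparator bean id from precondition config.
        crawler.setPreCondIds(Arrays.asList("CheckDatasetOk"));
    }
}
```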
9:54:42 PM rverma: yeah…
9:59:23 PM rverma: so after playing around with this stuff today, I was looking into the various options, using preIngestId/actionId… and I concluded that just writing a custom crawler class was the simplest way to implement, as well as invoke, what I need…
9:59:50 PM matt3467: i bet you did it simple
9:59:52 PM matt3467: i'm trying to figure out
9:59:54 PM matt3467: why actions are hard
10:00:00 PM matt3467: or comparators
10:00:00 PM matt3467: and to improve our software
10:00:02 PM matt3467: would you say
10:00:05 PM matt3467: it's the bean definitions
10:00:08 PM matt3467: and stuff like that?
10:00:13 PM matt3467: or just too hard
10:00:15 PM matt3467: to inject
10:00:19 PM matt3467: into the crawler workflow
10:00:20 PM matt3467: what you want to do?
10:01:33 PM rverma: hmm …
10:01:35 PM rverma: well ..
10:04:50 PM rverma: I mean simple as in how well it aligns with the technology it uses to accomplish the job, and also in terms of how to invoke/use it when needed… meaning, what I'm really trying to do is make a custom crawling process that can be used in any crawler…
10:05:28 PM rverma: I don't think actionIds/preCondIds are difficult per se, they're all just Java classes that I'd be extending, but the intention behind those extendable tools is a bit different from what I want to do?
10:05:48 PM matt3467: hmm
10:05:49 PM matt3467: is it?
10:05:57 PM matt3467: aren't you just trying to create
10:06:00 PM matt3467: met files
10:06:05 PM matt3467: or stop the creation of them
10:06:08 PM matt3467: on a per file basis?
10:06:20 PM matt3467: or to stop crawling
10:06:25 PM matt3467: or enact it based on their presence?
10:07:30 PM rverma: basically trying to generate met files, yes, before proceeding with ingest operations.
10:08:03 PM rverma: but I feel like that's an implementation detail… encapsulating that functionality within a custom crawler means the user of that crawler doesn't have to worry about how the process works
10:08:14 PM matt3467: well
10:08:17 PM matt3467: that's where i assert
10:08:18 PM matt3467: that functionality
10:08:20 PM matt3467: is already present
10:08:22 PM matt3467: that's the *precise*
10:08:25 PM matt3467: responsibility
10:08:28 PM matt3467: of precondition comparators
10:08:31 PM matt3467: they block the creation
10:08:32 PM matt3467: of met files
10:08:36 PM matt3467: (i.e. block or allow)
10:08:39 PM matt3467: met extraction
10:08:40 PM matt3467: which in turn
10:08:42 PM matt3467: block or allow ingestion
10:08:46 PM matt3467: that's the use case
10:08:47 PM matt3467: they are for?
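To ground what Chris is asserting here: a precondition comparator is just a small class the crawler consults before met extraction, and if it returns false, no met file is produced and the product is never ingested. Here's a minimal standalone sketch of that gate; the non-empty-file condition is purely illustrative, and you'd adapt the method into the check hook of your OODT version's PreConditionComparator.

```java
import java.io.File;

/**
 * Sketch of a precondition gate: allow met extraction only for
 * non-empty files. Standalone Java; adapt passes(...) into the
 * PreConditionComparator hook of your OODT version.
 */
public class NonEmptyFilePrecondition {
    public boolean passes(File product) {
        // false => met extraction is skipped for this product,
        // which in turn blocks its ingestion
        return product.exists() && product.length() > 0;
    }
}
```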
10:09:09 PM rverma: isn't that on a per file basis?
10:09:21 PM matt3467: yeah isn't that
10:09:23 PM matt3467: what you are trying to do?
10:09:25 PM matt3467: well
10:09:28 PM matt3467: let me retract that
10:09:33 PM matt3467: yeah, that isn't exactly the case
10:09:37 PM matt3467: i could write a precondition comparator
10:09:40 PM matt3467: that writes a signal file
10:09:42 PM matt3467: to block the next N
10:09:43 PM matt3467: calls to it
10:09:46 PM matt3467: you know what i mean?
10:09:48 PM matt3467: i can signal things
10:09:54 PM matt3467: too for downstream calls to it
10:10:01 PM matt3467: on subsequent files that the crawler encounters
10:10:03 PM matt3467: it's just a Java class
10:10:05 PM matt3467: which was my point before
10:10:10 PM matt3467: anything you can do by extending
10:10:10 PM matt3467: the crawler
10:10:15 PM matt3467: you can do in a precondition comparator
10:10:18 PM matt3467: and i think it's more in flow
10:10:21 PM matt3467: with their intention
10:10:22 PM matt3467: in the architecture
10:10:26 PM matt3467: btw
10:10:29 PM matt3467: if this is getting too meta
10:10:31 PM matt3467: feel free to jump in
10:10:33 PM matt3467: with some code
10:10:43 PM rverma: hahah
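Since Chris invited code: here's roughly what his signal-file idea looks like. The first call does the dataset-level work and drops a signal file; every later call for a file in the same directory sees the signal and short-circuits. This is a standalone sketch: the signal-file name and doDatasetCheck are hypothetical, and passes(...) stands in for the real comparator hook.

```java
import java.io.File;
import java.io.IOException;

/**
 * Sketch of the "signal file" trick: run a dataset-level check once,
 * then skip the work on subsequent calls for files in the same dir.
 */
public class OncePerDatasetPrecondition {
    private static final String SIGNAL_FILE = ".dataset-checked"; // hypothetical name

    public boolean passes(File product) {
        File signal = new File(product.getParentFile(), SIGNAL_FILE);
        if (signal.exists()) {
            return true; // dataset-level work already done; just allow extraction
        }
        boolean datasetOk = doDatasetCheck(product.getParentFile());
        if (datasetOk) {
            try {
                signal.createNewFile(); // leave a marker for downstream calls
            } catch (IOException ignored) {
                // if the marker can't be written, the check simply reruns
            }
        }
        return datasetOk;
    }

    private boolean doDatasetCheck(File datasetDir) {
        // hypothetical: validate the dataset as a whole (manifest, counts, ...)
        return datasetDir != null && datasetDir.isDirectory();
    }
}
```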
10:15:44 PM rverma: hm.. well.. I guess I was looking at it like this: actionIds/preCondIds are meant for per-product actions (since only individual products are passed into their respective methods for checks to be performed)… what I want is (1) a meta check to see if the dataset is any good and (2) an aggregated action to generate met for all products. When looking at the code for ProductCrawler (specifically "crawl") I can see the flow goes from root-level directory checking to individual product checking. I can place my special code right before the individual product checking and that's all that is needed.
10:16:08 PM matt3467: ah
10:16:10 PM matt3467: gotcha
10:16:15 PM matt3467: so the assumption that
10:16:21 PM matt3467: actionIds/preCondIds only operate on a file
10:16:23 PM matt3467: is wrong
10:16:25 PM matt3467: take the case where
10:16:28 PM matt3467: a Product is a directory
10:16:29 PM matt3467: of files
10:16:39 PM matt3467: e.g., Product.STRUCTURE_HIERARCHICAL
10:16:46 PM matt3467: remember File.isDirectory() in that case
10:16:48 PM matt3467: would be true
10:16:50 PM matt3467: from a Java perspective
10:16:51 PM matt3467: right
10:16:58 PM matt3467: so even though PreConditionComparators
10:17:00 PM matt3467: and Actions
10:17:04 PM matt3467: are still in the File-space
10:17:08 PM matt3467: that is by design
10:17:10 PM matt3467: b/c a File in java
10:17:13 PM matt3467: != a single file
10:17:17 PM matt3467: it could be a dir too right?
10:17:28 PM matt3467: so we had a case on OCO where we used hierarchical directory products
10:17:33 PM matt3467: and all of our actions and met extractors
10:17:35 PM matt3467: were based on a directory
10:17:38 PM matt3467: all you do
10:17:40 PM matt3467: in that scenario
10:17:43 PM matt3467: is turn --noRecur
10:17:44 PM matt3467: on
10:17:50 PM matt3467: from the crawler perspective
10:17:52 PM matt3467: and then you turn on
10:17:58 PM matt3467: the flag that tells it to look for directories
10:18:00 PM matt3467: umm what's it called
10:18:27 PM matt3467: --crawlForDirs
10:18:28 PM matt3467: that's it
10:18:33 PM matt3467: so...
10:18:42 PM matt3467: a crawler with --noRecur and --crawlForDirs
10:18:43 PM matt3467: will require
10:18:51 PM matt3467: 1. met extractors that operate on an entire directory
10:18:59 PM matt3467: (works fine because java.io.File can represent a dir)
10:19:03 PM matt3467: 2. precondition comparators
10:19:04 PM matt3467: that do the same
10:19:06 PM matt3467: and 3.
10:19:07 PM matt3467: actions that do the same
10:19:09 PM matt3467: does that make sense?
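In other words, with --noRecur and --crawlForDirs the File handed to your met extractors, precondition comparators, and actions is the product directory itself, so all three can operate at the dataset level. A standalone sketch of a directory-level check (the manifest.xml marker file is hypothetical):

```java
import java.io.File;

/**
 * Sketch of a directory-level precondition for hierarchical products
 * (crawler run with --noRecur --crawlForDirs): the "product" File is
 * a directory, so java.io.File works unchanged.
 */
public class DirectoryProductPrecondition {
    public boolean passes(File product) {
        // only treat directories containing a manifest as ingestable
        // ("manifest.xml" is a hypothetical marker file)
        return product.isDirectory()
            && new File(product, "manifest.xml").exists();
    }
}
```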
10:19:43 PM rverma: hmmm… yeah File could be a directory… see your point there.
10:20:01 PM rverma: so here's the thing.. ProductCrawler
10:20:02 PM rverma: is passed
10:20:05 PM rverma: rootDir
10:20:16 PM rverma: that's easy.. that's the only thing I need
10:20:21 PM rverma: can take things from there
10:21:07 PM rverma: how do you pass rootDir… or something equivalent… via the preIngestId method?
10:21:53 PM rverma: or… not to mean "pass"… but how do you get a piece of code to operate (1) only once and (2) only for a single "file"?
10:22:08 PM matt3467: does that single file
10:22:09 PM rverma: (via the preIngestId method)?
10:22:11 PM matt3467: have anything special about it?
10:22:17 PM matt3467: that you can search for
10:22:19 PM matt3467: in your action
10:22:20 PM rverma: yes
10:22:22 PM matt3467: or precondComparator
10:22:22 PM matt3467: ok
10:22:26 PM matt3467: so you just wire that in
10:22:30 PM matt3467: to your action
10:22:31 PM matt3467: or comparator
10:22:34 PM matt3467: make the action return false
10:22:37 PM matt3467: if the work has already been done
10:22:41 PM matt3467: or the preconditioncomparator
10:22:43 PM matt3467: to do the same thing
10:22:44 PM matt3467: (return false)
10:22:46 PM matt3467: if work has been done
10:22:48 PM matt3467: true otherwise
10:22:50 PM matt3467: (and do the work)
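Here's a sketch of that "return false if the work has already been done" pattern as a crawler action. I'm assuming CrawlerAction's performAction(File, Metadata) hook, so double-check the signature against your OODT version; the marker-file name is hypothetical.

```java
import java.io.File;
import java.io.IOException;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.metadata.Metadata;

/**
 * Sketch of an action that does its work exactly once: a marker file
 * records that the work happened, so later calls return false
 * instead of repeating it.
 */
public class OneShotAction extends CrawlerAction {
    private static final String MARKER = ".work-done"; // hypothetical name

    @Override
    public boolean performAction(File product, Metadata metadata)
            throws CrawlerActionException {
        File marker = new File(product.getParentFile(), MARKER);
        if (marker.exists()) {
            return false; // work already done; signal "don't repeat it"
        }
        // ... do the one-time work here (e.g., dataset-level met generation) ...
        try {
            return marker.createNewFile(); // true => work done on this pass
        } catch (IOException e) {
            throw new CrawlerActionException("could not write marker: " + e.getMessage());
        }
    }
}
```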
10:23:54 PM rverma: so, the comparator will perform this check for every "file" in the ingestDir ..?
10:25:19 PM matt3467: yeah think of comparator
10:25:22 PM matt3467: as something that the crawler calls
10:25:24 PM matt3467: first to determine
10:25:25 PM matt3467: whether or not
10:25:27 PM matt3467: metadata is extracted
10:25:33 PM matt3467: it calls the set of identified
10:25:36 PM matt3467: pre cond comparator beans
10:25:40 PM matt3467: in order
10:25:43 PM matt3467: so there can be > 1
10:25:46 PM matt3467: per file encountered
10:25:49 PM matt3467: if they all pass
10:25:55 PM matt3467: then met file is generated (by the extractor)
10:25:59 PM matt3467: at that point you enter preIngest phase
10:26:03 PM matt3467: where pre ingest actions are run
10:26:12 PM matt3467: and then postIngestSuccess
10:26:17 PM matt3467: OR postIngestFailure
10:26:18 PM matt3467: see here
10:26:26 PM matt3467: http://oodt.apache.org/components/maven/crawler/user/
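Paraphrasing the lifecycle Chris just walked through (and the user guide linked above), the per-file flow looks roughly like this. Every method in the sketch is an illustrative stub, not a real OODT API; it exists only to show the ordering of the phases:

```java
import java.io.File;
import java.util.List;

/**
 * Paraphrase of ProductCrawler's per-file flow, per the crawler user
 * guide. All methods below are illustrative stubs.
 */
public class CrawlFlowSketch {

    void crawl(List<File> products) {
        for (File product : products) {
            // 1. preconditions: each preCondIds bean runs, in order
            if (!preconditionsPass(product)) {
                continue; // no met extraction, hence no ingest
            }
            // 2. met extraction: the met file is generated here
            Object met = extractMetadata(product);
            // 3. preIngest actions, then the ingest itself
            if (preIngestActionsPass(product, met) && ingest(product, met)) {
                postIngestSuccess(product, met);   // 4a. success actions
            } else {
                postIngestFailure(product, met);   // 4b. failure actions
            }
        }
    }

    boolean preconditionsPass(File f) { return true; }
    Object extractMetadata(File f) { return new Object(); }
    boolean preIngestActionsPass(File f, Object met) { return true; }
    boolean ingest(File f, Object met) { return true; }
    void postIngestSuccess(File f, Object met) {}
    void postIngestFailure(File f, Object met) {}
}
```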
10:27:34 PM rverma: ok
10:28:12 PM rverma: so lemme ask you this, what would be an appropriate scenario to write a custom ProductCrawler?
10:28:35 PM matt3467: if you can make the argument
10:28:41 PM matt3467: that the workflow provided by StdProductCrawler
10:28:45 PM matt3467: MetExtractorProductCrawler
10:28:51 PM matt3467: and/or AutoDetectProductCrawler
10:28:52 PM matt3467: won't cut it
10:28:54 PM matt3467: but IMHO
10:29:00 PM matt3467: those have dealt with 99.999%
10:29:03 PM matt3467: of the scenarios i've seen
10:29:06 PM matt3467: within the last 7 years
10:29:12 PM matt3467: since i designed them
10:29:15 PM matt3467: the innovation is in the actions and comparators
10:29:20 PM matt3467: i've seen a few folks go down
10:29:22 PM matt3467: the road of custom crawler
10:29:26 PM matt3467: only to be like, crap
10:29:30 PM matt3467: why did i do that
10:29:35 PM matt3467: now, one or two folks
10:29:39 PM matt3467: have also just made their custom crawler
10:29:39 PM matt3467: and moved on
...