Blog

Chris Mattmann and I had a discussion earlier over instant messenger about whether to write a customized ProductCrawler for CAS-Crawler or to use other extension points, such as crawler actions or preconditions.

The conversation delves into the advantages and appropriateness of each approach. Hopefully you'll find it insightful for your own needs!


9:48:26 PM rverma: Hey Chris

9:51:36 PM matt3467: yo!

9:51:50 PM rverma: hey 

9:51:56 PM rverma: so was playing around with crawler stuff

9:52:47 PM rverma: and turns out to use preconditions (preCondId) with StdProductCrawler - I'm going to have to extend the StdProductCrawler class anyhow, because it doesn't implement a required method (setPreCondId)

9:53:21 PM matt3467: yap

9:53:23 PM matt3467: don't use

9:53:25 PM matt3467: StdProductCrawler

9:53:31 PM matt3467: use MetExtractorProductCrawler

9:53:35 PM matt3467: and then use the MetReaderExtractor

9:53:40 PM matt3467: well it's just called MetReader

9:53:43 PM matt3467: that will achieve

9:53:43 PM rverma: yeah that supports it I saw

9:53:44 PM matt3467: the same capability

9:53:48 PM matt3467: as StdProductCrawler

9:53:50 PM matt3467: and give you

9:53:51 PM matt3467: the preCondIds

9:53:52 PM matt3467: tie in

9:53:54 PM matt3467: make sense?

9:54:42 PM rverma: yeah… 

9:59:23 PM rverma: so after playing around with this stuff today, I was looking into the various options, using preIngestId/actionId… and I concluded just writing a custom crawler class was the simplest way to implement as well as invoke what I need…

9:59:50 PM matt3467: i bet you did it simple

9:59:52 PM matt3467: i'm trying to figure out

9:59:54 PM matt3467: why actions are hard 

9:59:57 PM matt3467: or comparators

10:00:00 PM matt3467: and to improve our software

10:00:02 PM matt3467: would you say

10:00:05 PM matt3467: it's the bean definitions

10:00:08 PM matt3467: and stuff like that?

10:00:13 PM matt3467: or just too hard

10:00:15 PM matt3467: to inject

10:00:19 PM matt3467: into the crawler workflow

10:00:20 PM matt3467: what you want to do?

10:01:33 PM rverma: hmm … 

10:01:35 PM rverma: well ..

10:04:50 PM rverma: I mean simple as in how well it is aligned with the technology that it is using to accomplish the job and also in terms of how to invoke/use it when needed… meaning, what I'm really trying to do is make a custom crawling process that can be used in any crawler..

10:05:28 PM rverma: I don't think actionIds/preIngestIds are difficult per se, they are all just Java classes that I'd be extending, but the intention behind those extendable tools is a bit different than what I want to do?

10:05:48 PM matt3467: hmm

10:05:49 PM matt3467: is it?

10:05:57 PM matt3467: aren't you just trying to create 

10:06:00 PM matt3467: met files 

10:06:05 PM matt3467: or stop the creation of them

10:06:08 PM matt3467: on a per file basis?

10:06:20 PM matt3467: or to stop crawling

10:06:25 PM matt3467: or enact it based on their presence?

10:07:30 PM rverma: basically trying to generate met files, yes, before proceeding with ingest operations. 

10:08:03 PM rverma: but I feel like that's an implementation detail… having that functionality encased within a custom Crawler makes the user of that crawler not have to worry about how the process works

10:08:14 PM matt3467: well

10:08:17 PM matt3467: that's where i assert

10:08:18 PM matt3467: that functionality

10:08:20 PM matt3467: is already present

10:08:22 PM matt3467: that's the *precise*

10:08:25 PM matt3467: responsibility

10:08:28 PM matt3467: of precondition comparators

10:08:31 PM matt3467: they block the creation

10:08:32 PM matt3467: of met files

10:08:36 PM matt3467: (i.e. block or allow)

10:08:39 PM matt3467: met extraction

10:08:40 PM matt3467: which in turn

10:08:42 PM matt3467: block or allow ingestion

10:08:46 PM matt3467: that's the use case

10:08:47 PM matt3467: they are for?

10:09:09 PM rverma: isn't that on a per file basis? 

10:09:21 PM matt3467: yeah isn't that

10:09:23 PM matt3467: what you are trying to do?

10:09:25 PM matt3467: well 

10:09:28 PM matt3467: let me retract that

10:09:33 PM matt3467: yeah isn't exactly the case

10:09:37 PM matt3467: i could write a precondition comparator

10:09:40 PM matt3467: that writes a signal file

10:09:42 PM matt3467: to block the next N

10:09:43 PM matt3467: calls to it

10:09:46 PM matt3467: you know what i mean?

10:09:48 PM matt3467: i can signal things

10:09:54 PM matt3467: too for downstream calls to it

10:10:01 PM matt3467: on subsequent files that the crawler encounters

10:10:03 PM matt3467: it's just a Java class

10:10:05 PM matt3467: which was my point before

10:10:10 PM matt3467: anything you can do by extending

10:10:10 PM matt3467: the crawler

10:10:15 PM matt3467: you can do in a precondition comparator

10:10:18 PM matt3467: and i think it's more in flow

10:10:21 PM matt3467: with their intention

10:10:22 PM matt3467: in the architecture

10:10:26 PM matt3467: btw

10:10:29 PM matt3467: if this is getting too meta

10:10:31 PM matt3467: feel free to jump in

10:10:33 PM matt3467: with some code 

10:10:43 PM rverma: hahah
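Taking Chris up on the invitation: the signal-file trick he describes can be sketched in plain Java. This is only an illustration of the pattern, not the real OODT PreConditionComparator API; the class name, the marker-file name, and the `passes(File)` method below are all invented for this sketch, and the actual base class and method signatures should be checked against the crawler documentation.

```java
import java.io.File;
import java.io.IOException;

// Sketch of a "run once" precondition check using a signal file:
// the first call does the expensive dataset-level work and drops a
// marker file; later calls see the marker and skip the work.
// (Hypothetical class, not the actual OODT PreConditionComparator API.)
public class RunOnceCheck {

    private final File signalFile;

    public RunOnceCheck(File datasetDir) {
        // Marker lives alongside the products being crawled.
        this.signalFile = new File(datasetDir, ".met-generated");
    }

    /**
     * Per the convention in the chat: return true (and do the work)
     * if the work has not been done yet, false otherwise.
     */
    public boolean passes(File product) throws IOException {
        if (signalFile.exists()) {
            // Work already done by an earlier call for this dataset.
            return false;
        }
        doAggregateWork(product.getParentFile());
        return signalFile.createNewFile();
    }

    private void doAggregateWork(File dir) {
        // e.g., validate the dataset and generate met for all products
    }
}
```

The same marker-file check would work equally well inside a crawler action, since both are just Java classes the crawler calls per file.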

10:15:44 PM rverma: hm.. well.. I guess I was looking at it like this: actionIds/preCondIds are meant for per-product related actions (since only singular products are passed into their respective methods - for checks to be performed)…. what I want is (1) a meta check to see if the dataset is any good and (2) an aggregated action to generate met for all products. When looking at the code for ProductCrawler (specifically "crawl") I can see the flow goes from root-level directory checking to individual product checking. I can place my special code right before the individual product checking and that's all that is needed.

10:16:08 PM matt3467: ah

10:16:10 PM matt3467: gotcha

10:16:15 PM matt3467: so the assumption that 

10:16:21 PM matt3467: actionIds/preCondIds only operate on a file

10:16:23 PM matt3467: is wrong

10:16:25 PM matt3467: take the case where

10:16:28 PM matt3467: a Product is a directory

10:16:29 PM matt3467: of files

10:16:39 PM matt3467: e.g., Product.STRUCTURE_HIERARCHICAL

10:16:46 PM matt3467: remember File.isDirectory() in that case

10:16:48 PM matt3467: would be true

10:16:50 PM matt3467: from a Java perspective

10:16:51 PM matt3467: right

10:16:58 PM matt3467: so even though PreConditionComparators

10:17:00 PM matt3467: and Actions

10:17:04 PM matt3467: are still in the File-space

10:17:08 PM matt3467: that is by design 

10:17:10 PM matt3467: b/c a File in java

10:17:13 PM matt3467: != a single file

10:17:17 PM matt3467: it could be a dir too right?

10:17:28 PM matt3467: so we had a case on OCO where we used hierarchical directory products

10:17:33 PM matt3467: and all of our actions and met extractors

10:17:35 PM matt3467: were based on a directory

10:17:38 PM matt3467: all you do

10:17:40 PM matt3467: in that scenario

10:17:43 PM matt3467: is turn --noRecur

10:17:44 PM matt3467: on 

10:17:50 PM matt3467: from the crawler perspective

10:17:52 PM matt3467: and then you turn on 

10:17:58 PM matt3467: the flag that tells it to look for directories

10:18:00 PM matt3467: umm what's it called

10:18:27 PM matt3467: --crawlForDirs

10:18:28 PM matt3467: that's it

10:18:33 PM matt3467: so...

10:18:42 PM matt3467: a crawler with --noRecur and --crawlForDirs

10:18:43 PM matt3467: will require

10:18:51 PM matt3467: 1. met extractors that operate on an entire directory

10:18:59 PM matt3467: (works fine because java.io.File can represent a dir)

10:19:03 PM matt3467: 2. precondition comprators

10:19:04 PM matt3467: that do the same

10:19:06 PM matt3467: and 3.

10:19:07 PM matt3467: actions that do the same

10:19:09 PM matt3467: does that make sense?
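Concretely, a crawler run with --noRecur and --crawlForDirs treats each immediate subdirectory of the crawl root as a single hierarchical product. A minimal sketch of that selection logic in plain Java (the class and method names here are mine for illustration, not an OODT API):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DirProductLister {

    /**
     * Sketch of what --noRecur + --crawlForDirs implies: do not descend
     * into the tree, and treat each directory directly under the crawl
     * root as one product (a java.io.File for which isDirectory() is true).
     */
    public static List<File> listTopLevelDirProducts(File rootDir) {
        List<File> products = new ArrayList<>();
        File[] entries = rootDir.listFiles();
        if (entries == null) {
            return products; // rootDir missing or not a directory
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                products.add(entry); // each dir is a candidate product
            }
            // plain files at the root are skipped under --crawlForDirs
        }
        return products;
    }
}
```

Each File handed to the extractors, comparators, and actions is then a directory, which is why the OCO setup Chris mentions worked without any custom crawler.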

10:19:43 PM rverma: hmmm… yeah File could be a directory… see your point there. 

10:20:01 PM rverma: so here's the thing.. ProductCrawler

10:20:02 PM rverma: is passed

10:20:05 PM rverma: rootDir

10:20:16 PM rverma: that's easy.. that's the only thing I need

10:20:21 PM rverma: can take things from there

10:21:07 PM rverma: how do you pass rootDir.. or something to that equivalent via preIngestId method?

10:21:53 PM rverma: or .. not to mean "pass".. but how do you get a piece of code to operate (1) only once and (2) only for a single "file" ?

10:22:08 PM matt3467: does that single file

10:22:09 PM rverma: (via the preIngestId method)? 

10:22:11 PM matt3467: have anything special about it?

10:22:17 PM matt3467: that you can search for

10:22:19 PM matt3467: in your action

10:22:20 PM rverma: yes

10:22:22 PM matt3467: or precondComparator

10:22:22 PM matt3467: ok

10:22:26 PM matt3467: so you just wire that in

10:22:30 PM matt3467: to your action

10:22:31 PM matt3467: or comparator

10:22:34 PM matt3467: make the action return false

10:22:37 PM matt3467: if the work has already been done

10:22:41 PM matt3467: or the preconditioncomparator

10:22:43 PM matt3467: to do the same thing

10:22:44 PM matt3467: (return false)

10:22:46 PM matt3467: if work has been done

10:22:48 PM matt3467: true otherwise

10:22:50 PM matt3467: (and do the work)

10:23:54 PM rverma: so, the comparator will perform this check for every "file" in the ingestDir ..?

10:25:19 PM matt3467: yeah think of comparator

10:25:22 PM matt3467: as something that the crawler calls

10:25:24 PM matt3467: first to determine

10:25:25 PM matt3467: whether or not

10:25:27 PM matt3467: metadata is extractor

10:25:30 PM matt3467: err extracted

10:25:33 PM matt3467: it calls the set of identified

10:25:36 PM matt3467: pre cond comparator benas

10:25:39 PM matt3467: err beans

10:25:40 PM matt3467: in order

10:25:43 PM matt3467: so there can be > 1

10:25:46 PM matt3467: per file encountered

10:25:49 PM matt3467: if they all pass

10:25:55 PM matt3467: then met file is generated (by the extractor)

10:25:59 PM matt3467: at that point you enter preIngest phase

10:26:03 PM matt3467: where pre ingest actions are run

10:26:12 PM matt3467: and then postIngestSuccess

10:26:17 PM matt3467: OR postIngestFailure

10:26:18 PM matt3467: see here

10:26:26 PM matt3467: http://oodt.apache.org/components/maven/crawler/user/
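The per-file sequence Chris outlines (and the user guide linked above documents) can be summarized in a short control-flow sketch. The method name and phase labels below are descriptive placeholders of mine, not the real crawler interfaces:

```java
import java.io.File;
import java.util.List;
import java.util.function.Predicate;

public class CrawlFlowSketch {

    /**
     * Sketch of the per-file crawler flow described above:
     * 1. every configured precondition comparator must pass, in order;
     * 2. then the met file is generated by the extractor;
     * 3. then preIngest actions run before ingestion;
     * 4. then postIngestSuccess OR postIngestFailure actions run.
     * Returns a label for the phase the file reached, for illustration.
     */
    public static String processFile(File product,
                                     List<Predicate<File>> preconditions,
                                     Predicate<File> extractMet,
                                     Predicate<File> preIngest,
                                     Predicate<File> ingest) {
        for (Predicate<File> precondition : preconditions) {
            if (!precondition.test(product)) {
                return "skipped"; // no met extraction, hence no ingest
            }
        }
        if (!extractMet.test(product)) {
            return "metExtractionFailed";
        }
        if (!preIngest.test(product)) {
            return "preIngestBlocked";
        }
        return ingest.test(product) ? "postIngestSuccess"
                                    : "postIngestFailure";
    }
}
```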

10:27:34 PM rverma: ok

10:28:12 PM rverma: so lemme ask you this, what would be an appropriate scenario to write a custom ProductCrawler?

10:28:35 PM matt3467: if you can make the argument

10:28:41 PM matt3467: that the workflow provided by StdProductCrawler

10:28:45 PM matt3467: MetExtractorProductCrawler

10:28:51 PM matt3467: and/or AutoDetectProductCrawler

10:28:52 PM matt3467: won't cut it

10:28:54 PM matt3467: but IMHO 

10:29:00 PM matt3467: those have dealt with 99.999%

10:29:03 PM matt3467: of the scenarios i've seen

10:29:06 PM matt3467: within the last 7 years

10:29:12 PM matt3467: since i designed them

10:29:15 PM matt3467: the innovation is in the actions and comparators

10:29:20 PM matt3467: i've seen a few folks go down

10:29:22 PM matt3467: the road of custom crawler

10:29:26 PM matt3467: only to be like, crap

10:29:30 PM matt3467: why did i do that

10:29:35 PM matt3467: now, one or two folks

10:29:39 PM matt3467: have also just made their custom crawler

10:29:39 PM matt3467: and moved on

...