...
e.g. http://www.xyz.org/ nutch.score=10 nutch.fetchInterval=2592000 userType=open_source
Nutch 1.x
Usage:
bin/nutch inject [-D...] <crawldb> <url_dir> [-overwrite|-update] [-noFilter] [-noNormalize] [-filterNormalizeAll]

  <crawldb>   Path to a crawldb directory. If not present, a new one would be created.
  <url_dir>   Path to URL file or directory with URL file(s) containing URLs to be injected.
              A URL file should have one URL per line, optionally followed by custom metadata.
              Blank lines or lines starting with a '#' would be ignored. Custom metadata must
              be of form 'key=value' and separated by tabs.
              Below are reserved metadata keys:

                nutch.score: A custom score for a url
                nutch.fetchInterval: A custom fetch interval for a url
                nutch.fetchInterval.fixed: A custom fetch interval for a url that is not changed by AdaptiveFetchSchedule

              Example:
                http://www.apache.org/
                http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

  -overwrite  Overwrite existing crawldb records by the injected records. Has precedence over 'update'
  -update     Update existing crawldb records with the injected records. Old metadata is preserved

  -nonormalize  Do not normalize URLs before injecting
  -nofilter     Do not apply URL filters to injected URLs
  -filterNormalizeAll
              Normalize and filter all URLs including the URLs of existing CrawlDb records

  -D...       set or overwrite configuration property (property=value)
  -Ddb.update.purge.404=true
              remove URLs with status gone (404) from CrawlDb
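The seed-file format described in the usage text (one URL per line, optional tab-separated 'key=value' metadata, blank lines and '#' lines ignored) can be sketched in Python as follows. This is only an illustration: the helper names format_seed_line, parse_seed_line and write_seed_file are hypothetical and not part of Nutch, and the parser only approximates how the injector reads a line.

```python
from pathlib import Path

# Metadata keys the usage text lists as reserved by the injector.
RESERVED_KEYS = ("nutch.score", "nutch.fetchInterval", "nutch.fetchInterval.fixed")


def format_seed_line(url, metadata=None):
    """Build one seed-file line: the URL followed by tab-separated key=value pairs."""
    fields = [url]
    for key, value in (metadata or {}).items():
        fields.append(f"{key}={value}")
    return "\t".join(fields)


def parse_seed_line(line):
    """Split a seed line into (url, metadata); return None for blank and '#' lines.

    Assumption: whitespace around the tab separators is not significant.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    url, *pairs = stripped.split("\t")
    metadata = {}
    for pair in pairs:
        if "=" in pair:
            key, value = pair.strip().split("=", 1)
            metadata[key] = value
    return url.strip(), metadata


def write_seed_file(path, entries):
    """Write (url, metadata) entries to a seed file, one URL per line."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    lines = [format_seed_line(url, meta) for url, meta in entries]
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For instance, write_seed_file("urls/seed.txt", [("http://www.apache.org/", None), ("http://www.nutch.org/", {"nutch.score": 10, "nutch.fetchInterval": 2592000, "userType": "open_source"})]) produces a file matching the example in the usage text, which could then be injected with bin/nutch inject crawl/crawldb urls. The paths urls/seed.txt and crawl/crawldb are assumed here for illustration.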
<crawldb>: The directory containing the crawldb
...