Nutch Command Line Options of bin/nutch
The following is a complete list of Nutch command line options, so some of them may not be available in the particular version of Nutch you are using. For version-specific availability, see the version columns in the table. Once you know that a command exists in your Nutch distribution, you can navigate to the relevant wiki entry for a detailed description of the tool.
The script bin/nutch is a helper which picks different java classes to "run".
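The dispatch idea can be sketched in a few lines of shell. This is a minimal illustration, not the actual bin/nutch script: the function name `resolve_class` is hypothetical, and the single alias shown is an assumption inferred from this page, which lists both the `normalizerchecker` command and the class `org.apache.nutch.net.URLNormalizerChecker`. Any other argument is treated as a CLASSNAME, as in the last row of the command table.

```shell
#!/bin/sh
# Illustrative sketch only; the real bin/nutch script handles many more commands.

# resolve_class (hypothetical helper): map a command word to a Java class name.
resolve_class() {
  case "$1" in
    # Alias inferred from this page (assumption, not copied from the script).
    normalizerchecker) echo "org.apache.nutch.net.URLNormalizerChecker" ;;
    # Anything else is taken to be a fully qualified CLASSNAME.
    *) echo "$1" ;;
  esac
}

resolve_class normalizerchecker
resolve_class org.apache.nutch.scoring.webgraph.WebGraph
```

A table-driven `case` statement keeps the short command names and the Java class names in one place, which is why this shape suits a helper script of this kind.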
The crawl script (NUTCH-1087, https://issues.apache.org/jira/browse/NUTCH-1087) replaces the bin/nutch crawl command used up to versions 1.7 and 2.2.1.
Note: Most commands print help when invoked without parameters.
See each entry for details of the command arguments and options.
command | function | version | |
1.x | 2.x | ||
Read / dump crawl db | X | X | |
Merge crawldb-s, with optional filtering | X | ||
Read / dump link db | X | ||
Inject new URLs into the database | X | X | |
Generate new segments to fetch from crawldb | X | X | |
Generate new segments to fetch from text files | X | ||
Fetch a segment's pages | X | X | |
Parse a segment's pages | X | X | |
Read / dump segment data | X | ||
Merges multiple segments, with optional filtering and slicing | X | ||
Update crawldb (from segments if in 1.x) after fetching | X | X | |
Update hostdb after fetching | X | ||
Create a linkdb from parsed segments | X | ||
Merge linkdb-s, with optional filtering | X | ||
Run the Elasticsearch indexer on parsed batches | X | ||
Run the Solr indexer on parsed segments and linkdb - DEPRECATED, use the index command instead | X | X | |
Removes duplicate documents from Solr - DEPRECATED, use the dedup command instead | X | X | |
Removes HTTP 301 and 404 documents from Solr - DEPRECATED, use the clean command instead | X | ||
Run the plugin-based indexer on parsed segments and linkdb | X | ||
Deduplicate entries in the crawldb and give them a special status | X | ||
Remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins | X | ||
Checks the parser for a given URL | X | X | |
Checks the indexing filters for a given URL | X | ||
bin/nutch normalizerchecker | Checks URL normalizers for given URLs | X | |
Calculates domain statistics from crawldb | X | ||
Generates a web graph from existing segments | X | ||
Runs a link analysis program on the generated web graph | X | ||
Updates the crawldb with linkrank scores | X | ||
Dumps the web graph's node scores | X | ||
Loads a plugin and runs the main() of one of its classes | X | X | |
Run a (local) Nutch server on a user-defined port | X | X | |
Run a (local) Nutch WebApp GUI on port 8080 | X | ||
Runs the given JUnit test | X | X | |
Dump out Nutch segments into Common Crawl data format | X | ||
Run the class named CLASSNAME | X | X |
Webgraph classes
- bin/nutch org.apache.nutch.scoring.webgraph.WebGraph
- bin/nutch org.apache.nutch.scoring.webgraph.Loops
- bin/nutch org.apache.nutch.scoring.webgraph.LinkRank
- bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater
- bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper
- bin/nutch org.apache.nutch.scoring.webgraph.NodeReader
- bin/nutch org.apache.nutch.scoring.webgraph.LoopReader
- bin/nutch org.apache.nutch.scoring.webgraph.LinkDumper
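
Since most commands print help when invoked without parameters, one way to explore the classes above is to call each of them with no arguments. The following is a small sketch under that assumption; `NUTCH_BIN` and `print_webgraph_usage` are hypothetical names, and when the script is not found the loop only echoes the command line, so it can be tried outside a Nutch installation.

```shell
#!/bin/sh
# Path to the Nutch helper script; NUTCH_BIN is a hypothetical override variable.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# Class names copied from the list above.
WEBGRAPH_CLASSES="WebGraph Loops LinkRank ScoreUpdater NodeDumper NodeReader LoopReader LinkDumper"

# print_webgraph_usage (hypothetical helper): invoke each class without
# arguments, which is expected to print its usage message.
print_webgraph_usage() {
  for cls in $WEBGRAPH_CLASSES; do
    fqcn="org.apache.nutch.scoring.webgraph.$cls"
    if [ -x "$NUTCH_BIN" ]; then
      "$NUTCH_BIN" "$fqcn"
    else
      # No Nutch installation found: show the command instead of running it.
      echo "$NUTCH_BIN $fqcn"
    fi
  done
}

print_webgraph_usage
```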
Useful Plugin Classes
- bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
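
The entry above appears to follow the general form `bin/nutch plugin <pluginId> <className> [args...]`. That form is an assumption based on the example itself and on the command table, which describes the plugin command as loading a plugin and running the main() of one of its classes. A small sketch that assembles such a command line; `plugin_cmd` is a hypothetical helper and nothing is executed:

```shell
#!/bin/sh
# plugin_cmd (hypothetical helper): build a "bin/nutch plugin" command line
# from a plugin id, a class name, and optional extra arguments.
plugin_cmd() {
  plugin_id="$1"
  class_name="$2"
  shift 2
  if [ "$#" -gt 0 ]; then
    # Remaining arguments are passed through after the class name.
    echo "bin/nutch plugin $plugin_id $class_name $*"
  else
    echo "bin/nutch plugin $plugin_id $class_name"
  fi
}

plugin_cmd urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
```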
Other Classes
- bin/nutch org.apache.nutch.net.URLFilterChecker
- bin/nutch org.apache.nutch.net.URLNormalizerChecker
- bin/nutch org.apache.nutch.tools.CrawlDBScanner
- bin/nutch org.apache.nutch.protocol.RobotRulesParser