This Confluence has been LDAP enabled, if you are an ASF Committer, please use your LDAP Credentials to login. Any problems file an INFRA jira ticket please.

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Nutch Command Line Options of bin/nutch

...

See each entry for details of the command arguments and options.

command

function

version

 

 



1.x

2.x

bin/nutch readdb

Read / dump crawl db

X

X

bin/nutch mergedb

Merge crawldb-s, with optional filtering

X

 


bin/nutch readlinkdb

Read / dump link db

X

 


bin/nutch inject

Inject new urls into the database

X

X

bin/nutch

hostinject

Inject new urls into the hostdatabase

 

X

bin/nutch

generate

Generate new segments to fetch from crawldb

X

X

bin/nutch freegen

Generate new segments to fetch from text files

X

 


bin/nutch fetch

Fetch a segment's pages

X

X

bin/nutch parse

Parse a segment's pages

X

X

bin/nutch readseg

Read / dump segment data

X

 


bin/nutch mergesegs

Merges multiple segments, with optional filtering and slicing

X

 


bin/nutch updatedb

Update crawldb (from segments if in 1.x) after fetching

X

X

bin/nutch updatehostdb

Update hostdb after fetching

 


X

bin/nutch invertlinks

Create a linkdb from parsed segments

X

 


bin/nutch mergelinkdb

Merge's linkdb-s, with optional filtering

X

 


bin/nutch elasticindex

Run the elastic search indexer on parsed batches

 


X

bin/nutch solrindex

Run the solr indexer on parsed segments and linkdb - DEPRECATED use the index command instead

X

X

bin/nutch solrdedup

Removes duplicate documents from solr - DEPRECATED use the dedup command instead

X

X

bin/nutch solrclean

Removes HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead

X

 


bin/nutch index

Run the plugin-based indexer on parsed segments and linkdb

X

 


bin/nutch dedup

Deduplicate entries in the crawldb and give them a special status

X

 


bin/nutch clean

Remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins

X

 


bin/nutch parsechecker

Checks the parser for a given url

X

X

bin/nutch indexchecker

Checks the indexing filters for a given url

X

 


bin/nutch normalizercheckerChecks URL normalizers for given URLsX

bin/nutch domainstats

Calculates domain statistics from crawldb

X

 


bin/nutch webgraph

Generates a web graph from existing segments

X

 


bin/nutch linkrank

Runs a link analysis program on the generated web graph

X

 


bin/nutch scoreupdater

Updates the crawldb with linkrank scores

X

 


bin/nutch nodedumper

Dumps the web graph's node scores

X

 


bin/nutch plugin

Loads a plugin and run one of its classes main()

X

X

bin/nutch nutchserver

run a (local) Nutch server on a user defined port

 


X

bin/nutch webapp

run a (local) Nutch WebApp GUI on port 8080

 


X

bin/nutch junit

Runs the given JUnit test

X

X

bin/nutch commoncrawldump

Dump out Nutch segments into Common Crawl data format

X

 


bin/nutch CLASSNAME

run the class named CLASSNAME

X

X

Webgraph classes

  • bin/nutch org.apache.nutch.scoring.webgraph.WebGraph
  • bin/nutch org.apache.nutch.scoring.webgraph.Loops
  • bin/nutch org.apache.nutch.scoring.webgraph.LinkRank
  • bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater
  • bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper
  • bin/nutch org.apache.nutch.scoring.webgraph.NodeReader
  • bin/nutch org.apache.nutch.scoring.webgraph.LoopReader
  • bin/nutch org.apache.nutch.scoring.webgraph.LinkDumper

Useful Plugin Classes

  • bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer

Other Classes

back to FrontPage