
...

Nutch is a mature, production-ready web crawler. Nutch 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing. Being pluggable and modular of course has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, SolrCloud, etc. We can find web page hyperlinks in an automated manner, reduce lots of maintenance work (for example, checking broken links), and create a copy of all the visited pages for searching over. This tutorial explains how to use Nutch with Apache Solr. Solr is an open source full-text search framework; with Solr we can search the pages acquired by Nutch. Apache Nutch supports Solr out of the box, simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from here.
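For example, fetching and unpacking a binary release looks roughly like this (a sketch, assuming the 1.21 release, the dlcdn.apache.org mirror and a Unix-like shell; adjust the version and mirror URL to whatever the download page offers):

No Format
# download and unpack the Nutch binary release (version and mirror are assumptions)
wget https://dlcdn.apache.org/nutch/1.21/apache-nutch-1.21-bin.tar.gz
tar -xzf apache-nutch-1.21-bin.tar.gz
cd apache-nutch-1.21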

Learning Outcomes

By the end of this tutorial you will

...

  • run "bin/nutch" - You can confirm a correct installation if you see something similar to the following:
No Format
nutch 1.21
Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]...
where COMMAND is one of:
 (Crawl commands)
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  updatedb          update crawl db from segments after fetching

 (CrawlDb commands)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
...

Some troubleshooting tips:

...

No Format
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
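In context, a minimal conf/nutch-site.xml carrying this property looks like the following sketch (the agent name value itself is your choice):

No Format
<?xml version="1.0"?>
<configuration>
 <!-- identifies your crawler to the sites it fetches; any non-empty name works -->
 <property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
 </property>
</configuration>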

...

  • A URL seed list is a list of websites, one per line, which Nutch will look to crawl
  • The file conf/regex-urlfilter.txt provides Regular Expressions that allow Nutch to filter and narrow the types of web resources to crawl and download; an example line is sketched after this list
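
For instance, a filter line of the following shape keeps the crawl within a single domain (a sketch; adapt the domain to the sites in your seed list):

No Format
# accept URLs within the nutch.apache.org domain; everything else falls
# through to the remaining rules in conf/regex-urlfilter.txt
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/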

...

  • mkdir -p urls
  • cd urls
  • touch seed.txt to create a text file seed.txt under urls/, then fill it with the following content (one URL per line for each site you want Nutch to crawl).
No Format
https://nutch.apache.org/
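
Equivalently, the three steps above collapse into a single shell line (assuming a Unix-like shell):

No Format
mkdir -p urls && echo "https://nutch.apache.org/" > urls/seed.txt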

(Optional) Configure Regular Expression Filters

...

Duplicates (identical content but different URLs) are optionally marked in the CrawlDb and can then be deleted from the Solr index.
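
Marking duplicates is done with the dedup job, along these lines (a sketch, assuming the CrawlDb lives at crawl/crawldb as in the default layout of this tutorial):

No Format
# mark duplicate records in the CrawlDb; the index cleaning job can then
# remove them from Solr
bin/nutch dedup crawl/crawldb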

...

  • Map: identity map where keys are digests and values are CrawlDatum records
  • Reduce: CrawlDatums with the same digest are all marked as duplicates except one. Multiple heuristics are available for choosing the record that is kept: the one with the shortest URL, the one fetched most recently, or the one with the highest score.
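
Recent 1.x releases let you choose the order in which these heuristics are applied via a -compareOrder option on the dedup job; the sketch below assumes that option is available in your release (verify against the usage output of bin/nutch dedup):

No Format
# prefer the highest score, then the most recent fetch, then the shortest URL
# (-compareOrder availability is an assumption; check bin/nutch dedup usage)
bin/nutch dedup crawl/crawldb -compareOrder score,fetchTime,urlLength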

...