...
Nutch is a well-matured, production-ready web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing. Being pluggable and modular of course has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, SolrCloud, etc. We can find web page hyperlinks in an automated manner, reduce lots of maintenance work (for example, checking broken links), and create a copy of all the visited pages for searching over. This tutorial explains how to use Nutch with Apache Solr. Solr is an open source full-text search framework; with Solr we can search the pages acquired by Nutch. Apache Nutch supports Solr out of the box, simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch web application and upon Apache Lucene for indexing. Just download a binary release from here.
Learning Outcomes
By the end of this tutorial you will
...
- run "bin/nutch" - You can confirm a correct installation if you see something similar to the following:
| No Format |
|---|
| nutch 1.21 |
| Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]... |
| where COMMAND is one of: |
| readdb read / dump crawl db |
| mergedb merge crawldb-s, with optional filtering |
| readlinkdb read / dump link db |
| (Crawl commands) |
| inject inject new urls into the database |
| generate generate new segments to fetch from crawl db |
| freegen generate new segments to fetch from a text file |
| fetch fetch a segment's pages |
| parse parse a segment's pages |
| updatedb update crawl db from segments after fetching |
| (CrawlDb commands) |
| ... |
Some troubleshooting tips:
...
| No Format |
|---|
<property> <name>http.agent.name</name> <value>My Nutch Spider</value> </property> |
...
- A URL seed list includes a list of websites, one per line, which Nutch will look to crawl
- The file conf/regex-urlfilter.txt will provide Regular Expressions that allow Nutch to filter and narrow the types of web resources to crawl and download
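As an illustration of the kind of rules this file holds, a minimal regex-urlfilter.txt fragment might look like the following (the domain is only an example, not from this tutorial; rules are evaluated top to bottom, and the first matching rule's + or - sign decides whether a URL is kept):

```
# skip URLs containing characters that usually indicate queries or sessions
-[?*!@=]
# accept anything under the nutch.apache.org domain (example domain)
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
# skip everything else
-.
```

Because the first match wins, the catch-all `-.` rule must come last, otherwise it would reject every URL before the accept rule is reached.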
...
Run mkdir -p urls, then cd urls and touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl).
| No Format |
|---|
https://nutch.apache.org/ |
(Optional) Configure Regular Expression Filters
...
Duplicates (identical content but different URL) are optionally marked in the CrawlDb and are later deleted from the Solr index.
...
- Map: Identity map where keys are digests and values are CrawlDatum records
- Reduce: CrawlDatums with the same digest are all marked as duplicates except one. Multiple heuristics are available to choose the item that is kept: the one with the shortest URL, the one fetched most recently, or the one with the highest score.
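The reduce step above can be sketched as follows. This is a minimal Python illustration of the selection logic, not Nutch's actual Java implementation; the record fields and heuristic names are assumptions based on the description above:

```python
from dataclasses import dataclass

@dataclass
class CrawlDatum:
    url: str
    digest: str       # content hash used as the grouping key
    fetch_time: int   # epoch seconds of the last fetch
    score: float

def mark_duplicates(group, heuristic="shortest_url"):
    """Given all CrawlDatum records sharing one digest, return the set of
    URLs marked as duplicates: every record except the one 'best' item,
    chosen by one of the heuristics the tutorial describes."""
    if heuristic == "shortest_url":
        keep = min(group, key=lambda d: len(d.url))
    elif heuristic == "most_recent":
        keep = max(group, key=lambda d: d.fetch_time)
    elif heuristic == "highest_score":
        keep = max(group, key=lambda d: d.score)
    else:
        raise ValueError(f"unknown heuristic: {heuristic}")
    return {d.url for d in group if d is not keep}

# Usage: two URLs whose fetched content produced the same digest.
group = [
    CrawlDatum("https://nutch.apache.org/", "abc", 100, 1.0),
    CrawlDatum("https://nutch.apache.org/index.html", "abc", 200, 0.5),
]
print(mark_duplicates(group, "shortest_url"))
# the longer URL is the one marked as a duplicate
```

Note that the choice of heuristic changes which record survives: under "most_recent" the second record (fetched at time 200) would be kept instead.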
...