...
- Run `bin/nutch`. You can confirm a correct installation if you see output similar to the following:

```
Usage: nutch COMMAND where command is one of:
  readdb      read / dump crawl db
  mergedb     merge crawldb-s, with optional filtering
  readlinkdb  read / dump link db
  inject      inject new urls into the database
  generate    generate new segments to fetch from crawl db
  freegen     generate new segments to fetch from text files
  fetch       fetch a segment's pages
  ...
```
...
- Run the following command if you see "Permission denied":

```
chmod +x bin/nutch
```
- Set `JAVA_HOME` if you see "JAVA_HOME not set". On Mac, you can run the following command or add it to `~/.bashrc`:

```
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/11/Home
# note that the actual path may be different on your system
```
On Debian or Ubuntu, you can run the following command or add it to `~/.bashrc`:

```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
```
You may also have to update your /etc/hosts file. If so, you can add the following:

```
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1       localhost.localdomain localhost LMC-032857
::1             ip6-localhost ip6-loopback
fe80::1%lo0     ip6-localhost ip6-loopback
```
...
- Default crawl properties can be viewed and edited within `conf/nutch-default.xml`; most of these can be used without modification.
- The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that override `conf/nutch-default.xml`. The only required modification for this file is to set the `value` field of the `http.agent.name` property, i.e. add your agent name there, for example:

```
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
```
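For reference, `conf/nutch-site.xml` is a Hadoop-style configuration file, so the property above has to sit inside a `<configuration>` root element. A minimal sketch of the whole file (the agent name "My Nutch Spider" is just an example) might look like:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
```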
...
Run

```
mkdir -p urls
cd urls
touch seed.txt
```

to create a text file `seed.txt` under `urls/` with the following content (one URL per line for each site you want Nutch to crawl):

```
http://nutch.apache.org/
```
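If you prefer a single copy-pasteable step, the same seed file can be created with a heredoc (a sketch; add one URL per line for every site you want crawled):

```shell
# create urls/seed.txt without changing directory
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://nutch.apache.org/
EOF
```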
(Optional) Configure Regular Expression Filters
Edit the file conf/regex-urlfilter.txt and replace
```
# accept anything else
+.
```
with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:
```
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
```
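Because the patterns in `regex-urlfilter.txt` are plain extended regular expressions, you can sanity-check one with `grep -E` before relying on it (a sketch; the sample URL is arbitrary):

```shell
# the leading "+" in regex-urlfilter.txt means "accept"; the pattern itself
# can be tested against a candidate URL with grep -E
echo "https://nutch.apache.org/downloads.html" \
  | grep -Eq '^https?://([a-z0-9-]+\.)*nutch\.apache\.org/' \
  && echo "accepted"
```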
...
This step injects the URLs from the seed list created above into the crawl database:

```
bin/nutch inject crawl/crawldb urls
```
...
To fetch, we first generate a fetch list from the database:
```
bin/nutch generate crawl/crawldb crawl/segments
```
This generates a fetch list for all of the pages due to be fetched. The fetch list is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:
```
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
```
Now we run the fetcher on this segment with:
```
bin/nutch fetch $s1
```
Then we parse the entries:
```
bin/nutch parse $s1
```
When this is complete, we update the database with the results of the fetch:
```
bin/nutch updatedb crawl/crawldb $s1
```
...
Now we generate and fetch a new segment containing the top-scoring 1,000 pages:
```
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
```
Let's fetch one more round:
```
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3
```
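The generate/fetch/parse/updatedb round above is identical every time, so it can be wrapped in a small shell function (a sketch; `crawl_rounds` is a hypothetical helper name, and it assumes you run it from the Nutch runtime directory):

```shell
# run N generate/fetch/parse/updatedb rounds against the same crawldb
crawl_rounds() {
  rounds=$1
  i=0
  while [ "$i" -lt "$rounds" ]; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg=$(ls -d crawl/segments/2* | tail -1)  # newest segment directory
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
    bin/nutch updatedb crawl/crawldb "$seg"
    i=$((i + 1))
  done
}
```

Usage: `crawl_rounds 2` performs two more rounds, equivalent to the s2/s3 steps above.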
...
Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
```
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```
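If you want to inspect the inverted links, the linkdb can be dumped to plain text with `readlinkdb` (a sketch; the output directory name `crawl/linkdb-dump` is arbitrary, and the command is guarded so it only runs inside a Nutch runtime directory):

```shell
# dump the link database to text files for inspection
if [ -x bin/nutch ]; then
  bin/nutch readlinkdb crawl/linkdb -dump crawl/linkdb-dump
fi
```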
...
Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.
| Nutch | Solr   |
|-------|--------|
| 1.21  | 8.11.4 |
| 1.20  | 8.11.2 |
| 1.19  | 8.11.2 |
| 1.18  | 8.5.1  |
| 1.17  | 8.5.1  |
| 1.16  | 7.3.1  |
| 1.15  | 7.3.1  |
| 1.14  | 6.6.0  |
| 1.13  | 5.5.0  |
| 1.12  | 5.4.1  |
...
make sure that there is no managed-schema "in the way":

```
rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
```

start the Solr server:

```
${APACHE_SOLR_HOME}/bin/solr start
```

create the nutch core:

```
${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
```
...
After Solr has started, you should be able to access the admin console at:

```
http://localhost:8983/solr/#/
```
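Once documents have been indexed, the core can also be queried directly over HTTP (a sketch; it assumes the core is named `nutch` and Solr listens on the default port 8983, so the `curl` call is left commented out):

```shell
# build a query URL that returns the first five indexed documents
SOLR_URL='http://localhost:8983/solr/nutch/select?q=*:*&rows=5'
# with Solr running, execute the query:
# curl "$SOLR_URL"
echo "$SOLR_URL"
```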
...