
...

  • run "bin/nutch" - you can confirm a correct installation if you see output similar to the following:
No Format

Usage: nutch COMMAND where command is one of:
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
...

...

  • Run the following command if you are seeing "Permission denied":
No Format

chmod +x bin/nutch
  • Set JAVA_HOME if you are seeing a "JAVA_HOME is not set" error. On Mac, you can run the following command or add it to ~/.bashrc:
No Format

export JAVA_HOME=$(/usr/libexec/java_home)
# /usr/libexec/java_home prints the path of the default installed JDK;
# note that the actual path may be different on your system

On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:

No Format

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

You may also have to update your /etc/hosts file. If so, you can add the following:

No Format

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost.localdomain localhost LMC-032857
::1             ip6-localhost ip6-loopback
fe80::1%lo0     ip6-localhost ip6-loopback

...

  • Default crawl properties can be viewed and edited in conf/nutch-default.xml; most of them can be used without modification.
  • The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that override those in conf/nutch-default.xml. The only required modification for this file is to override the value field of the http.agent.name property.
    • i.e. add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:
No Format

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
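For reference, the property above sits inside a standard Hadoop-style configuration file. A sketch of writing a minimal complete conf/nutch-site.xml in one step ("My Nutch Spider" is just a placeholder agent name; the conf/ path is assumed relative to the Nutch runtime directory):

```shell
# write a minimal conf/nutch-site.xml that overrides http.agent.name
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
```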

...

  • mkdir -p urls
  • cd urls
  • touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl).
No Format

http://nutch.apache.org/
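The three steps above can also be condensed into a single snippet (same paths as in the tutorial):

```shell
# create the urls/ directory and a seed.txt with one URL per line
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
```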

(Optional) Configure Regular Expression Filters

Edit the file conf/regex-urlfilter.txt and replace

No Format

# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

No Format

+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
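You can sanity-check the expression against sample URLs with grep before running a crawl. Note that the leading + in regex-urlfilter.txt is Nutch's accept marker, not part of the regular expression itself:

```shell
# URLs matching the expression are the ones the filter would accept;
# only the two nutch.apache.org URLs should be printed
printf 'http://nutch.apache.org/\nhttps://wiki.nutch.apache.org/\nhttp://example.com/\n' \
  | grep -E '^https?://([a-z0-9-]+\.)*nutch\.apache\.org/'
```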

...

This step injects the URLs from the seed list created above into the crawl database:

No Format

bin/nutch inject crawl/crawldb urls

...

To fetch, we first generate a fetch list from the database:

No Format

bin/nutch generate crawl/crawldb crawl/segments

This generates a fetch list for all of the pages due to be fetched. The fetch list is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:

No Format

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
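The backquoted expression works because segment names are creation timestamps, so lexicographic order equals chronological order and `tail -1` always yields the newest segment. A standalone demonstration of the idiom on dummy directories (the demo/ names here are made up):

```shell
# segment directories are named by timestamp; `tail -1` picks the newest
mkdir -p demo/segments/20240101120000 demo/segments/20240315093000
s=$(ls -d demo/segments/2* | tail -1)
echo "$s"
```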

Now we run the fetcher on this segment with:

No Format

bin/nutch fetch $s1

Then we parse the entries:

No Format

bin/nutch parse $s1

When this is complete, we update the database with the results of the fetch:

No Format

bin/nutch updatedb crawl/crawldb $s1

...

Now we generate and fetch a new segment containing the top-scoring 1,000 pages:

No Format

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

Let's fetch one more round:

No Format

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3
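The rounds above repeat the same generate / fetch / parse / updatedb sequence, so it can be wrapped in a small helper. A sketch, assuming it is run from the Nutch runtime directory (crawl_round is a helper name introduced here, not part of Nutch):

```shell
# one crawl round wrapped as a function; $1 is the path to the nutch launcher
crawl_round() {
  "$1" generate crawl/crawldb crawl/segments -topN 1000
  segment=$(ls -d crawl/segments/2* | tail -1)
  "$1" fetch "$segment"
  "$1" parse "$segment"
  "$1" updatedb crawl/crawldb "$segment"
}
# usage (with a real Nutch install): for i in 1 2 3; do crawl_round bin/nutch; done
```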

...

Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

No Format

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

...

Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.

Nutch   Solr
1.21    8.11.4
1.20    8.11.2
1.19    8.11.2
1.18    8.5.1
1.17    8.5.1
1.16    7.3.1
1.15    7.3.1
1.14    6.6.0
1.13    5.5.0
1.12    5.4.1

...

  • make sure that there is no managed-schema "in the way":

    No Format
    
    rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
    


  • start the solr server

    No Format
    
    ${APACHE_SOLR_HOME}/bin/solr start
    


  • create the nutch core

    No Format
    
    ${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
    


...

After starting Solr, you should be able to access the admin console at the following link:

No Format

http://localhost:8983/solr/#/

...