Running Nutch in Eclipse

This document provides instructions for setting up a development environment for Nutch within the Eclipse IDE. It is intended to provide a comprehensive beginning resource for the configuration, building, crawling and debugging of the Nutch master branch (trunk) in the above context.

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line. However, being able to debug Nutch in Eclipse is very useful, especially when applying and testing patches or pull requests, because it lets you see them working in a larger context. That said, you will still benefit greatly from looking at the hadoop.log output. This tutorial covers a fully internal Eclipse/Nutch setup, using only Eclipse tools and associated plugins.

...

You need to have Apache Ant installed and configured on your system.
Grab the newest version of Eclipse available here.
All of the following should be available from the Eclipse Marketplace. If not, you can download them through Eclipse as follows.
Once you've set up Eclipse, download Subclipse as per here. N.B. If you experience an error with the 1.8.x release, try 1.6.x. This tends to solve compatibility problems.
Grab the IvyDE plugin for Eclipse as here.
Grab the m2e plugin for Eclipse here.

Steps

Checkout and Build Nutch

1. Get the latest source code from SVN using a terminal. For Nutch 1.x (i.e. trunk) run this:

  svn co https://svn.apache.org/repos/asf/nutch/trunk
  cd trunk

For Nutch 2.x run this:

  svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
  cd 2.x

For Nutch 1.x (i.e. trunk), skip ahead to step 5.
2. At this point you should have decided which data store you want to use. See the Apache Gora documentation for more information. Here are a few of the available storage classes:

  org.apache.gora.hbase.store.HBaseStore
  org.apache.gora.cassandra.store.CassandraStore
  org.apache.gora.accumulo.store.AccumuloStore
  org.apache.gora.avro.store.AvroStore
  org.apache.gora.avro.store.DataFileAvroStore

In “conf/nutch-site.xml” add the storage class name. E.g. if you pick HBase as the data store, add this to “conf/nutch-site.xml”:

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>

3. In ivy/ivy.xml, uncomment the dependency for the data store that you selected. E.g. if you plan to use HBase, uncomment this line:

  <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

4. Set the default data store in conf/gora.properties. E.g. for HBase as the data store, put this in conf/gora.properties:

  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

5. Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also, add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. E.g. if Nutch is present at "/home/tejas/Desktop/2.x", set the property to:

  <property>
    <name>plugin.folders</name>
    <value>/home/tejas/Desktop/2.x/build/plugins</value>
  </property>

6. Run this command:

  ant eclipse
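
The configuration from steps 2 to 5 can be sanity-checked from the checkout directory with a few shell functions. This is only a sketch under stated assumptions: the function names are made up for illustration, the checks rely on plain text matching rather than a real XML parser, and the data store comparison is only meaningful for 2.x.

```shell
# Hypothetical helpers, not part of Nutch. They assume the property layout
# shown in the steps above and use plain text matching, not an XML parser.

# Report which of the step 5 properties appear in a nutch-site.xml file.
# Returns non-zero if at least one of them is missing.
check_nutch_site() {
  missing=0
  for prop in http.agent.name http.robots.agents plugin.folders; do
    if grep -q "<name>${prop}</name>" "$1"; then
      echo "ok: ${prop}"
    else
      echo "missing: ${prop}"
      missing=1
    fi
  done
  return "$missing"
}

# Print the storage class configured in nutch-site.xml (2.x only).
site_store_class() {
  tr -d '\n' < "$1" |
    sed -n 's/.*<name>storage\.data\.store\.class<\/name>[[:space:]]*<value>\([^<]*\)<\/value>.*/\1/p'
}

# Print the default data store from gora.properties, ignoring commented lines.
gora_default_store() {
  sed -n 's/^gora\.datastore\.default=//p' "$1"
}

# Run the checks only when executed inside a checkout.
if [ -f conf/nutch-site.xml ]; then
  check_nutch_site conf/nutch-site.xml || echo "some properties still need to be added"
  if [ -f conf/gora.properties ] && [ -n "$(site_store_class conf/nutch-site.xml)" ]; then
    if [ "$(site_store_class conf/nutch-site.xml)" = "$(gora_default_store conf/gora.properties)" ]; then
      echo "data store settings agree"
    else
      echo "data store settings differ"
    fi
  fi
fi
```

Outside a checkout the script prints nothing, so it is safe to paste into a terminal first and run from the Nutch directory afterwards.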

Load project in Eclipse

...

1. In Eclipse, click on “File” -> “Import...”
2. Select “Existing Projects into Workspace”
3. In the next window, set the root directory to the location where you checked out Nutch 2.x (or trunk). Click “Finish”.
4. You will now see a new project named 2.x (or trunk) being added in the workspace. Wait for a moment until Eclipse refreshes its SVN cache and builds its workspace. You can see the status at the bottom right corner of Eclipse.
5. In Package Explorer, right click on the project “2.x” (or trunk), select “Build Path” -> “Configure Build Path”

6. In the “Order and Export” tab, scroll down and select “2.x/conf” (or trunk/conf). Click the “Top” button. Eclipse will rebuild the workspace again, but this time it won’t take long.

Create Eclipse launcher

Now, let's get geared up to run something. Let's start off with the inject operation. Right click on the project in “Package Explorer” -> select “Run As” -> select “Run Configurations”. Create a new configuration. Name it "inject".

For 1.x (i.e. trunk): Set the main class as: org.apache.nutch.crawl.Injector
For 2.x: Set the main class as: org.apache.nutch.crawl.InjectorJob

In the arguments tab, for program arguments, provide the path of the input directory which has the seed URLs. Set VM Arguments to “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log”.
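
If you do not have a seed directory yet, one can be created from the project directory as sketched below. The directory name `urls`, the file name `seed.txt` and the example URL are illustrative assumptions, not names that Nutch requires. The last lines simply show the end of the log file named in the VM arguments, if a run has already produced one.

```shell
# Assumed names: "urls" and "seed.txt" are illustrative, not required by Nutch.
SEED_DIR="${SEED_DIR:-urls}"
mkdir -p "$SEED_DIR"

# One URL per line; replace this with the sites you want to crawl.
printf '%s\n' 'https://nutch.apache.org/' > "$SEED_DIR/seed.txt"
echo "Seed directory to pass as program argument: $SEED_DIR"

# After a run, the VM arguments above place the log at logs/hadoop.log
# relative to the launcher's working directory.
if [ -f logs/hadoop.log ]; then
  tail -n 20 logs/hadoop.log
fi
```

With these names, the program argument for the 2.x launcher would be `urls`. In 1.x, the Injector's usage message lists a crawldb path before the URL directory, so the arguments there would look something like `crawl/crawldb urls`, where `crawl/crawldb` is again just an example path.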

...

If you want to find out the Java class corresponding to any command, just peek inside the "src/bin/nutch" script; near the bottom you will find a block of conditionals with one branch per command. Here are the important classes corresponding to the crawl cycle:

Operation	Class in Nutch 1.x (i.e. trunk)	Class in Nutch 2.x
inject	org.apache.nutch.crawl.Injector	org.apache.nutch.crawl.InjectorJob
generate	org.apache.nutch.crawl.Generator	org.apache.nutch.crawl.GeneratorJob
fetch	org.apache.nutch.fetcher.Fetcher	org.apache.nutch.fetcher.FetcherJob
parse	org.apache.nutch.parse.ParseSegment	org.apache.nutch.parse.ParserJob
updatedb	org.apache.nutch.crawl.CrawlDb	org.apache.nutch.crawl.DbUpdaterJob
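
For commands not listed in the table, the mapping can be pulled out of the launcher script instead of reading it by eye. The helper below is a rough sketch: `list_nutch_classes` is a made-up name, and it assumes the script tests `"$COMMAND" = "name"` on one line and assigns `CLASS=...` inside that branch, which may not hold for every revision of the script.

```shell
# Hypothetical helper: print "command class" pairs found in a nutch launcher script.
# Assumption: each branch looks like
#   elif [ "$COMMAND" = "inject" ] ; then
#     CLASS=org.apache.nutch.crawl.Injector
list_nutch_classes() {
  awk '
    # Remember the command name from a line that tests $COMMAND.
    /COMMAND" = "/ { split($0, parts, "\""); cmd = parts[4] }
    # Pair the next CLASS= assignment with the remembered command.
    /CLASS=/ && cmd != "" { sub(/^[ \t]*CLASS=/, ""); print cmd, $0; cmd = "" }
  ' "$1"
}

# Only run against a real checkout if the script exists.
if [ -f src/bin/nutch ]; then
  list_nutch_classes src/bin/nutch
fi
```

If the script in your checkout is laid out differently, a plain `grep -n 'CLASS=' src/bin/nutch` is a simpler fallback that still lists every class assignment.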

Debug Nutch in Eclipse

Set breakpoints and debug a crawl
It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs.
Here are a few good places to set breakpoints in the 1.x codebase:

Fetcher [line: 1115] - run
Fetcher [line: 530] - fetch
Fetcher$FetcherThread [line: 560] - run()
Generator [line: 443] - generate
Generator$Selector [line: 108] - map
OutlinkExtractor [line: 71 & 74] - getOutlinks

Here are a few good places to set breakpoints in the 2.x codebase:

FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url ....
                                   : line 519 : final ProtocolStatus status = output.getStatus();

GeneratorMapper : map() : line 53
GeneratorReducer : reduce() : line 53
OutlinkExtractor : getOutlinks() : line 84
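
The line numbers in both lists belong to the revisions that were current when they were written, so they will probably not match your checkout. Searching for the method name is a quick way to find the current line. The sketch below uses a made-up helper name, `find_method_lines`, and assumes the usual `src/java` source layout.

```shell
# Hypothetical helper: print "line:text" for every occurrence of a fixed
# string (for example a method name) in a source file.
find_method_lines() {
  grep -nF "$2" "$1"
}

# Example: locate getOutlinks in OutlinkExtractor, if run inside a checkout.
SRC=src/java/org/apache/nutch/parse/OutlinkExtractor.java
if [ -f "$SRC" ]; then
  find_method_lines "$SRC" 'getOutlinks(' || echo "getOutlinks not found in $SRC"
fi
```

The number before the first colon in each output line is where the breakpoint can be set in the Eclipse editor.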

...

Check out the Hadoop version that should be used within Nutch trunk.
Configure a Hadoop project similar to the Nutch project within your Eclipse IDE. See this.
Add the Hadoop project as a dependent project of the Nutch project.
You can now also set breakpoints within Hadoop classes such as InputFormat implementations.

Non-ported Plugins to 2.x

...
