...

This document provides instructions for setting up a development environment for Nutch within the Eclipse IDE. It is intended as a comprehensive starting resource for configuring, building, crawling with, and debugging the Nutch master branch in Eclipse.

N.B. The Nutch 2.x branch is no longer in development, so all documentation relating to that branch has been removed from this guide. If you are interested in that information, contact the Nutch dev@ team.


Before you start

...

  1. Get the latest source code using Git from the terminal. For Nutch 1.x (i.e. the master branch), run this:

    No Format
     git clone https://github.com/apache/nutch.git
     cd nutch
     

  2. Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also, add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. For example, if Nutch is present at "/home/tejas/Desktop/nutch", set the property to:

    No Format
     <property>
       <name>plugin.folders</name>
       <value>/home/tejas/Desktop/nutch/build/plugins</value>
     </property>
     


  3. Run this command:

    No Format
      ant eclipse
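
A missing or empty property in “conf/nutch-site.xml” is easy to overlook, so it can help to check the file before importing the project into Eclipse. The following is a minimal sketch of such a check, not part of Nutch itself: the class name NutchSiteCheck and its methods are hypothetical, the list of required names is taken only from the configuration step above, and Java 11 or later is assumed.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not shipped with Nutch) that checks conf/nutch-site.xml. */
public class NutchSiteCheck {

    // Assumption: these are the only properties checked, as named in this guide.
    public static final List<String> REQUIRED =
            List.of("http.agent.name", "http.robots.agents", "plugin.folders");

    /** Reads every <property> element into a name -> value map. */
    public static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                String value = childText(property, "value");
                props.put(name, value == null ? "" : value);
            }
        }
        return props;
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** Lists the required property names that are absent or have an empty value. */
    public static List<String> missingRequired(Map<String, String> props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.get(key);
            if (value == null || value.isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the path used in this guide, relative to the Nutch checkout.
        Path site = Path.of(args.length > 0 ? args[0] : "conf/nutch-site.xml");
        try (InputStream in = Files.newInputStream(site)) {
            List<String> missing = missingRequired(readProperties(in));
            System.out.println(missing.isEmpty()
                    ? "All required properties are set."
                    : "Missing or empty: " + missing);
        }
    }
}
```

Run from the root of the Nutch checkout, the program reads “conf/nutch-site.xml” by default; a different path can be passed as the first argument. It only verifies that the properties are present and non-empty, not that their values are sensible.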
     


...