...
This document provides instructions for setting up a development environment for Nutch within the Eclipse IDE. It is intended as a comprehensive starting point for configuring, building, crawling with, and debugging the Nutch master branch in Eclipse.
N.B. The Nutch 2.x branch is no longer in development, so all documentation relating to that branch has been removed from this guide. If you are interested in that information, contact the Nutch dev@ team.
Before you start
...
1. Get the latest source code from Git using the terminal. For Nutch 1.x (i.e. the master branch), run this:

    git clone https://github.com/apache/nutch.git
    cd nutch
2. Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also, add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. E.g. if Nutch is checked out at "/home/tejas/Desktop/nutch", set the property to:

    <property>
      <name>plugin.folders</name>
      <value>/home/tejas/Desktop/nutch/build/plugins</value>
    </property>
3. Run this command:

    ant eclipse
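To make the configuration step concrete, here is a minimal sketch of how a complete “conf/nutch-site.xml” might look once the three properties are set. The agent name "MyNutchCrawler" and the "http.robots.agents" value are hypothetical placeholders, not values from this guide. The exact meaning of "http.robots.agents" has differed between Nutch versions, so check its description in conf/nutch-default.xml for your checkout before copying this.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical crawler name (an assumption for illustration); use your own. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>

  <!-- Placeholder value (an assumption); see conf/nutch-default.xml for the
       semantics of this property in your Nutch version. -->
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchCrawler</value>
  </property>

  <!-- Path assumes Nutch was checked out at /home/tejas/Desktop/nutch,
       i.e. {PATH_TO_NUTCH_CHECKOUT}/build/plugins. -->
  <property>
    <name>plugin.folders</name>
    <value>/home/tejas/Desktop/nutch/build/plugins</value>
  </property>
</configuration>
```

Using an absolute path for "plugin.folders" follows the example in the configuration step. Adjust it to wherever your own checkout lives.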
...