Integrating Nutch search functionality into a Java application
Note: This page is extremely out of date. It is not useful for modern versions of Nutch.
This example is the fruit of much searching through the Nutch users mailing list for a working application that uses the Nutch APIs. I couldn't find everything needed for a quick start in one place, so this document was born...
Using Nutch within an application is actually very simple; the requirements are merely the existence of a previously created crawl index, a couple of settings in a configuration file, and a handful of jars in your classpath. Nothing else is needed from the Nutch release that you can download.
This example assumes that an index has been created in the directory /home/nutch-java-demo/crawl-dir and that a copy of the 'plugins' folder from the Nutch distribution is in the directory /home/nutch-java-demo/plugins. This directory tree is completely external to the deployment of the Java application.
For the search to work, some appropriate settings need to be in a file called nutch-site.xml. If you have read the first part of this document, this file will be familiar to you. While you could use the same version of that file as before, there is no need to do so, as only two properties are required within it:
1) plugin.folders must be a fully qualified path, such as /home/nutch-java-demo/plugins.
This should point to a folder containing all the Nutch plugins. The folder can be placed anywhere within the filesystem and has no dependency on any other files distributed with Nutch.
2) searcher.dir must be a fully qualified path to the crawl directory you want to use, such as /home/nutch-java-demo/crawl-dir.
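As a rough illustration of the two settings above, the sketch below embeds a minimal nutch-site.xml as a string, using the usual Hadoop-style configuration/property/name/value layout and the directories assumed earlier on this page. It then checks, with the JDK's XML parser only, that both properties are present and hold absolute paths. The class and method names (NutchSiteCheck, readProperties, isAbsolutePathProperty) are made up for this example and are not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for this page; not a Nutch class.
class NutchSiteCheck {

    // Minimal nutch-site.xml content, using the example directories from this page.
    static final String EXAMPLE_SITE_XML =
        "<?xml version=\"1.0\"?>\n"
        + "<configuration>\n"
        + "  <property>\n"
        + "    <name>plugin.folders</name>\n"
        + "    <value>/home/nutch-java-demo/plugins</value>\n"
        + "  </property>\n"
        + "  <property>\n"
        + "    <name>searcher.dir</name>\n"
        + "    <value>/home/nutch-java-demo/crawl-dir</value>\n"
        + "  </property>\n"
        + "</configuration>\n";

    /** Parses Hadoop-style configuration XML into a name-to-value map. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** True when the property exists and its value is an absolute path on this platform. */
    static boolean isAbsolutePathProperty(Map<String, String> props, String name) {
        String value = props.get(name);
        return value != null && Paths.get(value).isAbsolute();
    }

    public static void main(String[] args) {
        Map<String, String> props = readProperties(EXAMPLE_SITE_XML);
        for (String name : new String[] {"plugin.folders", "searcher.dir"}) {
            System.out.println(name + " = " + props.get(name)
                + " (absolute: " + isAbsolutePathProperty(props, name) + ")");
        }
    }
}
```

The embedded XML is the part that matters for Nutch; the Java around it is only a convenient way to sanity-check the file before deploying it.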
Place this copy of nutch-site.xml and a copy of common-terms.utf8 (from the conf directory in the Nutch distribution) in the WEB-INF/classes directory of the web application that you're deploying. For a standalone application, both files have to be available on the classpath. In some cases, you might want the extra flexibility of runtime configuration parameters. This can be achieved using variable substitution: a value in nutch-site.xml refers to a variable, and the actual value is supplied as a parameter when the Java application is run.
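To make the idea of variable substitution concrete, here is a minimal stand-alone sketch of how a ${name} placeholder in a configuration value can be resolved from a Java system property (the kind of value passed with -D on the command line). It only imitates the mechanism with plain JDK classes and does not call the Hadoop or Nutch configuration code. The property name nutch.demo.home and the class name are assumptions chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; Nutch performs substitution inside its own configuration classes.
class VariableSubstitutionSketch {

    // Matches placeholders of the form ${some.name}.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with the system property of that name; unknown names are left untouched. */
    static String expand(String value) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = System.getProperty(m.group(1));
            if (replacement == null) {
                replacement = m.group(0); // keep the placeholder as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stands in for starting the JVM with -Dnutch.demo.home=/home/nutch-java-demo
        System.setProperty("nutch.demo.home", "/home/nutch-java-demo");
        System.out.println(expand("${nutch.demo.home}/plugins"));
        System.out.println(expand("${nutch.demo.home}/crawl-dir"));
    }
}
```

With a scheme like this, the same nutch-site.xml can be shipped unchanged to several machines, and only the start-up parameter differs between them.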
You also need to make sure that the following jars are placed in WEB-INF/lib (this assumes usage of Nutch 0.9):
For a standalone application, one might want to use Apache Maven (this configuration assumes Nutch 1.1). At the time of writing, Nutch does not publish its artifacts to any Maven repository. However, we (members of the community) hope that Maven support will be added soon. In the meantime, just install nutch-1.1.jar into your local Maven repository. Here is a snippet that will manage the dependencies that you need to run this example (note that the 1.1-XXX version of Nutch marks the fact that the artifact cannot be found in any public repository yet):
With that, all is ready and we can now write some simple code to search. A quick example in Java to search the crawl index and return the number of hits found is:
Extra information about developing a standalone application that does the search can be obtained by inspecting the main method in org.apache.nutch.searcher.NutchBean.
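When a first attempt fails, the cause is often that one of the files or jars described on this page is not actually visible to the application. The following is a small preflight sketch, using only JDK calls, that reports whether nutch-site.xml and common-terms.utf8 can be found as classpath resources and whether the org.apache.nutch.searcher.NutchBean class can be located. The class and method names here are hypothetical helpers for this page, not Nutch APIs.

```java
import java.net.URL;

// Hypothetical diagnostic helper; not part of the Nutch distribution.
class NutchClasspathCheck {

    /** True when the named resource can be found on the classpath. */
    static boolean resourceVisible(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = NutchClasspathCheck.class.getClassLoader();
        }
        URL url = cl.getResource(name);
        return url != null;
    }

    /** True when the named class can be located, without initializing it. */
    static boolean classVisible(String className) {
        try {
            Class.forName(className, false, NutchClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The two files this page says must be in WEB-INF/classes or on the classpath.
        for (String resource : new String[] {"nutch-site.xml", "common-terms.utf8"}) {
            System.out.println(resource + " on classpath: " + resourceVisible(resource));
        }
        // The search entry point named on this page; requires the Nutch jar.
        System.out.println("NutchBean class found: "
            + classVisible("org.apache.nutch.searcher.NutchBean"));
    }
}
```

Run without the Nutch jars and configuration files in place, every line reports false; each one should flip to true as the corresponding file or jar is added.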
Chaz Hickman (Jan 2008)
Cristi Vulpe (Aug 2010)