The goal is to extend Apache Nutch with a comfortable web-based administration user interface to monitor, configure and manage one or more Nutch search system instances through the REST API. This will tie together a number of issues, ultimately resulting in a Nutch 2.0 webapp.
Have a distribution of Nutch that only needs to be decompressed and can be started in three different modes:
- Starting Nutch in single master mode - starts a local MapReduce Nutch instance by launching a simple Nutch admin GUI.
- Starting Nutch in multi master mode - starts the Nutch jobtracker, namenode and admin GUI.
- Starting Nutch in worker mode - starts the datanode and tasktracker.
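The three start modes can be sketched as a small lookup from mode to the services it would bring up. This is only an illustration under stated assumptions: the names `StartMode` and `NutchLauncherSketch` and the mode strings (`single-master`, `multi-master`, `worker`) are hypothetical and not part of any existing Nutch script. Only the service names are taken from the list above.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical launcher: prints which services a given mode would start. */
class NutchLauncherSketch {
    public static void main(String[] args) {
        // Default to single master mode when no argument is given (assumption).
        StartMode mode = StartMode.parse(args.length > 0 ? args[0] : "single-master");
        for (String service : mode.services()) {
            System.out.println("would start: " + service);
        }
    }
}

/** The three start modes from the list above, with the services each one implies. */
enum StartMode {
    SINGLE_MASTER(List.of("local-mapreduce", "admin-gui")),
    MULTI_MASTER(List.of("jobtracker", "namenode", "admin-gui")),
    WORKER(List.of("datanode", "tasktracker"));

    private final List<String> services;

    StartMode(List<String> services) {
        this.services = services;
    }

    List<String> services() {
        return services;
    }

    /** Maps a command line argument such as "multi-master" to MULTI_MASTER. */
    static StartMode parse(String arg) {
        return valueOf(arg.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
    }
}
```

A real launcher would start the corresponding daemons instead of printing their names. Keeping the mapping in one place would keep the start script itself thin.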
Command line administration and configuration, as used today, should remain available. The admin GUI described here can never replace a finely tuned Nutch administered through shell scripts, but it can help users get started with Nutch faster and more easily. An easy-to-use web-based configuration and administration GUI would also help to grow the Nutch user base.
This link provides the best working prototype of an example admin GUI. It also provides a wealth of material on the kind and level of functionality the Nutch webapp should support.
A New Nutch Instance
Example Crawl 1
Example Crawl 2
Example Crawl 3
Description Admin Gui:
There are three main functionalities of the admin gui
- Be able to start all Nutch tools, such as the fetch tool, through an API call. This requires minor changes to those tools that already contain some simple processing logic in the main method.
- Starting jobs through an API from a container, and running multiple instances in one JVM, both require the ability to pass a Nutch conf instance through the call stack.
- Centralized data storage in a working folder. For example, storing the index inside a segment folder makes it possible to connect indexes and segment data physically.
- Segment status tracking is required to allow or prohibit functionality in the admin GUI. This can be realized simply by writing status identifier files into the folder, as is already partly done with index.done.
- Add GUI-related information such as valueType, validationPattern and defaultValue to the configuration file.
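The segment status tracking described above could look roughly like the following sketch, which uses empty marker files in the segment folder. Only `index.done` comes from the text. The class name `SegmentStatus`, the markers `fetch.done` and `parse.done`, and the rule in `canIndex()` are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that tracks segment status through "<step>.done" marker files. */
class SegmentStatus {
    private final Path segmentDir;

    SegmentStatus(Path segmentDir) {
        this.segmentDir = segmentDir;
    }

    /** Records that a step finished by creating an empty marker file, e.g. index.done. */
    void markDone(String step) throws IOException {
        Path marker = segmentDir.resolve(step + ".done");
        if (Files.notExists(marker)) {
            Files.createFile(marker);
        }
    }

    /** A step counts as finished when its marker file exists. */
    boolean isDone(String step) {
        return Files.exists(segmentDir.resolve(step + ".done"));
    }

    /** Assumed GUI rule: offer indexing only for fetched and parsed segments without an index. */
    boolean canIndex() {
        return isDone("fetch") && isDone("parse") && !isDone("index");
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for a real segment folder.
        SegmentStatus status = new SegmentStatus(Files.createTempDirectory("segment"));
        status.markDone("fetch");
        status.markDone("parse");
        System.out.println("index action enabled: " + status.canIndex());
    }
}
```

The GUI would then enable or disable an action by checking for the relevant markers, without needing any state beyond the segment folder itself.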
Starting processes in separate JVMs, as the tasktracker does? Split the configuration file into pieces, or add a category node?
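To make the proposed GUI-related property information more concrete, here is a minimal sketch of how a property carrying valueType, validationPattern and defaultValue (plus a category, if a category node is added) might be represented and checked before the GUI writes a value back. The class `GuiProperty` and its methods are hypothetical, and the property name and values in `main` are purely illustrative.

```java
import java.util.regex.Pattern;

/** Hypothetical model of one configuration property enriched with GUI metadata. */
class GuiProperty {
    final String name;
    final String category;      // assumed grouping, e.g. taken from a category node
    final String valueType;     // e.g. "int", "string", "boolean"
    final String defaultValue;
    private final Pattern validationPattern;

    GuiProperty(String name, String category, String valueType,
                String validationPattern, String defaultValue) {
        this.name = name;
        this.category = category;
        this.valueType = valueType;
        this.validationPattern = Pattern.compile(validationPattern);
        this.defaultValue = defaultValue;
    }

    /** A submitted value is valid when the whole string matches the validation pattern. */
    boolean isValid(String value) {
        return value != null && validationPattern.matcher(value).matches();
    }

    /** Falls back to the default value when the submitted value is missing or invalid. */
    String valueOrDefault(String submitted) {
        return isValid(submitted) ? submitted : defaultValue;
    }

    public static void main(String[] args) {
        // Illustrative values only: a positive integer setting with default 10.
        GuiProperty threads = new GuiProperty(
                "fetcher.threads.fetch", "fetcher", "int", "[1-9][0-9]*", "10");
        System.out.println(threads.name + " = " + threads.valueOrDefault("abc"));
    }
}
```

With metadata like this in the configuration file, the GUI could render a suitable input field for each property and reject invalid input before saving.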