Make Luke a Lucene/Solr Module


Student Name : Ajay Bhat

Student e-mail : a.ajay.bhat@gmail.com

Apache Project Name : Apache Lucene

Project Title : Make Luke a Lucene/Solr Module

Mentor name : Mark Miller

Link to project idea JIRA : Make Luke a Lucene/Solr Module

For the ASF-ICFOSS Pilot Mentoring Programme


1. Abstract

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Lucene offers powerful features through a simple API:

  • Scalable, High-Performance Indexing
  • Powerful, Accurate and Efficient Search Algorithms
  • Cross-Platform Solution

Luke accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • Browse by document number, or by term and View documents / copy to clipboard or selectively delete documents from the index
  • Retrieve a ranked list of most frequent terms
  • Execute a search. browse and analyze search results
  • Reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes



Currently the Luke application uses Thinlet (LGPL Licensed API) for many of the GUI features. The UI needs to be ported to an Apache Licensed API for adding it to the trunk. The porting has been done to support Lucene 4.0.0. However lot of work still needs to be done. In this project I am mainly focusing on porting some of the plugins present in the Luke version, so it is more beneficial to the community and users who want to make use of these plugins.

Here I am going to use Apache Pivot (http://pivot.apache.org/) for the porting. Pivot allows developers to easily construct visually-engaging, cross-platform, connected applications in Java or any other JVM language, such as JavaScript, Groovy, or Scala. It is completely open source, and is driven entirely by the software development community.

Current (sandbox) version of Luke can be accessed at : https://svn.apache.org/repos/asf/lucene/sandbox/luke 
The original repo created by Andrzej Bialecki can be accessed at : http://luke.googlecode.com/svn/trunk/

2. Detailed Description

2.1 BXML used in Pivot for the GUI

BXML is an XML-based markup language for simplifying the construction of Java object hierarchies. While it is most often used to define the user interface of an Apache Pivot application, it is not limited to user interface construction, and can actually be used to create hierarchies of any object type.

In BXML, an XML namespace represents a Java package. Declaring a namespace associates the namespace prefix with the package, similar to how the import keyword is used in Java.

For example, the following simple BXML associates the package "com.foo" with the "foo" namespace prefix: 

<foo:Bar xmlns:foo="com.foo"/> 


The above window is created when BXML file is written thusly:

<Window title="Hello BXML!" maximized="true"
   xmlns:bxml="http://pivot.apache.org/bxml"
   xmlns="org.apache.pivot.wtk">
   <Label text="Hello BXML!"
       styles="{font:'Arial bold 24', color:'#ff0000',
           horizontalAlignment:'center', verticalAlignment:'center'}"/>
</Window> 

and then we use the BXML file in our Java source code as:

import org.apache.pivot.beans.BXMLSerializer;
import org.apache.pivot.collections.Map;
import org.apache.pivot.wtk.Application;
import org.apache.pivot.wtk.Display;
import org.apache.pivot.wtk.Window;
public class HelloBXML implements Application {
   private Window window = null;
   @Override
   public void startup(Display display, Map<String, String> properties)
       throws Exception {
       BXMLSerializer bxmlSerializer = new BXMLSerializer();
       window = (Window)bxmlSerializer.readObject(HelloBXML.class, "hello.bxml");
       window.open(display);
   }
   @Override
   public boolean shutdown(boolean optional) {
       if (window != null) {
           window.close();
       }
       return false;
   }
   public static void main(String[] args) {
     DesktopApplicationContext.main(HelloBXML.class, args);
   } 

2.2 Implement Plugins

Apache Pivot API is used to implement the GUI features of the Luke application as it is Apache Licensed and can be used in the Lucene trunk.
 
Tabs for Plugins

Tabbed pane is already done using Pivot API for the Luke application. There is a pane for plugins that is empty in the sandboxed version.A Tab Pane is added as a BXML file as :

<TabPane bxml:id="tabPane">
	<tabs>
		<Border styles="{backgroundColor:11,thickness:0}">
			<TabPane.tabData>
				<content:ButtonData icon="/img/info.gif" text="%lukeWindow_overviewTabText" />
			</TabPane.tabData>
			<content>
				<bxml:include bxml:id="abc" src="MyTab.bxml" />
			</content>
		</Border>
	</tabs>
</TabPane>

Hadoop Plugin

The Hadoop Plugin uses a index used by Hadoop

We can use the RowSequence class in Pivot (http://pivot.apache.org/2.0.2/docs/api/org/apache/pivot/wtk/TablePane.RowSequence.html) to manipulate rows.

The add() method will add a Row object (http://pivot.apache.org/2.0.2/docs/api/org/apache/pivot/wtk/TablePane.Row.html) to the table.

The process of reading number of bytes is done using methods and classes from apache hadoop and lucene: org.apache.hadoop.conf, org.apache.hadoop.fs,org.apache.lucene.index packages

Vocabulary Analysis Tool

The tool basically works using a SlowThread class which extends Thread and the class having run() defined as:

public final void run() {
   app.add(ui);
   try{
	execute();
   }
   catch (Throwable t){
	t.printStackTrace(); 
	app.showStatus(t.getMessage());
   }
   app.remove(ui);
   app.repaint();
 }

Where app is the pane showing results.

The analysis is done as:

SlowThread st = new SlowThread(app) {
     public void execute() {
       try {
         int numAgeGroups = 100;
         float numDocs = ir.maxDoc();
         if (numDocs < numAgeGroups) numAgeGroups = ir.maxDoc();
         float ageTotals[] = new float[numAgeGroups];
         TermsEnum te = MultiFields.getTerms(ir, field).iterator(null);
         while (te.next() != null)
	{
		DocsEnum td = te.docs(null, null, 0); 
		td.nextDoc(); 
		float firstDocId = td.docID();
		int ageBracket = (int) ((firstDocId / numDocs) * numAgeGroups); 
		ageTotals\[ageBracket\]++;
	}

         float total = 0.0f;
         float max = 0.0f;
         for (int i = 0; i < ageTotals.length; i++) {
           if (ageTotals[i] > max) max = ageTotals[i];
           total += ageTotals[i];
           if (i > 0 && cumul){
		ageTotals\[i\] \+= ageTotals\[i - 1\]; // make totals cumulative 
	   }
       }
         Object maxpct = app.find(myUi, "maxpct");
         if (cumul){
		app.setString(maxpct, "text", "100 %"); 
	   }else{ 
		app.setString(maxpct, "text", (float)Math.round(max * 10000.0f / total) / 100.0f + " %");
	   }
         chart.setScores(ageTotals);
         chart.invalidate();
         app.repaint();
       } 
	 catch (Exception e){ 
		app.showStatus("ERROR: " + e.getMessage()); 
	   }
      }
};

This only involves some basic string manipulation, arithmetic and some classes imported from Lucene index, specifically:

org.apache.lucene.index.MultiFields, org.apache.lucene.index.DocsEnum and org.apache.lucene.index.TermsEnum.

So there’s no need of specific GUI needed for analysis. However for displaying we need Pivot.

Analyzer Tool

The Analyzer tool is a lexical analyzer. It analyzes first for tokens, leaving out any special characters and spaces including @ symbol. An example is:

The Analyzer also provides functionality for highlighting selected text.

Right now the highlighting is done by requestFocus() method which works as:

Object inputText = app.find(myUi, "inputText");
   if (as.hasAttribute(OffsetAttribute.class)){
	OffsetAttribute off = (OffsetAttribute)as.getAttribute(OffsetAttribute.class); 
	app.setInteger(inputText, "start", 0); 
	app.setInteger(inputText, "end", off.endOffset()); 
	app.setInteger(inputText, "start", off.startOffset()); 
	app.requestFocus(inputText);
   }

where setInteger() is inherited from superclass which uses Thinlet::

public void setInteger(Object component, String key, int value) {
       Object[] definition = getDefinition(getClass(component), key, "integer");
       if (setInteger(component, (String) definition[1], value, ((Integer) definition[3]).intValue())){
		update(component, definition\[2\]);}
       }

and getDefinition() is :

private static Object[] getDefinition(Object classname, String key, String type) {
       Object currentname = classname;
       while (classname != null) {
           for (int i = 0; i < dtd.length; i += 3) {
               if (dtd[i] == classname) {
                   Object[][] attributes = (Object[][]) dtd[i + 2];
                   if (attributes != null) {
                       for (int j = 0; j < attributes.length; j++) {
                           if (attributes[j][1].equals(key)) {
                               if ((type != null) && (type != attributes[j][0]))
				{
					throw new IllegalArgumentException(attributes\[j\]\[0\].toString()); 
				}
                               return attributes[j];
                          }
                       }
                   }
                   classname = dtd[i + 1];
                   break;
               }
           }
       }
       throw new IllegalArgumentException("unknown " + key + " " + type + " for " + currentname);
   }

Highlighting can be implemented in Pivot as one of two methods:

  • setSelection(int selectionStart, int selectionLength)
  • setSelection(Span selection)

If we already know the offset and length of where we want to select, then we can use the first one:

void doSomething(){
int start = <starting offset of text to select>;
int length = <length of text>;
textArea.clearSelection();
textArea.setSelection(start, length);

Alternatively, we can use the method that involve a Span object:

        Span span = new Span(start, end);
       textArea.setSelection(span);

The token analysis displayed as a Table Pane. In Pivot we implement Table pane using BXML:

<layout:SimpleTablePanes bxml:id="window" title="Table Panes" maximized="true"
   xmlns:bxml="http://pivot.apache.org/bxml"
   xmlns:layout="org.apache.pivot.tutorials.layout"
   xmlns="org.apache.pivot.wtk">
   <Border>
       <TablePane bxml:id="tablePane" styles="{verticalSpacing:1, showHorizontalGridLines:true,
           horizontalSpacing:1, showVerticalGridLines:true}">
           <columns>
               <TablePane.Column width="100"/>
               <TablePane.Column width="2*"/>
               <TablePane.Column width="300"/>
           </columns>
           <TablePane.Row height="-1">
               <TablePane.Filler/>
               <Label text="Type" styles="{horizontalAlignment:'center'}"/>
               <Label text="Implementation" styles="{horizontalAlignment:'center'}"/>
               <Label text="Value" styles="{horizontalAlignment:'center'}"/>
           </TablePane.Row>
           <TablePane.Row height="-1">
               <Label styles="{verticalAlignment:'center'}"/>
           </TablePane.Row>
           <TablePane.Row height="1*">
               <Label styles="{verticalAlignment:'center'}"/>
           </TablePane.Row>
           <TablePane.Row height="1*">
               <Label styles="{verticalAlignment:'center'}"/>
           </TablePane.Row>
       </TablePane>
   </Border>
</layout:SimpleTablePanes>

  and use the BXML file as:

public class SimpleTablePanes extends Window implements Bindable {    
 private TablePane tablePane = null;     
@Override
 public void initialize(Map<String, Object> namespace, URL location, Resources resources){
        tablePane = (TablePane)namespace.get("tablePane"); 
	}
}

giving a pane similar in look to :

2.3 Implement Other Features

Status Bar

A minor feature that is in the Luke application but not in the sandboxed version is a status bar.

It can be implemented using Box Pane.

<BoxPane styles="{padding:4, verticalAlignment:'center'}">
       <Label text="Index : Not added"/>
</BoxPane>

Themes

Another nice, if minor, touch to the application would be different color schemes.
Hex values of the current themes available are as follows.

Unknown macro: {0xece9d0, 0x000000, 0xf5f4f0, 0x919b9a, 0xb0b0b0, 0xeeeeee, 0xb9b9b9, 0xff8080, 0xc5c5dd}

Default

Unknown macro: {0xe6e6e6, 0x000000, 0xffffff, 0x909090, 0xb0b0b0, 0xededed, 0xb9b9b9, 0x89899a, 0xc5c5dd}

Gray

Unknown macro: {0xeeeecc, 0x000000, 0xffffff, 0x999966, 0xb0b096, 0xededcb, 0xcccc99, 0xcc6600, 0xffcc66}

Sandstone

Unknown macro: {0xf0f0ff, 0x0000a0, 0xffffff, 0x8080ff, 0xb0b0b0, 0xededed, 0xb0b0ff, 0xff0000, 0xfde0e0}

Sky

Unknown macro: {0x6375d6, 0xffffff, 0x7f8fdd, 0xd6dff5, 0x9caae5, 0x666666, 0x003399, 0xff3333, 0x666666}

Navy

Comparison of gray and sandstone themes in current Luke app:

 

2.4 Deliverables

Analyzer Tool Plugin : Tool for analyzing analyzers.

Vocabulary Analysis Plugin : It shows the index’s vocabulary growth.

Hadoop Plugin : Hadoop Plugin to open indexes located on any filesystem supported by Hadoop.

Themes : Set of predefined themes

Status Bar : Status Bar to show the index directory etc.

Documentation : User and developer documentation on Luke web page.

2.5 Future/Additional Work

If I get time during the work period after completing the above deliverables I’d like to work on one of these:

  • Query Parser for documents
  • Scripting Plugin for Luke

3. Timeline

(Rough estimate)

Week

Dates

Work to be done

Week 0

July 20th to August 2nd

Familiarize with Pivot API
Test out Luke and Lucene app
Familiarize with Lucene code and relevant Hadoop packages.

Week 1-3

August 5th to August 25th

Work on Analyzer Plugin.
Start testing Hadoop with indexes etc.

Week 4-6

August 26th to September 13th

Work on Status Bar and Themes.
Do the documentation for code till date. Mid term evaluation

Week 7-8

September 16th to September 30th

Work on Vocabulary Analysis Plugin

Week 9-10

September 30th to October 14th

Work on Hadoop Plugin

Week 11

October 14th to October 21st

Extra Week for catching up in case of delays. Code review.

Week 12

October 21st to October 28th

Do documentation for code till date.
Submit final evaluation

  • No labels