LuceneIndexTransformer is a component that creates or updates Lucene indexes.
This component only writes the index: to search the index, use the SearchGenerator component.
Why use it?
Instead of using LuceneIndexTransformer, you could generate an index by crawling your website. However, the LuceneIndexTransformer is much, much faster than crawling.
The big differences for the developer are:
- Using the LuceneIndexTransformer requires you to write a pipeline that can generate a lucene:index document describing your searchable URI space, so it's necessary to have a well-defined URI space. For a site with a consistent structure this should not be too hard. This pipeline can use aggregation and inclusion mechanisms to produce a full list of the pages you want to search. In this way it's also possible to generate an index for websites with forms which are not crawlable.
- On the other hand, the crawler is a more generic solution, though far less efficient. It doesn't require a pipeline to "document" the entire searchable URI space. Instead, you must create a content view and a links view for each of the searchable pipelines. The URI space is then defined by crawling the links view.
Declaring the LuceneIndexTransformer
The transformer must be declared in the <map:transformers> section of your sitemap:
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:components>
    ...
    <map:transformers default="xslt">
      <map:transformer name="index"
                       logger="sitemap.transformer.luceneindextransformer"
                       src="org.apache.cocoon.transformation.LuceneIndexTransformer"/>
    </map:transformers>
    ...
  </map:components>
  ...
</map:sitemap>
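Once declared, the transformer can be used in a pipeline like any other. The following is only a minimal sketch: the match pattern "update-index" and the source file "index-source.xml" are hypothetical names chosen for illustration, and the source is assumed to already be a lucene:index document (in practice it might be assembled by aggregation instead).

```xml
<map:pipeline>
  <!-- "update-index" is a hypothetical request pattern -->
  <map:match pattern="update-index">
    <!-- Hypothetical source: a file that already contains a lucene:index document -->
    <map:generate src="index-source.xml"/>
    <!-- "index" is the name given to the LuceneIndexTransformer in the declaration above -->
    <map:transform type="index"/>
    <!-- Return the transformer's output as plain XML -->
    <map:serialize type="xml"/>
  </map:match>
</map:pipeline>
```

Requesting the matched URI would then run the indexing and return the transformer's output.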
Input document for the LuceneIndexTransformer
This is a sample of the kind of document that the transformer expects. Note: in this example, I've chosen a couple of simple XHTML documents as the content to be indexed. This is only because everyone knows XHTML; in practice you should typically generate the index from an early stage in the pipeline, indexing DocBook, TEI, etc., rather than a presentation format like HTML.
<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
              analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"
              directory="index"
              create="false"
              merge-factor="20">
  <lucene:document url="http://localhost/sample.html">
    <!-- here is some sample content -->
    <html>
      <head>
        <title lucene:store="true">Sample</title>
      </head>
      <body>
        <h1>Blah</h1>
        <a href="blah.jpg" title="download blah image" lucene:text-attr="title">
          <img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/>
        </a>
      </body>
    </html>
  </lucene:document>
  <lucene:document url="http://localhost/sample-2.html">
    <!-- Another sample doc -->
    <html>
      <head>
        <title lucene:store="true">Second Sample</title>
      </head>
      <body>
        <h1>Foo</h1>
        <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. </p>
      </body>
    </html>
  </lucene:document>
</lucene:index>
What the lucene:index document means
The lucene:index element
The root element is lucene:index. The attributes of the lucene:index in the sample above are shown with their default values, so the effect is as if they were not specified at all.
The merge-factor and analyzer attributes
See the Lucene documentation for explanations of what they mean.
The optimize-frequency attribute (since version 2.2)
Determines how often the Lucene index will be optimized. When you have thousands of documents, optimizing the index can become quite slow (e.g. 7 seconds for 9000 small documents on a P4).
- 1: always optimize (default)
- 0: never optimize
- x: optimize roughly once every x times. You can use any number; a random generator decides whether or not to optimize on each run.
For example, you can create one pipeline that never optimizes and use it to index your document every time it is modified. You can then create another pipeline that does optimize, and call it manually. For more information see the Lucene FAQ entry "What is index optimization and when should I use it?":
http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8
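As a rough illustration of the two-pipeline approach, the two pipelines might produce lucene:index documents that differ only in their optimize-frequency value. This is a sketch under assumptions: the document URL is taken from the sample input, the "..." stands for whatever content you index, and whether a lucene:index without any lucene:document children would still trigger an optimization is not something this page states, so the second document includes one.

```xml
<!-- Input document for the frequently called pipeline: update without optimizing -->
<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
              directory="index"
              create="false"
              optimize-frequency="0">
  <lucene:document url="http://localhost/sample.html">
    ...
  </lucene:document>
</lucene:index>

<!-- Input document for the manually called pipeline: always optimize -->
<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
              directory="index"
              create="false"
              optimize-frequency="1">
  <lucene:document url="http://localhost/sample.html">
    ...
  </lucene:document>
</lucene:index>
```

Both documents point at the same directory, so they operate on the same index.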
The directory attribute
This attribute controls where the index files are stored. The path is relative to the Cocoon work directory.
The create attribute
This attribute controls whether the index is recreated.
- If create="false" and the index already exists then the index will be updated. Documents which are already indexed will be removed from the index and reinserted.
- If the index does not exist then it will be created even if create="false".
- If create="true" then any existing index will be destroyed and a new index created. If you are rebuilding your entire index then you should use create="true" because the indexer doesn't need to remove old documents from the index, so it will be faster.
The lucene:document element
Lucene will index the content of each lucene:document element, which may contain any XML content. The indexed content is associated with the URL specified by the url attribute, so this URL is what a search will return as its result.
The lucene:text-attr attribute
Normally Lucene will only index the text content of elements, not attribute values. To index the attributes of an element as well, give it an attribute called lucene:text-attr, containing a list of the names of the attributes you want indexed. For example, to index the value of the alt attribute of an img element in HTML:
<img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/>
This would index the text "Blah".
The lucene:store attribute
Normally Lucene will only index the text of an element, not store it. To store the text of an element in Lucene's index, add a lucene:store="true" attribute to the element. It's a good idea to store the title of a document in Lucene, so that your search results can show a document title as well as a URL.
The transformation
The transformer copies the source document to the output, except for the content of the lucene:document elements.
The transformer also adds an elapsed-time attribute to the output lucene:document elements, showing the time (in milliseconds) taken to index that document. You can use XSLT to transform the results into a report on the indexing operation.
Sample output
<?xml version="1.0" encoding="UTF-8"?>
<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
              merge-factor="20"
              create="false"
              directory="index"
              analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
  <lucene:document url="JCB-001/full.html" elapsed-time="3846"/>
  <lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/>
  <lucene:document url="JCB-002/full.html" elapsed-time="361"/>
  <lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/>
  <lucene:document url="JCB-003/full.html" elapsed-time="300"/>
  <lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/>
</lucene:index>
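Output like the sample above could be turned into a simple report with a stylesheet along the following lines. This is only a sketch in XSLT 1.0: the HTML layout and the wording of the report are assumptions made for illustration, while the element and attribute names (lucene:index, lucene:document, directory, url, elapsed-time) are those of the transformer's output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
    exclude-result-prefixes="lucene">

  <!-- Build one HTML report page from the lucene:index root element -->
  <xsl:template match="/lucene:index">
    <html>
      <head>
        <title>Indexing report</title>
      </head>
      <body>
        <h1>Indexing report for directory "<xsl:value-of select="@directory"/>"</h1>
        <table>
          <tr>
            <th>URL</th>
            <th>Elapsed time (ms)</th>
          </tr>
          <xsl:apply-templates select="lucene:document"/>
        </table>
        <!-- Summary: number of documents and total indexing time -->
        <p>
          <xsl:value-of select="count(lucene:document)"/> documents indexed in
          <xsl:value-of select="sum(lucene:document/@elapsed-time)"/> ms.
        </p>
      </body>
    </html>
  </xsl:template>

  <!-- One table row per indexed document -->
  <xsl:template match="lucene:document">
    <tr>
      <td><xsl:value-of select="@url"/></td>
      <td><xsl:value-of select="@elapsed-time"/></td>
    </tr>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the sample output, the summary line would report 6 documents and a total of 10896 ms. In a sitemap, such a stylesheet would typically be applied in a transform step placed after the index transformer.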
Note to users of Mac OS X
By default, Java cannot open more than 256 files at a time, so you may get an error like the following:
Description: org.apache.cocoon.ProcessingException: Failed to execute pipeline.: java.lang.RuntimeException: java.io.FileNotFoundException: /usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 (Too many open files)
To avoid this error, you should set your ulimit in the shell script that starts Tomcat. My line reads as follows:
ulimit -S -n 1000
Read more about this here: http://www.amug.org/~glguerin/howto/More-open-files.html
Note to users of Red Hat Linux
If you get the error "(Empty StackException)" while creating the index with the LuceneIndexTransformer, try lowering your merge-factor (the default should be 10). See the Lucene documentation for more information.