...

If you are looking at example code (in an article or book, perhaps) and just need to understand how the example would change to work with 2.0 (without needing to actually compile it), you can review the javadocs for Lucene 1.9 and look up any methods used in the examples that are no longer part of Lucene. The 1.9 javadocs will have a clear deprecation message explaining how to get the same effect using the 2.x methods.

How is Lucene's indexing and search performance measured?

Check Lucene bench: https://home.apache.org/~mikemccand/lucenebench/

I am having a performance issue. How do I ask for help on the java-user@lucene.apache.org mailing list?

...

The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.

No Format

TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
try
{
    while (terms.term() != null
           && "FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...

        if (!terms.next())
            break;
    }
}
finally
{
    terms.close();
}

...

One Explanation...

No Format

  > Does anyone have an example of limiting results returned based on a
  > score threshold? For example if I'm only interested in documents with
  > a score > 0.05.

I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches).  The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.
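The ratio-to-top-score idea described above can be sketched in plain Java, without the Lucene API; the class and method names here are illustrative, and the scores are assumed sorted descending, as Lucene returns them:

```java
import java.util.ArrayList;
import java.util.List;

public class ScoreRatioFilter {
    // Keep only hits whose score is at least `ratio` of the top score.
    // Absolute scores are not meaningful across searches, but the ratio
    // of a hit's score to the top score of the same search is.
    public static List<Float> filterByRatio(List<Float> scores, float ratio) {
        List<Float> kept = new ArrayList<>();
        if (scores.isEmpty()) {
            return kept;
        }
        float top = scores.get(0);  // scores assumed sorted descending
        for (float s : scores) {
            if (s / top >= ratio) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

Note that this still cannot tell you whether the result set as a whole is any good; it only trims hits that score far below the best hit of the same search.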

...

The components responsible for this are the various Analyzers. Make sure you use the appropriate analyzer. For example, StandardAnalyzer does not remove numbers, but it removes most punctuation.


Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread safe?



Yes, the IndexWriter.addIndexes(Directory[]) method is thread safe (it is a synchronized method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. In fact it is impossible to use more than one IndexWriter on the same index directory, as this would lead to an exception when trying to create the lock file.
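The pattern the answer describes (one shared, internally synchronized writer object used from many threads) can be illustrated in plain Java; SharedWriter below is a hypothetical stand-in for IndexWriter, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for IndexWriter: a single shared instance whose
// mutating methods are synchronized, so many threads may call it safely.
public class SharedWriter {
    private final List<String> docs = new ArrayList<>();

    public synchronized void addDocument(String doc) {
        docs.add(doc);
    }

    public synchronized int numDocs() {
        return docs.size();
    }

    public static void main(String[] args) throws InterruptedException {
        SharedWriter writer = new SharedWriter();  // one writer, shared by all threads
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < 4; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    writer.addDocument("doc");
                }
            });
            thread.start();
            threads.add(thread);
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(writer.numDocs());  // 4000: no additions lost
    }
}
```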


When is it possible for document IDs to change?

...

Here is an example:

No Format

public class MyAnalyzer extends ReusableAnalyzerBase {
  private Version matchVersion;

  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream sink = new LowerCaseFilter(matchVersion, source);
    sink = new LengthFilter(sink, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, sink);
  }
}

...

If you want your custom token modification to come after the filters that Lucene's StandardAnalyzer class would normally call, do the following:

No Format

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream sink = new StandardFilter(matchVersion, source);
    sink = new LowerCaseFilter(matchVersion, sink);
    sink = new StopFilter(matchVersion, sink,
                          StopAnalyzer.ENGLISH_STOP_WORDS_SET, false);
    sink = new CaseNumberFilter(sink);
    sink = new NameFilter(sink);
    return new TokenStreamComponents(source, sink);
  }

...

Lucene only uses Java strings, so you normally do not need to care about this. Just remember that you may need to specify an encoding when you read in external strings from e.g. a file (otherwise the system's default encoding will be used). If a String was decoded with the wrong encoding (for example, UTF-8 bytes read as ISO-8859-1), you can repair it with this hack:

No Format

// re-decode bytes that were wrongly read as ISO-8859-1 as UTF-8 instead
String newStr = new String(someString.getBytes("ISO-8859-1"), "UTF-8");
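To illustrate the point about specifying an encoding explicitly when reading external data, here is a small self-contained sketch using only the standard library; the file name and contents are illustrative:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingExample {
    // Decode raw bytes with an explicit charset instead of relying on
    // the platform default, which varies between systems.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lucene-faq", ".txt");
        Files.write(tmp, "héllo".getBytes(StandardCharsets.UTF_8));
        // Reading back: specify the encoding explicitly, otherwise the
        // system default is used and accented characters may be mangled.
        String text = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        System.out.println(text.equals("héllo"));  // true
        Files.delete(tmp);
    }
}
```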

...

Note that the article uses an older version of Apache Lucene. For parsing the Java source files and extracting that information, the ASTParser of the Eclipse Java Development Tools is used.


What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?


When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

...