I think storing a String->String map per index makes sense, but I'm not
sure it makes sense for this to be implemented as a set of Fields, since
many field attributes do not make sense here, such as Indexed,
Tokenized, Stored, Vectored, etc. The Binary & Compressed attributes
could make sense, however. So we could try to refactor Field, but much
of it would not be shared code anyway, since we would probably not use
the same format for field names. So I think I would opt for a similar
yet distinct implementation for index attributes rather than overloading Field.
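
To make the idea concrete, here is a rough sketch of what a standalone
String->String attribute map for an index might look like. The class name
IndexAttributes, its methods, and the on-disk layout (an entry count
followed by name/value pairs) are all assumptions made for illustration;
none of this is existing Lucene API or file format.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-index attribute store: a plain String->String map,
// deliberately independent of Field and its Indexed/Tokenized/Stored flags.
final class IndexAttributes {
    // TreeMap keeps names sorted so the serialized form is deterministic.
    private final Map<String, String> attrs = new TreeMap<>();

    void put(String name, String value) { attrs.put(name, value); }

    String get(String name) { return attrs.get(name); }

    int size() { return attrs.size(); }

    // Assumed layout: entry count, then (name, value) pairs.
    void write(DataOutput out) throws IOException {
        out.writeInt(attrs.size());
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    static IndexAttributes read(DataInput in) throws IOException {
        IndexAttributes result = new IndexAttributes();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            String value = in.readUTF();
            result.put(name, value);
        }
        return result;
    }
}
```

A real implementation would presumably write through the Directory
abstraction rather than java.io, and might add the Binary and Compressed
options mentioned above.
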
Grant Ingersoll wrote:
> At the Index (Directory ???) level, I think we could make it such that a
> Directory/Index can have Fields (perhaps just Keyword based, not sure
> what it would mean to have tokenized text stored on the index)
> associated with it as well. Thus we would be able to use existing
> mechanisms for writing Index level metadata. We would just need a new
> place in the file format for storing these fields. This may call for
> some marker interfaces (in Java land anyway) such that the Index field
> addition mechanism only allows certain kinds of Fields, if that makes sense.
> Then the user could do similar things to an Index that they do to a
> Document (i.e. get value for a field, etc.). I currently simulate this
> in our IR Tools implementation by writing out an XML file that goes in
> the same directory as the index files and stores metadata about the index.
> Doug Cutting wrote:
>> Marvin Humphrey wrote:
>>> The number of bits per position dedicated to weighting (4-8 out of
>>> 16) in the Google paper is maddeningly small. However the number of
>>> bits per document per term for a common term is embarrassingly large
>>> compared to the 8 Lucene currently has available. It strikes me
>>> that it might be helpful to delta encode not just positions, but
>>> boosts as well.
>> Not sure what you mean here, since boosts are not ordered.
>> Personally I think the eight-bit floats used by Lucene give plenty of
>> precision for this class of computation. Relevant documents should be
>> easily distinguished from non-relevant documents, and fine differences
>> in ranking between relevant documents don't matter. The only time
>> folks have complained about the precision of eight-bit floats in
>> Lucene is when they've attempted to overload them with other semantics
>> besides relevance (e.g., dates), which is inappropriate.
>> So I would opt for delta-encoded positions with a one-byte boost.
>> I think combining frequencies and positions in a single file might be
>> useful. If folks don't want to pay the penalty of pawing through
>> positions then they should disable position indexing for that field.
>> Another thing folks have frequently asked for is per-document boosts
>> (i.e., boosts instead of frequencies, and no positions).
>> Some useful posting options are thus:
>> a. <doc>+
>> b. <doc, boost>+
>> c. <doc, freq, <position>+ >+
>> d. <doc, freq, <position, boost>+ >+
>> These suggest the following booleans per field:
>> 1. freq
>> 2. document boost
>> 3. position (requires freq)
>> 4. position boost (requires position)
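
The four booleans and their dependencies quoted above could be sketched
roughly as follows. The class PostingOptions and its layout() helper are
hypothetical names used only for illustration, not part of Lucene. The
sketch enforces just the two "requires" rules from the list and assumes
any other combination of flags is allowed.

```java
// Hypothetical per-field posting configuration built from the four flags.
final class PostingOptions {
    final boolean freq;
    final boolean docBoost;
    final boolean position;
    final boolean positionBoost;

    PostingOptions(boolean freq, boolean docBoost,
                   boolean position, boolean positionBoost) {
        // Dependency rules taken from the list: 3 requires 1, 4 requires 3.
        if (position && !freq) {
            throw new IllegalArgumentException("position requires freq");
        }
        if (positionBoost && !position) {
            throw new IllegalArgumentException("position boost requires position");
        }
        this.freq = freq;
        this.docBoost = docBoost;
        this.position = position;
        this.positionBoost = positionBoost;
    }

    // Renders the posting layout in the <doc, ...>+ notation used in the thread.
    String layout() {
        StringBuilder sb = new StringBuilder("<doc");
        if (docBoost) sb.append(", boost");
        if (freq) sb.append(", freq");
        if (position) {
            sb.append(positionBoost ? ", <position, boost>+ " : ", <position>+ ");
        }
        return sb.append(">+").toString();
    }
}
```

With these rules, new PostingOptions(true, false, true, true).layout()
yields the string for option d, and asking for positions without freq is
rejected at construction time.
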