Notes and Q&A
Notes
- In Lucene, ranking is per-field.
- What about deleted docs? And what about the other methods?
  <rmuir> maxDoc() doesn't reflect deletes
  <rmuir> docFreq() doesn't reflect deletes
  <rmuir> numDocs() reflects deletes
- sumOfNorms can be used as a "sum of lengths", provided the norm reflects the length (and not 1/sqrt(#tokens), as the default does).
- Lucene indexes in segments. For ranking we need to see the whole index; that's why we climb up to the top of the segment tree via ReaderUtil.getTopLevelContext(context) in MockBM25Similarity.avgDocumentLength().
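The point about sumOfNorms can be illustrated outside Lucene: if the norm stores the raw length, summing norms gives the total length directly; with the default 1/sqrt(#tokens) encoding the norms would have to be inverted per document first. A minimal sketch, not Lucene API, with made-up names:

```java
// Sketch: why sumOfNorms is only a "sum of lengths" if the norm
// actually encodes the length. All names here are hypothetical.
public class NormLengthSketch {
    // Default-style norm: 1 / sqrt(#tokens) -- summing these is meaningless.
    static float defaultNorm(int length) {
        return (float) (1.0 / Math.sqrt(length));
    }

    // Length-preserving norm: store the length itself.
    static float lengthNorm(int length) {
        return length;
    }

    public static void main(String[] args) {
        int[] fieldLengths = {4, 9, 16};  // token counts of three documents

        float sumDefault = 0f, sumLength = 0f, recovered = 0f;
        for (int len : fieldLengths) {
            sumDefault += defaultNorm(len);
            sumLength  += lengthNorm(len);
            // The default norm can still be inverted per document:
            float n = defaultNorm(len);
            recovered += 1.0f / (n * n);
        }

        System.out.println(sumLength);   // 29.0 -- the true total length
        System.out.println(recovered);   // ~29.0 -- recovered by inverting each norm
        System.out.println(sumDefault);  // ~1.08 -- not a length at all
    }
}
```

(In real Lucene the norm is additionally quantized to a byte, so the inversion is lossy; the sketch ignores that.)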
- In Similarity.computeWeight() (soon to be computeStats()) we are already seek'ed to the term, so statistics should be computed there.
- There are three types of boost:
  - score + boost: I do not consider this a boost, but rather a sum of similarity scores, of which one happens to come from outside (e.g. PageRank)
  - score * boost
  - score = tf(boost * freq) * idf
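The last two types are not equivalent once tf() is nonlinear: multiplying the final score scales it linearly, while folding the boost into the frequency pushes it through tf()'s saturation. A sketch with a BM25-style tf (hypothetical names and parameter choice):

```java
// Sketch: score * boost vs. tf(boost * freq) * idf.
// tf() is a BM25-style saturating function with k1 = 1.2.
public class BoostSketch {
    static final double K1 = 1.2;

    static double tf(double freq) {
        return freq * (K1 + 1) / (freq + K1);  // saturates as freq grows
    }

    // Type 2: boost multiplies the final score -- scales linearly.
    static double scoreTimesBoost(double freq, double idf, double boost) {
        return tf(freq) * idf * boost;
    }

    // Type 3: boost is folded into the frequency before saturation --
    // large boosts are damped by tf().
    static double boostInsideTf(double freq, double idf, double boost) {
        return tf(boost * freq) * idf;
    }

    public static void main(String[] args) {
        double freq = 3, idf = 2.0, boost = 4.0;
        System.out.println(scoreTimesBoost(freq, idf, boost)); // grows with boost
        System.out.println(boostInsideTf(freq, idf, boost));   // capped near idf * (k1 + 1)
    }
}
```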
- We prefer manual instantiation (of Similarities and their parts); providers should be written manually.
Problems
- Language modeling would require custom aggregation of query terms
- product instead of weighted sum (this could be solved by using log, but the query norm still messes it up)
- decide which documents have a term, and which do not, because we have to weight them accordingly (p_t or 1 - p_t)
- two types of aggregation?
- per field (definitely Similarity-specific)
- whole query (should be Similarity-specific too, but might be OK if fixed)
- What about phrases? LATER... sum(DF)
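The log trick mentioned above can be sketched in isolation: a product of per-term probabilities is rank-equivalent to a sum of log-probabilities, so a sum-based aggregator could be reused for LM scoring; the problem is that anything rescaling the summands afterwards (e.g. a query norm) breaks the equivalence. Hypothetical names:

```java
// Sketch: product-of-probabilities LM scoring vs. the sum-of-logs form.
// All names are made up; this is not a Lucene API.
public class LmAggregationSketch {
    static double scoreByProduct(double[] termProbs) {
        double p = 1.0;
        for (double pt : termProbs) p *= pt;
        return p;
    }

    static double scoreByLogSum(double[] termProbs) {
        double logSum = 0.0;
        for (double pt : termProbs) logSum += Math.log(pt);
        return logSum;  // rank-equivalent to the product (log is monotonic)
    }

    public static void main(String[] args) {
        double[] probs = {0.2, 0.05, 0.5};
        System.out.println(scoreByProduct(probs));               // ~0.005
        System.out.println(Math.exp(scoreByLogSum(probs)));      // ~0.005, same ranking
        // But if a query norm rescales each log term before summation,
        // exp(norm * logSum) != product -- the equivalence is lost.
    }
}
```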
Questions about Lucene
- Is it possible to design a scoring interface that is consistent across ranking frameworks?
- How do contexts work?
- NormConverter? NO
- Common Normalization, IDF, etc. TOO
- QueryWeight class
- What to pass to score()?
- LM default parameters?
- Factory for DFR (low prio)
- lnu.ltc, LM, DFR+