I am interested in knowing a little more specifically about how Lucene queries are scored. In their documentation, they mention the VSM. I am familiar with VSM, but it seems inconsistent with the types of queries they allow.
I tried stepping through the source code for BooleanScorer2 and BooleanWeight, to no real avail.
My question is: can somebody step through the execution of a BooleanScorer to explain how it combines queries?
Also, is there a way to simply pass in several terms and just get the raw tf.idf score for those terms, the way it is described in the documentation?
The place to start is http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Similarity.html
I think it clears up the inconsistency you see: Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
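For reference, the practical scoring function described on that Similarity page can be written roughly as below. This is a transcription from memory of the 3.x documentation, with the DefaultSimilarity definitions of each factor filled in; the symbol names (freq, docFreq, numDocs, overlap, maxOverlap) are my own shorthand, so check them against the page itself.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)
\]
\[
\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}, \qquad
\mathrm{coord}(q,d) = \frac{\mathrm{overlap}}{\mathrm{maxOverlap}}
\]
```

Here coord is the Boolean-model part leaking into the score (how many optional clauses matched), and the sum is the VSM part.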
The next thing to look at is Searcher.explain, which can give you a string explaining how the score for a (query, document) pair is calculated.
Tracing through the execution of BooleanScorer can be challenging, I think. It's probably easiest to understand BooleanScorer2 first, which uses subscorers like ConjunctionScorer/DisjunctionSumScorer, and to think of BooleanScorer as an optimization.
If this is confusing, then start even simpler at TermScorer. Personally I look at it "bottom-up" anyway:
- A Query creates a Weight valid across the whole index: this incorporates boost, idf, queryNorm, and even, confusingly, boosts of any 'outer'/'parent' queries like BooleanQuery that are holding the term. This weight is computed a single time.
- A Weight creates a Scorer (e.g. TermScorer) for each index segment. For a single term, this scorer has everything it needs in the formula except for what is document-dependent: the within-document term frequency (TF), which it must read from the postings, and the document's length normalization value (norm). So this is why TermScorer scores a document as weight * sqrt(tf) * norm. In practice, weight * sqrt(tf) is cached for tf values < 32, so that scoring most documents is a single multiply.
- BooleanQuery really doesn't do "much", except that its scorers are responsible for nextDoc()'ing and advance()'ing subscorers. When the Boolean model is satisfied, it combines the scores of the subscorers, applying the coordination factor (coord()) based on how many subscorers matched.
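The three steps above can be sketched in plain Java without Lucene on the classpath. This is a toy model, not the real classes: `ClassicScoringSketch`, `ToyTermScorer`, `termWeight` and `combine` are hypothetical names, the tf/idf/coord formulas are the DefaultSimilarity ones as I understand them, and the weight is assumed to be idf^2 * boost * queryNorm.

```java
/** Toy model of the classic (3.x-era) scoring path; NOT the real Lucene classes. */
final class ClassicScoringSketch {

    /** DefaultSimilarity-style tf: square root of the within-document frequency. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Step 1: index-wide weight, computed once per query term (assumed form). */
    static float termWeight(float idf, float boost, float queryNorm) {
        return idf * idf * boost * queryNorm;
    }

    /** Step 2: hypothetical stand-in for TermScorer. */
    static final class ToyTermScorer {
        private static final int CACHE_SIZE = 32;
        private final float weight;
        private final float[] cache = new float[CACHE_SIZE];

        ToyTermScorer(float weight) {
            this.weight = weight;
            // Precompute weight * sqrt(tf) for small tf values.
            for (int i = 0; i < CACHE_SIZE; i++) {
                cache[i] = tf(i) * weight;
            }
        }

        /** freq comes from the postings, norm from the document's stored norm. */
        float score(int freq, float norm) {
            float raw = freq < CACHE_SIZE ? cache[freq] : tf(freq) * weight;
            return raw * norm; // the "single multiply" for cached tf values
        }
    }

    /** DefaultSimilarity-style coord: fraction of clauses that matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Step 3: sum the scores of the matching subscorers, then apply coord. */
    static float combine(float[] subScores, boolean[] matched) {
        float sum = 0f;
        int overlap = 0;
        for (int i = 0; i < subScores.length; i++) {
            if (matched[i]) {
                sum += subScores[i];
                overlap++;
            }
        }
        return sum * coord(overlap, subScores.length);
    }

    public static void main(String[] args) {
        float w = termWeight(idf(9, 10), 1.0f, 1.0f);
        ToyTermScorer scorer = new ToyTermScorer(w);
        float termScore = scorer.score(4, 0.5f);
        // Two optional clauses, only the first one matched this document.
        float docScore = combine(new float[] {termScore, 0f}, new boolean[] {true, false});
        System.out.println("term score = " + termScore + ", document score = " + docScore);
    }
}
```

If all you want is the raw tf.idf for a handful of terms, `tf(freq) * idf(docFreq, numDocs)` from the sketch is the core of it; everything else (queryNorm, norm, coord) is normalization layered on top.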
In general, it's definitely difficult to trace through how Lucene scores documents, because in all released versions the Scorers are responsible for two things: matching and calculating scores. In Lucene's trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/) these are now separated, in such a way that a Similarity is basically responsible for all aspects of scoring, and this is separate from matching. So the API there might be easier to understand, maybe harder, but at least you can refer to implementations of many other scoring models (BM25, language models, divergence from randomness, information-based models) if you get confused: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/similarities/
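To make the "matching is separate from scoring" idea concrete, here is a minimal sketch. It does not use the trunk API: `ToySimilarity`, `ToyTfIdf`, `ToyBM25` and `scoreAll` are hypothetical names, the tf.idf variant leaves out boost and queryNorm, and the BM25 variant uses the common textbook form with assumed defaults k1 = 1.2 and b = 0.75.

```java
/** Toy illustration of pluggable scoring; NOT the Lucene trunk API. */
final class PluggableScoringSketch {

    /** Hypothetical interface: scoring only, no knowledge of how documents matched. */
    interface ToySimilarity {
        float score(int freq, int docLen, float avgDocLen, int docFreq, int numDocs);
    }

    /** Classic-style tf.idf with a 1/sqrt(length) norm (boost and queryNorm omitted). */
    static final class ToyTfIdf implements ToySimilarity {
        public float score(int freq, int docLen, float avgDocLen, int docFreq, int numDocs) {
            double idf = 1.0 + Math.log(numDocs / (double) (docFreq + 1));
            return (float) (Math.sqrt(freq) * idf * idf / Math.sqrt(docLen));
        }
    }

    /** Textbook BM25 with assumed parameters k1 = 1.2, b = 0.75. */
    static final class ToyBM25 implements ToySimilarity {
        private static final double K1 = 1.2;
        private static final double B = 0.75;

        public float score(int freq, int docLen, float avgDocLen, int docFreq, int numDocs) {
            double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
            double lengthNorm = 1.0 - B + B * docLen / avgDocLen;
            return (float) (idf * freq * (K1 + 1.0) / (freq + K1 * lengthNorm));
        }
    }

    /**
     * Matching is decided here (freq > 0 for a single term); the similarity
     * only turns statistics into a number. Non-matching documents get 0.
     */
    static float[] scoreAll(int[] freqs, int[] docLens, ToySimilarity sim) {
        int numDocs = freqs.length;
        int docFreq = 0;
        long totalLen = 0;
        for (int i = 0; i < numDocs; i++) {
            if (freqs[i] > 0) {
                docFreq++;
            }
            totalLen += docLens[i];
        }
        float avgDocLen = totalLen / (float) numDocs;
        float[] scores = new float[numDocs];
        for (int doc = 0; doc < numDocs; doc++) {
            if (freqs[doc] > 0) { // the "Boolean model" check
                scores[doc] = sim.score(freqs[doc], docLens[doc], avgDocLen, docFreq, numDocs);
            }
        }
        return scores;
    }
}
```

Swapping `new ToyTfIdf()` for `new ToyBM25()` changes the numbers but never which documents match, which is the point of the separation.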