The following describes how Lucene scoring evolves from underlying information retrieval models to (efficient) implementation. We first give a brief overview of the VSM score, then derive from it Lucene's Conceptual Scoring Formula, from which, finally, Lucene's Practical Scoring Function evolves (the latter is connected directly with Lucene classes and methods).
Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval: documents "approved" by BM are scored by VSM.
In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension, and weights are Tf-idf values.
VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality, and so Lucene uses Tf-idf. Tf and Idf are described in more detail below, but for now, for completeness, let's just say that for a given term t and document (or query) x, Tf(t,x) varies with the number of occurrences of term t in x (when one increases so does the other) and idf(t) similarly varies with the inverse of the number of index documents containing term t.
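As a minimal illustration of these two factors, the math can be sketched in plain Java. The square-root and logarithmic damping shown here match the DefaultSimilarity formulas given further below; the class name and the sample values are hypothetical:

```java
// Illustrative sketch of the tf and idf factors described above.
// This is plain math, not Lucene code; the damping functions follow
// the DefaultSimilarity formulas shown later in this document.
public class TfIdfSketch {

    // tf(t in d): grows with the number of occurrences of t in d,
    // damped by a square root so repeated occurrences matter less
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf(t): grows as fewer index documents contain t
    // (Math.log is the natural logarithm)
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

For example, a term occurring 4 times in a document and present in 10 of 1000 index documents would get the weight `tf(4) * idf(10, 1000)`; a rarer term with the same frequency scores higher.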
VSM score of document d for query q is the Cosine Similarity of the weighted query vectors V(q) and V(d):
cosine-similarity(q,d) = V(q) · V(d) / ( |V(q)| · |V(d)| )
Note: the above equation can be viewed as the dot product of the normalized weighted vectors, in the sense that dividing V(q) by its Euclidean norm normalizes it to a unit vector.
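The cosine similarity itself is plain vector arithmetic; a minimal sketch follows. Dense arrays are used only for brevity here; Lucene of course never materializes document vectors this way:

```java
// Sketch of the cosine-similarity score above: the dot product of two
// weight vectors divided by the product of their Euclidean norms.
public class Cosine {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Euclidean norm: sqrt of the sum of squared weights
    static double norm(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    static double cosine(double[] q, double[] d) {
        return dot(q, d) / (norm(q) * norm(d));
    }
}
```

A vector compared with itself scores 1.0; orthogonal vectors (no shared terms) score 0.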
Lucene refines VSM score for both search quality and usability: document length is captured by a separate normalization factor (doc-len-norm) rather than by normalizing V(d) to a unit vector, documents and query terms can be boosted at index time and at search time, and documents matching more of the query terms are rewarded by a coordination factor.
Under the simplifying assumption of a single field in the index, we get Lucene's Conceptual scoring formula:
score(q,d) = coord-factor(q,d) · query-boost(q) · ( V(q) · V(d) / |V(q)| ) · doc-len-norm(d) · doc-boost(d)
The conceptual formula is a simplification in the sense that (1) terms and documents are fielded and (2) boosts are usually per query term rather than per query.
We now describe how Lucene implements this conceptual scoring formula, and derive from it Lucene's Practical Scoring Function.
For efficient score computation some scoring components are computed and aggregated in advance: the query boost and the query Euclidean norm |V(q)| depend only on the query, so they are computed once when search starts, while the document length norm and document boost are known at indexing time and are written to the index, folded into a single norm(t,d) value.
Lucene's Practical Scoring Function is derived from the above, and its components correspond directly to those of the conceptual formula:

score(q,d) = coord(q,d) · queryNorm(q) · ∑ (t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
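The practical scoring function can be sketched end to end in plain Java. All inputs below (frequencies, idf values, boosts, norms) are hypothetical values; real Lucene gathers them from the index and from the query:

```java
// Hedged sketch of the practical scoring function:
// score(q,d) = coord(q,d) * queryNorm(q)
//              * sum over query terms t of
//                ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
public class PracticalScore {

    // freqs[t]  : occurrences of query term t in document d (0 if absent)
    // idfs[t]   : idf(t); boosts[t] : t.getBoost(); norms[t] : norm(t,d)
    // overlap   : number of query terms found in d; maxOverlap : query size
    static double score(int[] freqs, double[] idfs, double[] boosts,
                        double[] norms, int overlap, int maxOverlap) {
        // queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
        double sumSq = 0;
        for (int t = 0; t < idfs.length; t++) {
            double w = idfs[t] * boosts[t];
            sumSq += w * w;
        }
        double queryNorm = 1.0 / Math.sqrt(sumSq);

        // coord(q,d): reward documents matching more query terms
        double coord = (double) overlap / maxOverlap;

        double sum = 0;
        for (int t = 0; t < idfs.length; t++) {
            if (freqs[t] == 0) continue; // term absent from d
            sum += Math.sqrt(freqs[t])          // tf(t in d)
                   * idfs[t] * idfs[t]          // idf(t)^2
                   * boosts[t]                  // t.getBoost()
                   * norms[t];                  // norm(t,d)
        }
        return coord * queryNorm * sum;
    }
}
```

With a single matching term and all factors equal to 1, the score is exactly 1.0, which makes the sketch easy to sanity-check.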
where
{@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} = frequency½

{@link org.apache.lucene.search.DefaultSimilarity#idf(int,int) idf(t)} = 1 + log( numDocs / (docFreq + 1) )

queryNorm(q) = {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)} = 1 / sumOfSquaredWeights½

{@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} = {@link org.apache.lucene.search.Query#getBoost() q.getBoost()}² · ∑ (t in q) ( idf(t) · t.getBoost() )²
norm(t,d) encapsulates a few (indexing-time) boost and length factors. When a document is added to the index, all these factors are multiplied; if the document has multiple fields with the same name, all their boosts are multiplied together:
norm(t,d) = {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()} · {@link #lengthNorm(String,int) lengthNorm(field)} · ∏ (field f in d named as t) {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost()}
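The norm(t,d) product can be sketched directly. The 1/√numTerms length normalization below matches DefaultSimilarity's default; the input values are made up for illustration:

```java
// Sketch of the index-time norm(t,d) product: document boost, times the
// field length norm, times the boost of every same-named field instance.
public class NormSketch {

    // DefaultSimilarity-style length normalization: shorter fields
    // contribute more per matching term
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    static double norm(double docBoost, int numTerms, double... fieldBoosts) {
        double n = docBoost * lengthNorm(numTerms);
        for (double b : fieldBoosts) n *= b; // product over same-named fields
        return n;
    }
}
```

For example, a 4-term field with document boost 1.0 and a single field boost of 2.0 yields 1.0 · 0.5 · 2.0 = 1.0.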
Expert: Scoring API.
This is a low-level API, you should only extend this API if you want to implement an information retrieval model. If you are instead looking for a convenient way to alter Lucene's scoring, consider extending a higher-level implementation such as {@link TFIDFSimilarity}, which implements the vector space model with this API, or just tweaking the default implementation: {@link DefaultSimilarity}.
Similarity determines how Lucene weights terms, and Lucene interacts with this class at both index-time and query-time.
At indexing time, the indexer calls {@link #computeNorm(FieldInvertState)}, allowing the Similarity implementation to set a per-document value for the field that will be later accessible via {@link AtomicReader#getNormValues(String)}. Lucene makes no assumption about what is in this norm, but it is most useful for encoding length normalization information.
Implementations should carefully consider how the normalization is encoded: while Lucene's classical {@link TFIDFSimilarity} encodes a combination of index-time boost and length normalization information with {@link SmallFloat} into a single byte, this might not be suitable for all purposes.
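To make the lossiness concrete, here is a toy single-byte encoding in the spirit of that scheme, but not Lucene's actual SmallFloat code: 256 representable values on an exponential grid (the spacing chosen here is arbitrary), with each float snapped to the nearest one. Nearby norms collapse to the same byte:

```java
// Toy illustration of why packing a norm into one byte is lossy.
// This is NOT Lucene's SmallFloat; the 256-entry exponential table is
// a hypothetical stand-in with similar behavior.
public class NormEncoding {

    // decode table: 256 representable norm values, exponentially spaced
    static final float[] TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            TABLE[i] = (float) Math.pow(2, (i - 128) / 16.0);
        }
    }

    // snap to the nearest representable value (linear scan for clarity)
    static byte encode(float norm) {
        int best = 0;
        for (int i = 1; i < 256; i++) {
            if (Math.abs(TABLE[i] - norm) < Math.abs(TABLE[best] - norm)) {
                best = i;
            }
        }
        return (byte) best;
    }

    static float decode(byte b) {
        return TABLE[b & 0xFF];
    }
}
```

Note that 1.0f and 1.01f encode to the same byte: a ranking function that depends on fine norm distinctions would need a wider encoding, which is exactly the trade-off the paragraph above warns about.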
Many formulas require the use of average document length, which can be computed via a combination of {@link CollectionStatistics#sumTotalTermFreq()} and {@link CollectionStatistics#maxDoc()} or {@link CollectionStatistics#docCount()}, depending upon whether the average should reflect field sparsity.
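The average-length computation described above reduces to simple arithmetic once the collection statistics are in hand; the method below is an illustrative stand-in, not a Lucene API:

```java
// Sketch of average field length from collection statistics.
// sumTotalTermFreq: total term occurrences in the field across the index.
// Dividing by docCount (documents that actually have the field) ignores
// field sparsity; dividing by maxDoc instead would reflect it.
public class AvgLength {

    static double avgFieldLength(long sumTotalTermFreq, int docCount) {
        return (double) sumTotalTermFreq / docCount;
    }
}
```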
Additional scoring factors can be stored in named NumericDocValuesFields and accessed at query-time with {@link AtomicReader#getNumericDocValues(String)}.
Finally, using index-time boosts (either via folding into the normalization byte or via DocValues) is an inefficient way to boost the scores of different fields if the boost will be the same for every document. Instead, the Similarity can simply take a constant boost parameter C, and {@link PerFieldSimilarityWrapper} can return different instances with different boosts depending upon field name.
At query-time, Queries interact with the Similarity via these steps:
When {@link IndexSearcher#explain(org.apache.lucene.search.Query,int)} is called, queries consult the Similarity's DocScorer for an explanation of how it computed its score. The query passes in the document id and an explanation of how the frequency was computed.
@see org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)
@see IndexSearcher#setSimilarity(Similarity)
@lucene.experimental