Expert: Scoring API.
This is a low-level API; you should only extend it if you want to implement an information retrieval model. If you are instead looking for a convenient way to alter Lucene's scoring, consider extending a higher-level implementation such as {@link TFIDFSimilarity}, which implements the vector space model with this API, or just tweaking the default implementation: {@link DefaultSimilarity}.
Similarity determines how Lucene weights terms, and Lucene interacts with this class at both index-time and query-time.
At indexing time, the indexer calls {@link #computeNorm(FieldInvertState)}, allowing the Similarity implementation to set a per-document value for the field that will be later accessible via {@link AtomicReader#getNormValues(String)}. Lucene makes no assumption about what is in this norm, but it is most useful for encoding length normalization information.
Implementations should carefully consider how the normalization is encoded: while Lucene's classical {@link TFIDFSimilarity} encodes a combination of index-time boost and length normalization information with {@link SmallFloat} into a single byte, this might not be suitable for all purposes.
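To make the precision trade-off concrete, the following is a minimal sketch of a lossy one-byte float encoding in the spirit of the scheme described above. The class ByteNormSketch, its bit layout, and the 1/sqrt(length) normalization are assumptions of this illustration, not Lucene's actual {@link SmallFloat} or {@link TFIDFSimilarity} code.

```java
/** Hypothetical illustration of quantizing a length norm into one byte. */
final class ByteNormSketch {
    // Offset subtracted after shifting, so that the kept bits fit into 0..255.
    private static final int ZERO_POINT = (63 - 15) << 3; // 384

    /** Keeps the low exponent bits and the top two explicit mantissa bits of f. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        if (small <= ZERO_POINT) {
            // Zero and negatives map to 0; tiny positives map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // overflow saturates at the largest code (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Inverse of encode, up to the precision lost by truncation. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << 21) + ((63 - 15) << 24);
        return Float.intBitsToFloat(bits);
    }

    /** Assumed length normalization for this sketch: 1 / sqrt(number of terms). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int len : new int[] {1, 6, 7, 16, 100}) {
            float norm = lengthNorm(len);
            System.out.println(len + " terms: " + norm + " -> " + decode(encode(norm)));
        }
    }
}
```

In this sketch, fields of 6 and 7 terms end up with the same encoded byte, which is the kind of information loss an implementation has to weigh against the one-byte-per-document storage cost.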
Many formulas require the use of average document length, which can be computed via a combination of {@link CollectionStatistics#sumTotalTermFreq()} and {@link CollectionStatistics#maxDoc()} or {@link CollectionStatistics#docCount()}, depending upon whether the average should reflect field sparsity.
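As a rough sketch of that computation, the helper below takes the two statistics as plain numbers so that it runs without an index. The class name AvgFieldLengthSketch, the parameter names, and the fallback value of 1 for unavailable statistics are assumptions of this example; in real code the inputs would come from the {@link CollectionStatistics} methods named above.

```java
/** Hypothetical helper showing how an average field length could be derived. */
final class AvgFieldLengthSketch {

    /**
     * @param sumTotalTermFreq total number of tokens indexed for the field
     * @param docDenominator   maxDoc (every document counts) or docCount
     *                         (only documents that contain the field count)
     */
    static float avgFieldLength(long sumTotalTermFreq, long docDenominator) {
        // Assumption of this sketch: non-positive inputs mean "statistic unavailable".
        if (sumTotalTermFreq <= 0 || docDenominator <= 0) {
            return 1f;
        }
        return (float) (sumTotalTermFreq / (double) docDenominator);
    }

    public static void main(String[] args) {
        long tokens = 1000;
        long maxDoc = 100;  // all documents in the index
        long docCount = 40; // documents that actually have the field
        System.out.println("avg over maxDoc:   " + avgFieldLength(tokens, maxDoc));
        System.out.println("avg over docCount: " + avgFieldLength(tokens, docCount));
    }
}
```

Dividing by maxDoc treats documents without the field as having length zero, while dividing by docCount averages only over documents that contain the field, so the second value is larger for sparse fields.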
Additional scoring factors can be stored in named NumericDocValuesFields and accessed at query-time with {@link AtomicReader#getNumericDocValues(String)}.
Finally, using index-time boosts (either by folding them into the normalization byte or via DocValues) is an inefficient way to boost the scores of different fields if the boost will be the same for every document. Instead, the Similarity can simply take a constant boost parameter C, and {@link PerFieldSimilarityWrapper} can return different instances with different boosts depending upon field name.
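The constant-boost idea can be sketched without any Lucene dependency as follows. Everything here is hypothetical: BoostedScorer stands in for a Similarity that takes the constant C, and the get method only mimics the role of the per-field lookup that {@link PerFieldSimilarityWrapper} provides.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-field lookup of scorers that differ only in a constant boost. */
final class PerFieldBoostSketch {

    /** Stand-in for a Similarity configured with a constant boost C. */
    static final class BoostedScorer {
        private final float boost;

        BoostedScorer(float boost) {
            this.boost = boost;
        }

        /** Applies the constant boost at scoring time; nothing is stored per document. */
        float score(float rawScore) {
            return boost * rawScore;
        }
    }

    private final Map<String, BoostedScorer> perField = new HashMap<>();
    private final BoostedScorer fallback = new BoostedScorer(1.0f);

    /** Registers a constant boost for one field and returns this for chaining. */
    PerFieldBoostSketch withField(String field, float boost) {
        perField.put(field, new BoostedScorer(boost));
        return this;
    }

    /** Returns the scorer for a field, or an unboosted one if none was registered. */
    BoostedScorer get(String field) {
        return perField.getOrDefault(field, fallback);
    }

    public static void main(String[] args) {
        PerFieldBoostSketch sims = new PerFieldBoostSketch().withField("title", 2.0f);
        System.out.println("title: " + sims.get("title").score(1.5f));
        System.out.println("body:  " + sims.get("body").score(1.5f));
    }
}
```

Because the boost lives in the scorer instance rather than in the index, changing it in this sketch does not require reindexing.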
At query-time, Queries interact with the Similarity via these steps:
When {@link IndexSearcher#explain(org.apache.lucene.search.Query,int)} is called, queries consult the Similarity's DocScorer for an explanation of how it computed its score. The query passes in the document id and an explanation of how the frequency was computed.
@see org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)
@see IndexSearcher#setSimilarity(Similarity)
@lucene.experimental