Examples of it.unimi.dsi.mg4j.search.score.BM25Scorer

A scorer that implements the BM25 ranking scheme.

BM25 is the name of a ranking scheme for text derived from the probabilistic model. The essential feature of the scheme is that it assigns to each term appearing in a given document a weight that depends on three quantities: the count (the number of occurrences of the term in the document), the frequency (the number of documents in which the term appears) and the document length (in words). It was devised in the early nineties, and it provides a significant improvement over the classical TF/IDF scheme. Karen Spärck Jones, Steve Walker and Stephen E. Robertson give a full account of BM25 and of the probabilistic model in “A probabilistic model of information retrieval: development and comparative experiments”, Inf. Process. Management 36(6):779−840, 2000.

The formula exists in several variants that differ in small details. Here, the weight assigned to a term that appears in f documents out of a collection of N documents, with respect to a document of length l in which the term appears c times, is

log( (N − f + 1/2) / (f + 1/2) ) ( k₁ + 1 ) c ⁄ ( c + k₁ ((1 − b) + bl / L) ),

where L is the average document length, and k₁ and b are parameters that default to {@link #DEFAULT_K1} and {@link #DEFAULT_B}: these values were chosen following the suggestions given in “Efficiency vs. effectiveness in Terabyte-scale information retrieval”, by Stefan Büttcher and Charles L. A. Clarke, in Proceedings of the 14th Text REtrieval Conference (TREC 2005), Gaithersburg, USA, November 2005. The logarithmic part (also known as the idf, or inverse document-frequency, part) is actually replaced by the maximum between its value and {@link #EPSILON_SCORE}, so it is never negative (the net effect being that terms appearing in more than half of the documents have almost no weight).
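As an illustration of the formula above, the per-term weight can be sketched as a small standalone Java class. This is only a sketch under stated assumptions, not the code of BM25Scorer itself: the class name `Bm25WeightSketch`, its method names, and the constant values `K1`, `B` and `EPSILON` are hypothetical stand-ins for {@link #DEFAULT_K1}, {@link #DEFAULT_B} and {@link #EPSILON_SCORE}, whose actual values are not stated here.

```java
public final class Bm25WeightSketch {
    // Hypothetical values chosen for illustration only; they are NOT
    // read from BM25Scorer's DEFAULT_K1, DEFAULT_B or EPSILON_SCORE.
    static final double K1 = 1.2;
    static final double B = 0.5;
    static final double EPSILON = 1e-6;

    /** The idf part, bounded below by EPSILON so that it is never negative. */
    static double idf(long n, long f) {
        return Math.max(EPSILON, Math.log((n - f + 0.5) / (f + 0.5)));
    }

    /**
     * Weight of a term appearing in f documents out of n, for a document
     * of length l (in words) in which the term occurs c times;
     * avgL is the average document length.
     */
    static double weight(long n, long f, int c, double l, double avgL) {
        return idf(n, f) * (K1 + 1) * c / (c + K1 * ((1 - B) + B * l / avgL));
    }

    public static void main(String[] args) {
        // A rare term (10 documents out of 1000) gets a substantial weight.
        System.out.println(weight(1000, 10, 3, 120, 100));
        // A term appearing in more than half of the documents gets almost none.
        System.out.println(weight(1000, 900, 3, 120, 100));
    }
}
```

With these assumed values, the second call prints a weight on the order of EPSILON, which shows the effect of the lower bound on the logarithmic part.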

This class uses a {@link it.unimi.dsi.mg4j.search.visitor.CounterCollectionVisitor} and related classes (by means of {@link DocumentIterator#acceptOnTruePaths(it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor)}) to take into consideration only terms that are actually involved in query semantics for the current document.

@author Mauro Mereu
@author Sebastiano Vigna
