Examples of org.apache.lucene.index.pruning.RIDFTermPruningPolicy

org.apache.lucene.index.pruning.RIDFTermPruningPolicy

c.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf.

Residual IDF measures the difference between the collection-wide IDF of a term (which assumes a uniform distribution of occurrences) and the actual observed total number of occurrences of that term in all documents. Positive values indicate that a term is informative (e.g. rare terms); negative values indicate that a term is not informative (e.g. too popular to offer good selectivity).
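For orientation, residual IDF is commonly written in the information-retrieval literature as the observed IDF minus the IDF expected under a Poisson model of term occurrences. The formula below is that textbook form, given as an assumption; it is not confirmed to be the exact expression this class evaluates. The symbols N (number of documents), df_t (number of documents containing term t) and cf_t (total occurrences of t in the collection) are notation introduced here.

```latex
\mathrm{RIDF}(t)
  = \underbrace{-\log_2 \frac{df_t}{N}}_{\text{observed IDF}}
  \;-\;
  \underbrace{\left(-\log_2\!\left(1 - e^{-cf_t / N}\right)\right)}_{\text{IDF expected under Poisson}}
  = -\log_2 \frac{df_t}{N} + \log_2\!\left(1 - e^{-cf_t / N}\right)
```

Under this form, a term whose occurrences are concentrated in few documents (cf_t much larger than df_t) scores well above 0, while a term that occurs about once in each document that contains it scores slightly below 0.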

This metric produces small values, roughly within [-1, 1], so useful thresholds for this metric lie somewhere in [0, 1]. The higher the threshold, the more informative (and rarer) the retained terms will be. For filtering out common words, a value close to or slightly below 0 (e.g. -0.1) should be a good starting point.
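The effect of the threshold can be sketched without Lucene. The class below is a minimal, hypothetical illustration: it assumes the textbook residual IDF (observed IDF minus the IDF expected under a Poisson model) and assumes that a term is retained when its score is at or above the threshold. The class name, the method names and the sample statistics are all invented for this sketch and are not part of RIDFTermPruningPolicy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NOT the Lucene implementation, only an illustration
// of thresholding terms by a textbook residual IDF score.
public class RidfThresholdSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Assumed definition: observed IDF minus the IDF expected under a Poisson model.
    static double ridf(long docFreq, long collFreq, long numDocs) {
        double observedIdf = -log2((double) docFreq / numDocs);
        double expectedIdf = -log2(1.0 - Math.exp(-(double) collFreq / numDocs));
        return observedIdf - expectedIdf;
    }

    // stats maps a term to {docFreq, collFreq}; returns the terms scoring >= threshold.
    static List<String> retained(Map<String, long[]> stats, long numDocs, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            long[] s = e.getValue();
            if (ridf(s[0], s[1], numDocs) >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    // Made-up statistics for a collection of 1000 documents.
    static Map<String, long[]> sampleStats() {
        Map<String, long[]> stats = new LinkedHashMap<>();
        stats.put("common", new long[] {900, 900});  // once in almost every document
        stats.put("spread", new long[] {100, 100});  // once in each of 100 documents
        stats.put("bursty", new long[] {10, 100});   // 100 occurrences in only 10 documents
        return stats;
    }

    public static void main(String[] args) {
        System.out.println(retained(sampleStats(), 1000, -0.1)); // prints [spread, bursty]
        System.out.println(retained(sampleStats(), 1000, 0.5));  // prints [bursty]
    }
}
```

With these sample numbers, a threshold slightly below 0 drops only the near-ubiquitous term, while a threshold of 0.5 keeps only the term whose occurrences are concentrated in a few documents.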

  
  public void testRIDFPruning() throws Exception {
    RAMDirectory targetDir = new RAMDirectory();
    // sourceDir is a test fixture defined outside this snippet; open it read-only
    IndexReader in = IndexReader.open(sourceDir, true);
    // remove only very popular terms (threshold slightly below 0)
    RIDFTermPruningPolicy ridf = new RIDFTermPruningPolicy(in, null, null, -0.12);
    PruningReader tfr = new PruningReader(in, null, ridf);
    // "one" in field "body" is expected to lose all of its postings
    assertTDCount(tfr, new Term("body", "one"), 0);
    // the remaining terms are expected to keep postings in the listed documents
    assertTD(tfr, new Term("body", "two"), new int[]{0, 1, 2, 4});
    assertTD(tfr, new Term("body", "three"), new int[]{0, 1, 3});
    assertTD(tfr, new Term("test", "one"), new int[]{4});
  }
