Examples of WordDelimiterFilter

WordDelimiterFilter splits words into subwords and performs optional transformations on subword groups.

  • org.apache.solr.analysis.WordDelimiterFilter

    Words are split into subwords with the following rules:
    - split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi" -> "Wi", "Fi"
    - split on case transitions: "PowerShot" -> "Power", "Shot"
    - split on letter-number transitions: "SD500" -> "SD", "500"
    - leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
    - trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil" (this step isn't performed in a separate filter because of possible subword combinations)

    The combinations parameter affects how subwords are combined:
    - combinations="0" causes no subword combinations: "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
    - combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
      "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"
      "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
      "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"

    One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want "wifi", "WiFi", "wi-fi", and "wi+fi" queries to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing and combinations="0" (the default) in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).

    @version $Id: WordDelimiterFilter.java 1166766 2011-09-08 15:52:10Z rmuir $
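    The snippets below drive WordDelimiterFilter through its flag-based constructor rather than the combinations attribute. As a minimal sketch of the index-versus-query setup described above, assuming CATENATE_WORDS as the flag-level counterpart of combinations="1" (the class and method names here are illustrative, and MockTokenizer from the Lucene test framework stands in for WhitespaceTokenizer, mirroring the snippets below):

      import java.io.Reader;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.MockTokenizer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;

      import static org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.*;

      public class WiFiAnalyzers { // illustrative name, not part of Lucene

        // flags shared by both analyzers: emit word and number parts, split on
        // case changes and letter-number transitions, strip trailing "'s"
        static final int BASE_FLAGS = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
            | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;

        /** Index-time analyzer: CATENATE_WORDS also produces "wifi" for "wi-fi". */
        static Analyzer indexAnalyzer() {
          return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
              TokenStream stream = new WordDelimiterFilter(tokenizer, BASE_FLAGS | CATENATE_WORDS, null);
              return new TokenStreamComponents(tokenizer, stream);
            }
          };
        }

        /** Query-time analyzer: subword parts only, so "wi+fi" still yields "wi", "fi". */
        static Analyzer queryAnalyzer() {
          return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
              TokenStream stream = new WordDelimiterFilter(tokenizer, BASE_FLAGS, null);
              return new TokenStreamComponents(tokenizer, stream);
            }
          };
        }
      }

    With this setup, indexing "wi-fi" stores "wi", "fi", and the catenated "wifi", so a query of "wi+fi" (split into "wi", "fi") and a query of "wifi" (left whole) both match.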

  • Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

  // create() method of a WordDelimiterFilterFactory: builds the filter over the
  // incoming TokenStream using the configured character-type table (falling back
  // to the default table), the configured flags, and the protected-words set
  @Override
  public WordDelimiterFilter create(TokenStream input) {
    return new WordDelimiterFilter(input,
        typeTable == null ? WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE : typeTable,
        flags, protectedWords);
  }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      @Test
      public void testOffsets() throws IOException {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        // test that subwords and catenated subwords have
        // the correct offsets.
        WordDelimiterFilter wdf = new WordDelimiterFilter(new SingleTokenTokenStream(new Token("foo-bar", 5, 12)), DEFAULT_WORD_DELIM_TABLE, flags, null);

        assertTokenStreamContents(wdf,
            new String[] { "foo", "bar", "foobar" },
            new int[] { 5, 9, 5 },
            new int[] { 8, 12, 12 },
            null, null, null, null, false);

    wdf = new WordDelimiterFilter(new SingleTokenTokenStream(new Token("foo-bar", 5, 6)), DEFAULT_WORD_DELIM_TABLE, flags, null);

    // here the token's offset width doesn't match its text length, so subword
    // offsets can't be derived; all tokens keep the original start/end offsets
    assertTokenStreamContents(wdf,
        new String[] { "foo", "bar", "foobar" },
        new int[] { 5, 5, 5 },
        new int[] { 6, 6, 6 },
        null, null, null, null, false);
  }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      @Test
      public void testOffsetChange() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(new SingleTokenTokenStream(new Token("übelkeit)", 7, 16)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "übelkeit" },
            new int[] { 7 },
        new int[] { 15 });
  }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      @Test
      public void testOffsetChange2() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(new SingleTokenTokenStream(new Token("(übelkeit", 7, 17)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "übelkeit" },
            new int[] { 8 },
        new int[] { 17 });
  }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      @Test
      public void testOffsetChange3() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(new SingleTokenTokenStream(new Token("(übelkeit", 7, 16)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "übelkeit" },
            new int[] { 8 },
        new int[] { 16 });
  }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      @Test
      public void testOffsetChange4() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(new SingleTokenTokenStream(new Token("(foo,bar)", 7, 16)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
    assertTokenStreamContents(wdf,
        new String[] { "foo", "bar", "foobar" },
        new int[] { 8, 12, 8 },
        new int[] { 11, 15, 15 },
        null, null, null, null, false);
  }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      public void doSplit(final String input, String... output) throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(new MockTokenizer(
                    new StringReader(input), MockTokenizer.KEYWORD, false), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf, output);
      }
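    For context, a helper like doSplit above would typically be driven by a test along the lines of the following sketch (the inputs and expected splits here are illustrative, not taken from this page):

      @Test
      public void testSplits() throws Exception {
        doSplit("basic-split", "basic", "split");        // split on the intra-word delimiter
        doSplit("camelCase", "camel", "Case");           // split on the case transition
        doSplit("PowerShot500", "Power", "Shot", "500"); // case change plus letter-number transition
      }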

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      public void doSplitPossessive(int stemPossessive, final String input, final String... output) throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS;
        flags |= (stemPossessive == 1) ? STEM_ENGLISH_POSSESSIVE : 0;
        WordDelimiterFilter wdf = new WordDelimiterFilter(new MockTokenizer(
            new StringReader(input), MockTokenizer.KEYWORD, false), flags, null);

        assertTokenStreamContents(wdf, output);
      }
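    A companion test for this helper might look like the sketch below (illustrative inputs): with possessive stemming enabled the trailing "'s" disappears, and with it disabled the "s" survives as its own subword.

      @Test
      public void testPossessives() throws Exception {
        doSplitPossessive(1, "ra's", "ra");      // STEM_ENGLISH_POSSESSIVE strips the "'s"
        doSplitPossessive(0, "ra's", "ra", "s"); // without it, the "s" becomes a subword
      }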

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        /* analyzer that uses whitespace + wdf */
        Analyzer a = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(
                tokenizer,
                flags, protWords));
          }
        };

        /* in this case, works as expected. */
        assertAnalyzesTo(a, "LUCENE / SOLR", new String[] { "LUCENE", "SOLR" },
            new int[] { 0, 9 },
            new int[] { 6, 13 },
            null,
            new int[] { 1, 1 },
            null,
            false);
       
        /* only in this case, posInc of 2 ?! */
        assertAnalyzesTo(a, "LUCENE / solR", new String[] { "LUCENE", "sol", "R", "solR" },
            new int[] { 0, 9, 12, 9 },
            new int[] { 6, 12, 13, 13 },
            null,
            new int[] { 1, 1, 1, 0 },
            null,
            false);
       
        assertAnalyzesTo(a, "LUCENE / NUTCH SOLR", new String[] { "LUCENE", "NUTCH", "SOLR" },
            new int[] { 0, 9, 15 },
            new int[] { 6, 14, 19 },
            null,
            new int[] { 1, 1, 1 },
            null,
            false);
       
        /* analyzer that will consume tokens with large position increments */
        Analyzer a2 = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(
                new LargePosIncTokenFilter(tokenizer),
                flags, protWords));
          }
        };
       
        /* increment of "largegap" is preserved */
        assertAnalyzesTo(a2, "LUCENE largegap SOLR", new String[] { "LUCENE", "largegap", "SOLR" },
            new int[] { 0, 7, 16 },
            new int[] { 6, 15, 20 },
            null,
            new int[] { 1, 10, 1 },
            null,
            false);
       
        /* the "/" had a position increment of 10, where did it go?!?!! */
        assertAnalyzesTo(a2, "LUCENE / SOLR", new String[] { "LUCENE", "SOLR" },
            new int[] { 0, 9 },
            new int[] { 6, 13 },
            null,
            new int[] { 1, 11 },
            null,
            false);
       
        /* in this case, the increment of 10 from the "/" is carried over */
        assertAnalyzesTo(a2, "LUCENE / solR", new String[] { "LUCENE", "sol", "R", "solR" },
            new int[] { 0, 9, 12, 9 },
            new int[] { 6, 12, 13, 13 },
            null,
            new int[] { 1, 11, 1, 0 },
            null,
            false);
       
        assertAnalyzesTo(a2, "LUCENE / NUTCH SOLR", new String[] { "LUCENE", "NUTCH", "SOLR" },
            new int[] { 0, 9, 15 },
            new int[] { 6, 14, 19 },
            null,
            new int[] { 1, 11, 1 },
            null,
            false);

        Analyzer a3 = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            StopFilter filter = new StopFilter(TEST_VERSION_CURRENT,
                tokenizer, StandardAnalyzer.STOP_WORDS_SET);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(filter, flags, protWords));
          }
        };

    assertAnalyzesTo(a3, "lucene.solr",
        new String[] { "lucene", "solr", "lucenesolr" },
        new int[] { 0, 7, 0 },
        new int[] { 6, 11, 11 },
        null,
        new int[] { 1, 1, 0 },
        null,
        false);

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

          Analyzer a = new Analyzer() {
           
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
              return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(tokenizer, flags, protectedWords));
            }
          };
          checkRandomData(random(), a, 200, 20, false, false);
        }
      }