Examples of WordDelimiterFilter

One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want queries for "wifi", "WiFi", "wi-fi", and "wi+fi" to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing and combinations="0" (the default) in the analyzer used for querying; a sketch of the index-time side follows the class description below. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do so (such as WhitespaceTokenizer).
  • org.apache.solr.analysis.WordDelimiterFilter
    Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
    - split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" -> "Wi", "Fi"
    - split on case transitions: "PowerShot" -> "Power", "Shot"
    - split on letter-number transitions: "SD500" -> "SD", "500"
    - leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
    - trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil" (this step isn't performed in a separate filter because of possible subword combinations)

    The combinations parameter affects how subwords are combined:
    - combinations="0" causes no subword combinations: "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
    - combinations="1" means that, in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position as the last subword in the run:
      - "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"
      - "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
      - "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
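    A minimal sketch of the index-time side of that recipe, in the flags-based API used by the excerpts below (an assumption: combinations="1" corresponds roughly to adding the CATENATE_WORDS flag; TEST_VERSION_CURRENT is the test constant the snippets use):

      Analyzer indexAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String field, Reader reader) {
          // WhitespaceTokenizer leaves "wi-fi" intact for the filter to split
          Tokenizer tokenizer = new WhitespaceTokenizer(TEST_VERSION_CURRENT, reader);
          int flags = GENERATE_WORD_PARTS | CATENATE_WORDS; // ~ combinations="1"
          return new TokenStreamComponents(tokenizer,
              new WordDelimiterFilter(TEST_VERSION_CURRENT, tokenizer, flags, null));
        }
      };
      // "wi-fi" is indexed as "wi", "fi" and the catenated "wifi"; a query-side
      // analyzer without CATENATE_WORDS (~ combinations="0") then lets "wifi",
      // "WiFi", "wi-fi" and "wi+fi" queries all match.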

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      // ... tail of the preceding test method, truncated in this excerpt ...

      public void doSplit(final String input, String... output) throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
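        // MockTokenizer.KEYWORD emits the whole input as one token, so any
        // splitting observed downstream is WordDelimiterFilter's doing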
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new MockTokenizer(
                    new StringReader(input), MockTokenizer.KEYWORD, false), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf, output);
      }
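    For reference, the surrounding test drives this helper with inputs like the following (expected outputs follow the filter's documented rules; the exact call list here is illustrative):

      doSplit("basic-split", "basic", "split");  // split on intra-word delimiter
      doSplit("PowerShot", "Power", "Shot");     // SPLIT_ON_CASE_CHANGE
      doSplit("sd500", "sd", "500");             // SPLIT_ON_NUMERICS
      doSplit("O'Neil's", "O", "Neil");          // STEM_ENGLISH_POSSESSIVE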

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      }
     
      public void doSplitPossessive(int stemPossessive, final String input, final String... output) throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS;
        flags |= (stemPossessive == 1) ? STEM_ENGLISH_POSSESSIVE : 0;
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new MockTokenizer(
            new StringReader(input), MockTokenizer.KEYWORD, false), flags, null);

        assertTokenStreamContents(wdf, output);
      }
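    With stemPossessive=1 the trailing "'s" is stemmed away; with 0 it survives as its own subword (the apostrophe is still a delimiter). Illustrative calls, with outputs per the documented rules:

      doSplitPossessive(1, "ra's", "ra");        // possessive removed
      doSplitPossessive(0, "ra's", "ra", "s");   // "'s" kept as a subword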

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        /* analyzer that uses whitespace + wdf */
        Analyzer a = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT,
                tokenizer,
                flags, protWords));
          }
        };

        /* in this case, works as expected. */
        assertAnalyzesTo(a, "LUCENE / SOLR", new String[] { "LUCENE", "SOLR" },
            new int[] { 0, 9 },
            new int[] { 6, 13 },
            new int[] { 1, 1 });
       
        /* only in this case, posInc of 2 ?! */
        assertAnalyzesTo(a, "LUCENE / solR", new String[] { "LUCENE", "sol", "solR", "R" },
            new int[] { 0, 9, 9, 12 },
            new int[] { 6, 12, 13, 13 },
            new int[] { 1, 1, 0, 1 });
       
        assertAnalyzesTo(a, "LUCENE / NUTCH SOLR", new String[] { "LUCENE", "NUTCH", "SOLR" },
            new int[] { 0, 9, 15 },
            new int[] { 6, 14, 19 },
            new int[] { 1, 1, 1 });
       
        /* analyzer that will consume tokens with large position increments */
        Analyzer a2 = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT,
                new LargePosIncTokenFilter(tokenizer),
                flags, protWords));
          }
        };
       
        /* increment of "largegap" is preserved */
        assertAnalyzesTo(a2, "LUCENE largegap SOLR", new String[] { "LUCENE", "largegap", "SOLR" },
            new int[] { 0, 7, 16 },
            new int[] { 6, 15, 20 },
            new int[] { 1, 10, 1 });
       
        /* the "/" had a position increment of 10, where did it go?!?!! */
        assertAnalyzesTo(a2, "LUCENE / SOLR", new String[] { "LUCENE", "SOLR" },
            new int[] { 0, 9 },
            new int[] { 6, 13 },
            new int[] { 1, 11 });
       
        /* in this case, the increment of 10 from the "/" is carried over */
        assertAnalyzesTo(a2, "LUCENE / solR", new String[] { "LUCENE", "sol", "solR", "R" },
            new int[] { 0, 9, 9, 12 },
            new int[] { 6, 12, 13, 13 },
            new int[] { 1, 11, 0, 1 });
       
        assertAnalyzesTo(a2, "LUCENE / NUTCH SOLR", new String[] { "LUCENE", "NUTCH", "SOLR" },
            new int[] { 0, 9, 15 },
            new int[] { 6, 14, 19 },
            new int[] { 1, 11, 1 });

        Analyzer a3 = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            StopFilter filter = new StopFilter(TEST_VERSION_CURRENT,
                tokenizer, StandardAnalyzer.STOP_WORDS_SET);
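            // StopFilter runs before WDF here, so removed stopwords leave
            // position gaps that WordDelimiterFilter must carry through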
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT, filter, flags, protWords));
          }
        };

        assertAnalyzesTo(a3, "lucene.solr",
            new String[] { "lucene", "lucenesolr", "solr" },
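    The LargePosIncTokenFilter helper is not shown in this excerpt. A minimal sketch consistent with the assertions above (an assumption: it forces a position increment of 10 on the tokens the test manipulates):

      final class LargePosIncTokenFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

        LargePosIncTokenFilter(TokenStream input) {
          super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
          if (input.incrementToken()) {
            // give "largegap" and "/" an artificially large position increment
            if (termAtt.toString().equals("largegap") || termAtt.toString().equals("/")) {
              posIncAtt.setPositionIncrement(10);
            }
            return true;
          }
          return false;
        }
      }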

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        /* analyzer that uses whitespace + wdf */
        Analyzer a = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT, tokenizer, flags, null));
          }
        };
       
        assertAnalyzesTo(a, "abc-def-123-456",
            new String[] { "abc", "abcdef", "abcdef123456", "def", "123", "123456", "456" },

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        /* analyzer that uses whitespace + wdf */
        Analyzer a = new Analyzer() {
          @Override
          public TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
            return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT, tokenizer, flags, null));
          }
        };
       
        assertAnalyzesTo(a, "abc-def-123-456",
            new String[] { "abc-def-123-456", "abc", "abcdef", "abcdef123456", "def", "123", "123456", "456" },
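    The only difference from the previous snippet's output is the untouched original token "abc-def-123-456" at the front, which is what the PRESERVE_ORIGINAL flag adds. Both excerpts elide how `flags` is built; a plausible reconstruction for the outputs shown (an assumption, not the verbatim test code):

      // previous snippet: subwords plus every catenation
      int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
                | CATENATE_WORDS | CATENATE_NUMBERS | CATENATE_ALL;
      // this snippet: additionally emit the input token unchanged
      int flagsWithOriginal = flags | PRESERVE_ORIGINAL;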

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

          Analyzer a = new Analyzer() {
           
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
              return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT, tokenizer, flags, protectedWords));
            }
          };
          // TODO: properly support positionLengthAttribute
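      // 1000*RANDOM_MULTIPLIER iterations over random strings up to 20 chars;
      // the final "false" disables offset checks, which this filter cannot
      // always guarantee (subword offsets may go backwards)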
          checkRandomData(random(), a, 1000*RANDOM_MULTIPLIER, 20, false, false);
        }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

          Analyzer a = new Analyzer() {
           
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
              return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT, tokenizer, flags, protectedWords));
            }
          };
          // TODO: properly support positionLengthAttribute
          checkRandomData(random(), a, 100*RANDOM_MULTIPLIER, 8192, false, false);
        }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

       
          Analyzer a = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new KeywordTokenizer(reader);
              return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(TEST_VERSION_CURRENT, tokenizer, flags, protectedWords));
            }
          };
          // depending upon options, this thing may or may not preserve the empty term
          checkAnalysisConsistency(random, a, random.nextBoolean(), "");
        }

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      }

      @Override
      public TokenFilter create(TokenStream input) {
        if (luceneMatchVersion.onOrAfter(Version.LUCENE_4_8)) {
          return new WordDelimiterFilter(luceneMatchVersion, input, typeTable == null ? WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE : typeTable,
                                       flags, protectedWords);
        } else {
          return new Lucene47WordDelimiterFilter(input, typeTable == null ? WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE : typeTable,
                                      flags, protectedWords);
        }
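    This create() is from WordDelimiterFilterFactory: the luceneMatchVersion gate returns the frozen pre-4.8 implementation so that indexes built with older versions keep their original token ordering. A hedged construction example (parameter names follow the factory's documented options; the exact map contents are illustrative):

      Map<String, String> args = new HashMap<>();
      args.put("luceneMatchVersion", "4.8");   // selects the current implementation
      args.put("generateWordParts", "1");
      args.put("catenateWords", "1");          // the Solr-era combinations="1"
      WordDelimiterFilterFactory factory = new WordDelimiterFilterFactory(args);
      TokenFilter filter = factory.create(tokenStream);  // any upstream TokenStream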

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        Analyzer a = new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer tokenizer = new WikipediaTokenizer(reader);
            TokenStream stream = new SopTokenFilter(tokenizer);
            stream = new WordDelimiterFilter(TEST_VERSION_CURRENT, stream, table, -50, protWords);
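        // -50 is an arbitrary negative bit pattern: this looks like a reduced
        // regression case from randomized chains, checking consistency only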
            stream = new SopTokenFilter(stream);
        return new TokenStreamComponents(tokenizer, stream);
      }
    };
        checkAnalysisConsistency(random(), a, false, "B\u28c3\ue0f8[ \ud800\udfc2 </p> jb");
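    SopTokenFilter ("sop" as in System.out.println) is a test-only pass-through that prints each token state so failures in chains like this one can be traced. A minimal sketch (the exact output format is an assumption):

      final class SopTokenFilter extends TokenFilter {
        SopTokenFilter(TokenStream input) {
          super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
          if (input.incrementToken()) {
            // dump the full attribute state of the current token
            System.out.println(input.getClass().getSimpleName() + "->" + reflectAsString(false));
            return true;
          }
          return false;
        }
      }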