Examples of WordDelimiterFilter

  • org.apache.solr.analysis.WordDelimiterFilter
    Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords on intra-word delimiters, case transitions, and letter-number transitions; leading and trailing intra-word delimiters on each subword are ignored ("//hello---there, 'dude'" -> "hello", "there", "dude"), and trailing "'s" are removed for each subword ("O'Neil's" -> "O", "Neil"). Note: the possessive step isn't performed in a separate filter because of possible subword combinations. The combinations parameter affects how subwords are combined:
    - combinations="0" causes no subword combinations: "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
    - combinations="1" means that, in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position as the last subword in the run:
      - "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"
      - "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
      - "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
    One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want "wifi", "WiFi", "wi-fi", and "wi+fi" queries to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing and combinations="0" (the default) in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
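    To make the indexing-versus-querying recommendation concrete, here is a minimal sketch of the two analyzer chains. It assumes a Lucene 4.x-era API matching the snippets below (createComponents(String, Reader), the flags-based WordDelimiterFilter constructor, and Version-taking WhitespaceTokenizer/LowerCaseFilter constructors); Version.LUCENE_47 and the exact flag choices are illustrative assumptions, not a prescribed configuration. CATENATE_WORDS plays roughly the role of combinations="1" at index time, while the query-time chain only generates subwords.

        import java.io.Reader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.core.LowerCaseFilter;
        import org.apache.lucene.analysis.core.WhitespaceTokenizer;
        import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
        import org.apache.lucene.util.Version;

        public class WiFiAnalyzers {

          // Index-time: split subwords and also catenate runs, so "wi-fi" indexes "wi", "fi", "wifi".
          static final Analyzer INDEX_ANALYZER = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_47, reader); // keeps "-" and "+" intact
              int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                        | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                        | WordDelimiterFilter.CATENATE_WORDS;            // roughly combinations="1"
              TokenStream chain = new WordDelimiterFilter(tokenizer, flags, null);
              chain = new LowerCaseFilter(Version.LUCENE_47, chain);     // so "WiFi" and "wifi" agree
              return new TokenStreamComponents(tokenizer, chain);
            }
          };

          // Query-time: only split (the combinations="0" default), so "wi+fi", "Wi-Fi" and "wifi"
          // queries all produce terms that exist in the index built above.
          static final Analyzer QUERY_ANALYZER = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_47, reader);
              int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                        | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE;
              TokenStream chain = new WordDelimiterFilter(tokenizer, flags, null);
              chain = new LowerCaseFilter(Version.LUCENE_47, chain);
              return new TokenStreamComponents(tokenizer, chain);
            }
          };
        }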

  • Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

       
          Analyzer a = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new KeywordTokenizer(reader);
              return new TokenStreamComponents(tokenizer, new WordDelimiterFilter(tokenizer, flags, protectedWords));
            }
          };
          // depending upon options, this thing may or may not preserve the empty term
          checkAnalysisConsistency(random, a, random.nextBoolean(), "");
        }
    View Full Code Here
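    In the snippet above, flags and protectedWords come from the surrounding randomized test rather than from this excerpt. A minimal, self-contained way to build comparable values (assuming the Lucene 4.x flag constants and the Version-taking CharArraySet constructor; the specific choices and Version.LUCENE_47 are illustrative) could look like:

        import java.util.Arrays;
        import org.apache.lucene.analysis.util.CharArraySet;
        import org.apache.lucene.util.Version;
        import static org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.*;

        // Split on delimiters, case changes and numerics, keep the original token as well,
        // and protect "wi-fi" from being split at all.
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
                  | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | PRESERVE_ORIGINAL;
        CharArraySet protectedWords =
            new CharArraySet(Version.LUCENE_47, Arrays.asList("wi-fi"), /* ignoreCase = */ false);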

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

                @Override public String name() {
                    return "word_delimiter";
                }

                @Override public TokenStream create(TokenStream tokenStream) {
                    return new WordDelimiterFilter(tokenStream, WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
                            1, 1, 0, 0, 0, 1, 0, 1, 1, null);
                }
            }));

            tokenFilterFactories.put("stop", new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
    View Full Code Here
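    The positional 1/0 arguments above use the older WordDelimiterFilter constructor; the next snippet on this page spells the leading positions out as generateWordParts, generateNumberParts, catenateWords, catenateNumbers, and so on. Where the flags-based constructor is available, a roughly comparable "word_delimiter" setup (this mapping is an assumption drawn from those names, not a verified translation of every position) could be written as:

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
        import org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator;

        // Generate word and number parts and split aggressively, but do not catenate
        // and do not preserve the original token.
        public TokenStream create(TokenStream tokenStream) {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                      | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                      | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                      | WordDelimiterFilter.SPLIT_ON_NUMERICS
                      | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            return new WordDelimiterFilter(tokenStream,
                    WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);
        }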

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

            Set<String> protectedWords = Analysis.getWordSet(env, settings, "protected_words");
            this.protoWords = protectedWords == null ? null : CharArraySet.copy(Lucene.VERSION, protectedWords);
        }

        @Override public TokenStream create(TokenStream tokenStream) {
            return new WordDelimiterFilter(tokenStream,
                    WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
                    generateWordParts ? 1 : 0,
                    generateNumberParts ? 1 : 0,
                    catenateWords ? 1 : 0,
                    catenateNumbers ? 1 : 0,
    View Full Code Here

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        }
      }

      @Override
      public WordDelimiterFilter create(TokenStream input) {
        return new WordDelimiterFilter(input, typeTable == null ? WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE : typeTable,
                                       flags, protectedWords);
      }
    View Full Code Here

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

        }
      }

      @Override
      public WordDelimiterFilter create(TokenStream input) {
        return new WordDelimiterFilter(input, typeTable == null ? WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE : typeTable,
                                       flags, protectedWords);
      }
    View Full Code Here

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      @Test
      public void testOffsets() throws IOException {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        // test that subwords and catenated subwords have
        // the correct offsets.
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("foo-bar", 5, 12)), DEFAULT_WORD_DELIM_TABLE, flags, null);

        assertTokenStreamContents(wdf,
            new String[] { "foo", "foobar", "bar" },
            new int[] { 5, 5, 9 },
            new int[] { 8, 12, 12 });

        wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("foo-bar", 5, 6)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "foo", "bar", "foobar" },
            new int[] { 5, 5, 5 },
            new int[] { 6, 6, 6 });
    View Full Code Here
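    The assertions above can also be checked by hand. Below is a small hypothetical helper (not part of the test) that uses the standard TokenStream attribute API to print each emitted term with its offsets and position increment; the usage line feeds it a KeywordTokenizer, as in the first snippet on this page, so offsets start at 0 rather than at the 5..12 range the test simulates.

        import java.io.IOException;
        import java.io.StringReader;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.core.KeywordTokenizer;
        import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
        import static org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.*;

        static void dumpTokens(TokenStream ts) throws IOException {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
          PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            System.out.printf("%s [%d,%d] +%d%n",
                term.toString(), offset.startOffset(), offset.endOffset(), posInc.getPositionIncrement());
          }
          ts.end();
          ts.close();
        }

        // For "foo-bar" this prints the terms "foo" [0,3], "foobar" [0,7] and "bar" [4,7],
        // mirroring the (shifted) offsets asserted in the test above.
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL
                  | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        dumpTokens(new WordDelimiterFilter(new KeywordTokenizer(new StringReader("foo-bar")), flags, null));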

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      }
     
      @Test
      public void testOffsetChange() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("übelkeit)", 7, 16)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "übelkeit" },
            new int[] { 7 },
            new int[] { 15 });
    View Full Code Here

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      }
     
      @Test
      public void testOffsetChange2() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("(übelkeit", 7, 17)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "übelkeit" },
            new int[] { 8 },
            new int[] { 17 });
    View Full Code Here

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      }
     
      @Test
      public void testOffsetChange3() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("(übelkeit", 7, 16)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "übelkeit" },
            new int[] { 8 },
            new int[] { 16 });
    View Full Code Here

    Examples of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

      }
     
      @Test
      public void testOffsetChange4() throws Exception {
        int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_ALL | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | STEM_ENGLISH_POSSESSIVE;
        WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("(foo,bar)", 7, 16)), DEFAULT_WORD_DELIM_TABLE, flags, null);
       
        assertTokenStreamContents(wdf,
            new String[] { "foo", "foobar", "bar"},
            new int[] { 8, 8, 12 },
            new int[] { 11, 15, 15 });
    View Full Code Here