This should be a good tokenizer for most European-language documents:

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, {@link StandardTokenizer} implements Unicode text segmentation, as specified by UAX#29.
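As an illustration of the token rules listed above, the minimal sketch below runs ClassicTokenizer over a short sample string and prints each token. It assumes a Lucene release (roughly 5.x through 8.x) where ClassicTokenizer lives in org.apache.lucene.analysis.standard and has a no-argument constructor plus setReader; in Lucene 9 the class moved to org.apache.lucene.analysis.classic. The input text and class name ClassicTokenizerDemo are made up for this example.

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.ClassicTokenizer; // org.apache.lucene.analysis.classic in Lucene 9+
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ClassicTokenizerDemo {
      public static void main(String[] args) throws Exception {
        // Sample input exercising the dot, hyphen, and hostname/email rules above.
        String text = "Visit lucene.apache.org, email dev@lucene.apache.org, or try model AB-3.14-X7.";

        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
          tokenizer.setReader(new StringReader(text));
          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

          tokenizer.reset();                      // required before the first incrementToken()
          while (tokenizer.incrementToken()) {
            System.out.println(term.toString());  // one token per line
          }
          tokenizer.end();                        // finalize end-of-stream offsets
        }                                         // try-with-resources closes the stream
      }
    }

Under these assumptions, the hostname and email address each come out as a single token, and the digit-bearing "AB-3.14-X7" is kept whole as a product number rather than split at its hyphens.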