Package org.apache.lucene.analysis

Examples of org.apache.lucene.analysis.TokenStream


    // Ideally the Analyzer superclass should have a method with the same signature,
    // with a default impl that simply delegates to the StringReader flavour.
    if (text == null)
      throw new IllegalArgumentException("text must not be null");
   
    TokenStream stream;
    if (pattern == NON_WORD_PATTERN) { // fast path
      stream = new FastStringTokenizer(text, true, toLowerCase, stopWords);
    }
    else if (pattern == WHITESPACE_PATTERN) { // fast path
      stream = new FastStringTokenizer(text, false, toLowerCase, stopWords);
    }
    else { // general case: tokenize with the supplied regex pattern
      stream = new PatternTokenizer(text, pattern, toLowerCase);
      if (stopWords != null)
        stream = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                stream, stopWords);
    }
    return stream;
  }
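A minimal usage sketch, assuming the Lucene 2.9 contrib class PatternAnalyzer these lines come from; its preconfigured DEFAULT_ANALYZER uses NON_WORD_PATTERN and therefore takes the first fast path above:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.memory.PatternAnalyzer;

    // DEFAULT_ANALYZER: non-word splitting, lowercasing, English stop words.
    TokenStream stream =
        PatternAnalyzer.DEFAULT_ANALYZER.tokenStream("content", "The quick brown fox");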


  /**
   * Creates a {@link TokenStream} which tokenizes all the text in the provided {@link Reader}.
   *
   * @return  A {@link TokenStream} built from a {@link StandardTokenizer} filtered with
   *          {@link LowerCaseFilter}, {@link StandardFilter}, {@link StopFilter}, and
   *          {@link BrazilianStemFilter}.
   */
  @Override
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer( matchVersion, reader );
    result = new LowerCaseFilter( result );
    result = new StandardFilter( result );
    result = new StopFilter( StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                         result, stoptable );
    result = new BrazilianStemFilter( result, excltable );
    return result;
  }
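A sketch of driving such a chain end to end with the Lucene 2.9 attribute API; the field name and sample text are illustrative:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.br.BrazilianAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    static void printTokens(String text) throws IOException {
      TokenStream ts = new BrazilianAnalyzer(Version.LUCENE_29)
          .tokenStream("body", new StringReader(text));
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      while (ts.incrementToken()) {
        System.out.println(term.term()); // lowercased, stop-filtered, stemmed
      }
      ts.close();
    }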

    this.collator = collator;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new KeywordTokenizer(reader);
    result = new CollationKeyFilter(result, collator);
    return result;
  }
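This matches the contrib CollationKeyAnalyzer: KeywordTokenizer emits the whole field value as a single token, and CollationKeyFilter replaces its text with an indexable collation key, enabling locale-aware sorting and range queries. A minimal setup sketch; the locale is illustrative:

    import java.text.Collator;
    import java.util.Locale;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.collation.CollationKeyAnalyzer;

    // Index time and query time must use the same Collator, or keys will not compare.
    Collator collator = Collator.getInstance(new Locale("fr", "FR"));
    Analyzer analyzer = new CollationKeyAnalyzer(collator);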

  private class PayloadAnalyzer extends Analyzer {


    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new LowerCaseTokenizer(reader);
      result = new PayloadFilter(result, fieldName);
      return result;
    }
  }
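PayloadFilter here is a test-local class whose body is not shown; a minimal sketch of such a filter under the Lucene 2.9 attribute API (the class name and payload contents are hypothetical):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.index.Payload;

    // Hypothetical: attaches the field name as a payload to every token.
    final class FieldNamePayloadFilter extends TokenFilter {
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
      private final byte[] bytes;

      FieldNamePayloadFilter(TokenStream input, String fieldName) {
        super(input);
        this.bytes = fieldName.getBytes();
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        payloadAtt.setPayload(new Payload(bytes));
        return true;
      }
    }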


  public QueryTermVector(String queryString, Analyzer analyzer) {   
    if (analyzer != null)
    {
      TokenStream stream = analyzer.tokenStream("", new StringReader(queryString));
      if (stream != null)
      {
        List<String> terms = new ArrayList<String>();
        try {
          boolean hasMoreTokens = false;
         
          stream.reset();
          TermAttribute termAtt = stream.addAttribute(TermAttribute.class);

          hasMoreTokens = stream.incrementToken();
          while (hasMoreTokens) {
            terms.add(termAtt.term());
            hasMoreTokens = stream.incrementToken();
          }
          processTerms(terms.toArray(new String[terms.size()]));
        } catch (IOException e) {
          // ignore - analyzing a StringReader should not throw
        }
      }
    }
  }
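QueryTermVector implements TermFreqVector, so the terms collected by this loop can be read back (sorted and de-duplicated by processTerms) along with their frequencies. A usage sketch; the analyzer and query string are illustrative:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.search.QueryTermVector;

    QueryTermVector vector = new QueryTermVector("foo bar foo", new WhitespaceAnalyzer());
    String[] terms = vector.getTerms();        // ["bar", "foo"]
    int[] freqs = vector.getTermFrequencies(); // [1, 2]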

      FieldQueryNode fieldNode = (FieldQueryNode) node;
      String text = fieldNode.getTextAsString();
      String field = fieldNode.getFieldAsString();

      TokenStream source = this.analyzer.tokenStream(field, new StringReader(text));
      CachingTokenFilter buffer = new CachingTokenFilter(source);

      PositionIncrementAttribute posIncrAtt = null;
      int numTokens = 0;
      int positionCount = 0;
      boolean severalTokensAtSamePosition = false;

      if (buffer.hasAttribute(PositionIncrementAttribute.class)) {
        posIncrAtt = buffer.getAttribute(PositionIncrementAttribute.class);
      }

      try {
        while (buffer.incrementToken()) {
          numTokens++;
          int positionIncrement = (posIncrAtt != null) ? posIncrAtt.getPositionIncrement() : 1;
          if (positionIncrement != 0) {
            positionCount += positionIncrement;
          } else {
            severalTokensAtSamePosition = true;
          }
        }
      } catch (IOException e) {
        // ignore
      }

      try {
        // rewind the buffer stream
        buffer.reset();

        // close original stream - all tokens buffered
        source.close();
      } catch (IOException e) {
        // ignore
      }

      if (!buffer.hasAttribute(TermAttribute.class)) {
        return new NoTokenFoundQueryNode(); // nothing was tokenized
      }
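The buffer-count-rewind pattern works because CachingTokenFilter records every token from its input on the first pass and replays the cache after reset(). A compact sketch of that behaviour; the analyzer, field name, and text are illustrative:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.CachingTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    static void replay() throws IOException {
      TokenStream source =
          new WhitespaceAnalyzer().tokenStream("f", new StringReader("a b c"));
      CachingTokenFilter buffer = new CachingTokenFilter(source);

      int first = 0;
      while (buffer.incrementToken()) first++;  // fills the cache: 3 tokens

      buffer.reset();                           // rewind to the start of the cache
      int second = 0;
      while (buffer.incrementToken()) second++; // replays the same 3 tokens
    }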

    if (text == null)
      throw new IllegalArgumentException("text must not be null");
    if (analyzer == null)
      throw new IllegalArgumentException("analyzer must not be null");
   
    TokenStream stream = analyzer.tokenStream(fieldName,
        new StringReader(text));

    addField(fieldName, stream);
  }
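These lines are from MemoryIndex, which analyzes the text into a transient, single-document index that can be queried immediately. A usage sketch; the analyzer, field, and query are illustrative:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.TermQuery;

    MemoryIndex index = new MemoryIndex();
    index.addField("content", "quick brown fox", new WhitespaceAnalyzer());
    float score = index.search(new TermQuery(new Term("content", "fox")));
    // score > 0.0f: the single in-memory document matches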

  public <T> TokenStream keywordTokenStream(final Collection<T> keywords) {
    // TODO: deprecate & move this method into AnalyzerUtil?
    if (keywords == null)
      throw new IllegalArgumentException("keywords must not be null");
   
    return new TokenStream() {
      private Iterator<T> iter = keywords.iterator();
      private int start = 0;
      private TermAttribute termAtt = addAttribute(TermAttribute.class);
      private OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
     
      public boolean incrementToken() {
        if (!iter.hasNext()) return false;
        T obj = iter.next();
        if (obj == null)
          throw new IllegalArgumentException("keyword must not be null");
        String term = obj.toString();
        clearAttributes();
        termAtt.setTermBuffer(term);
        offsetAtt.setOffset(start, start + term.length());
        start += term.length() + 1; // separate words by one (blank) character
        return true;
      }
    };
  }
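A usage sketch for this stream: feeding a collection of untokenized keys into a MemoryIndex field via the TokenStream flavour of addField (the field name and keys are illustrative):

    import java.util.Arrays;
    import org.apache.lucene.index.memory.MemoryIndex;

    MemoryIndex index = new MemoryIndex();
    index.addField("id", index.keywordTokenStream(Arrays.asList("doc-17", "doc-42")));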

  /**
   * Creates a {@link TokenStream} which tokenizes all the text in the provided {@link Reader}.
   *
   * @return  A {@link TokenStream} built from an {@link ArabicLetterTokenizer} filtered with
   *          {@link LowerCaseFilter}, {@link StopFilter}, {@link ArabicNormalizationFilter}
   *          and {@link ArabicStemFilter}.
   */
  @Override
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new ArabicLetterTokenizer( reader );
    result = new LowerCaseFilter(result);
    // the order here is important: the stopword list is not normalized!
    result = new StopFilter( StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                             result, stoptable );
    result = new ArabicNormalizationFilter( result );
    result = new ArabicStemFilter( result );
    return result;
  }

