Examples of TokenFilter

edu.ucla.sspace.text.TokenFilter
A utility for asserting which tokens are valid and invalid within a stream of tokens. A filter may be either inclusive or exclusive.
An inclusive filter accepts only those tokens with which it was initialized. For example, an inclusive filter initialized with all of the words in the English dictionary would exclude all misspellings and foreign words in a token stream.
An exclusive filter accepts only those tokens that are not in the set with which it was initialized. An exclusive filter is often used with a list of common words that should be excluded, also known as a "stop list."
{@code TokenFilter} instances may be combined into a linear chain of filters. This allows a highly configurable filter to be built from multiple rules. Chained filters are created in a linear order, and every filter in the chain must accept the token for the last filter to return {@code true}. If any of the earlier filters returns {@code false}, the token is not accepted.
This class also provides a static utility function {@link #loadFromSpecification(String) loadFromSpecification} for initializing a chain of filters from a text configuration. This is intended to facilitate command-line tools that want to provide easily configurable filters. An example configuration might look like: include=top-tokens.txt:test-words.txt,exclude=stop-words.txt @see FilteredIterator
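
The inclusive, exclusive, and chained behaviour described above can be sketched with plain Java collections. This is a minimal illustration only: SimpleTokenFilter and SimpleTokenFilterDemo are hypothetical names, not the S-Space class or its API, and the word lists are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a chained token filter; NOT the S-Space TokenFilter. */
final class SimpleTokenFilter {
    private final Set<String> tokens;
    private final boolean exclude;          // true: reject listed tokens; false: accept only listed tokens
    private final SimpleTokenFilter parent; // earlier filter in the chain, or null for the first filter

    SimpleTokenFilter(Set<String> tokens, boolean exclude, SimpleTokenFilter parent) {
        this.tokens = new HashSet<>(tokens);
        this.exclude = exclude;
        this.parent = parent;
    }

    /** Accepts a token only if every earlier filter and this filter accept it. */
    boolean accept(String token) {
        if (parent != null && !parent.accept(token)) {
            return false;
        }
        // Inclusive: the token must be listed. Exclusive: the token must not be listed.
        return exclude != tokens.contains(token);
    }
}

final class SimpleTokenFilterDemo {
    /** Builds an inclusive filter followed by an exclusive "stop list" filter. */
    static SimpleTokenFilter build() {
        Set<String> dictionary = new HashSet<>(Arrays.asList("cat", "dog", "the"));
        Set<String> stopList = new HashSet<>(Arrays.asList("the"));
        SimpleTokenFilter inclusive = new SimpleTokenFilter(dictionary, false, null);
        return new SimpleTokenFilter(stopList, true, inclusive);
    }

    public static void main(String[] args) {
        SimpleTokenFilter filter = build();
        for (String token : Arrays.asList("cat", "the", "zebra")) {
            System.out.println(token + " accepted: " + filter.accept(token));
        }
    }
}
```

Here "cat" passes both filters, "the" is rejected by the stop list, and "zebra" is rejected by the inclusive filter before the stop list is consulted.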
org.apache.lucene.analysis.TokenFilter
A TokenFilter is a TokenStream whose input is another token stream.
This is an abstract class. NOTE: subclasses must override {@link #incrementToken()} if the new TokenStream API is used, and {@link #next(Token)} or {@link #next()} if the old TokenStream API is used.
See {@link TokenStream}
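
The idea of a stream whose input is another stream, with subclasses overriding incrementToken(), can be modelled without Lucene on the classpath. The sketch below is a simplified analogue under stated assumptions: MiniTokenStream, ListTokenStream, MiniTokenFilter, and LowerCaseMiniFilter are hypothetical classes, and the single term field stands in for Lucene's attribute machinery, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical, much-simplified token stream; NOT Lucene's TokenStream. */
abstract class MiniTokenStream {
    protected String term;

    /** Advances to the next token; returns false when the stream is exhausted. */
    abstract boolean incrementToken();

    String term() {
        return term;
    }
}

/** A source stream backed by a fixed list of tokens. */
final class ListTokenStream extends MiniTokenStream {
    private final Iterator<String> it;

    ListTokenStream(List<String> tokens) {
        this.it = tokens.iterator();
    }

    @Override
    boolean incrementToken() {
        if (!it.hasNext()) {
            return false;
        }
        term = it.next();
        return true;
    }
}

/** A stream whose input is another stream; subclasses override incrementToken(). */
abstract class MiniTokenFilter extends MiniTokenStream {
    protected final MiniTokenStream input;

    MiniTokenFilter(MiniTokenStream input) {
        this.input = input;
    }
}

/** Example filter: lower-cases every token produced by the wrapped stream. */
final class LowerCaseMiniFilter extends MiniTokenFilter {
    LowerCaseMiniFilter(MiniTokenStream input) {
        super(input);
    }

    @Override
    boolean incrementToken() {
        if (!input.incrementToken()) {
            return false;
        }
        term = input.term().toLowerCase(Locale.ROOT);
        return true;
    }
}

final class MiniTokenFilterDemo {
    /** Collects every token of a stream into a list. */
    static List<String> drain(MiniTokenStream stream) {
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.term());
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTokenStream source = new ListTokenStream(Arrays.asList("How", "The", "cow"));
        System.out.println(drain(new LowerCaseMiniFilter(source)));
    }
}
```

Because a filter is itself a stream, filters can be stacked by passing one filter as the input of another.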
org.apache.tools.ant.filters.TokenFilter
This splits up input into tokens and passes the tokens to a sequence of filters. @author Peter Reilly @since Ant 1.6 @see BaseFilterReader @see ChainableReader @see DynamicConfigurator
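
As a rough illustration of splitting input into tokens and passing each token through a sequence of filters, the sketch below uses only the Java standard library. It does not use Ant's classes. TokenPipelineSketch is a hypothetical name, and two conventions are assumptions made for this example: tokens are lines, and a filter returning null drops the token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical line-based token pipeline; NOT the Ant TokenFilter API. */
final class TokenPipelineSketch {

    /** Splits the input into lines and runs every line through the filters in order. */
    static String process(String input, List<UnaryOperator<String>> filters) {
        List<String> kept = new ArrayList<>();
        for (String token : input.split("\n")) {
            String current = token;
            for (UnaryOperator<String> filter : filters) {
                current = filter.apply(current);
                if (current == null) {
                    break; // assumption of this sketch: null means "drop the token"
                }
            }
            if (current != null) {
                kept.add(current);
            }
        }
        return String.join("\n", kept);
    }

    /** Example chain: trim whitespace, then drop lines that start with '#'. */
    static List<UnaryOperator<String>> exampleFilters() {
        List<UnaryOperator<String>> filters = new ArrayList<>();
        filters.add(s -> s.trim());
        filters.add(s -> s.startsWith("#") ? null : s);
        return filters;
    }

    public static void main(String[] args) {
        System.out.println(process("  alpha  \n# note\nbeta", exampleFilters()));
    }
}
```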
org.apache.uima.conceptMapper.support.tokens.TokenFilter

Examples of org.apache.lucene.analysis.TokenFilter

  public void testCaseSensitive() throws Exception {
    final String input = "How The s a brown s cow d like A B thing?";
    MockTokenizer wt = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    Set common = CommonGramsFilter.makeCommonSet(commonWords);
    TokenFilter cgf = new CommonGramsFilter(wt, common, false);
    assertTokenStreamContents(cgf, new String[] {"How", "The", "The_s", "s",
        "s_a", "a", "a_brown", "brown", "brown_s", "s", "s_cow", "cow",
        "cow_d", "d", "d_like", "like", "A", "B", "thing?"});
  }
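
The expected token list in the test above suggests a simple rule: emit every word, and additionally emit word_next whenever the word or its successor is a common word. The sketch below implements that rule with plain collections. It is not the Lucene filter: CommonGramsSketch is a hypothetical name, the common-word set used in main is an assumption (the test's commonWords field is not shown in the excerpt), and token positions and types are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of index-time common-gram generation; NOT Lucene's CommonGramsFilter. */
final class CommonGramsSketch {

    /** Emits each word, followed by a "word_next" bigram when either side is a common word. */
    static List<String> commonGrams(List<String> words, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(word);
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                // Matching is case-sensitive, as in the test above.
                if (common.contains(word) || common.contains(next)) {
                    out.add(word + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed common-word list; the real list is not part of the excerpt.
        Set<String> common = new HashSet<>(Arrays.asList("s", "a", "b", "c", "d", "the", "of"));
        List<String> words = Arrays.asList("How The s a brown s cow d like A B thing?".split(" "));
        System.out.println(commonGrams(words, common));
    }
}
```

With the assumed word list, "The" and "A" are not treated as common because the comparison is case-sensitive, which is why no "How_The" or "A_B" bigram appears.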


Examples of org.apache.lucene.analysis.TokenFilter

  public void testLastWordisStopWord() throws Exception {
    final String input = "dog the";
    MockTokenizer wt = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    CommonGramsFilter cgf = new CommonGramsFilter(wt, commonWords);
    TokenFilter nsf = new CommonGramsQueryFilter(cgf);
    assertTokenStreamContents(nsf, new String[] { "dog_the" });
  }


Examples of org.apache.lucene.analysis.TokenFilter

  public void testFirstWordisStopWord() throws Exception {
    final String input = "the dog";
    MockTokenizer wt = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    CommonGramsFilter cgf = new CommonGramsFilter(wt, commonWords);
    TokenFilter nsf = new CommonGramsQueryFilter(cgf);
    assertTokenStreamContents(nsf, new String[] { "the_dog" });
  }


Examples of org.apache.lucene.analysis.TokenFilter

  public void testOneWordQueryStopWord() throws Exception {
    final String input = "the";
    MockTokenizer wt = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    CommonGramsFilter cgf = new CommonGramsFilter(wt, commonWords);
    TokenFilter nsf = new CommonGramsQueryFilter(cgf);
    assertTokenStreamContents(nsf, new String[] { "the" });
  }


Examples of org.apache.lucene.analysis.TokenFilter

  public void testOneWordQuery() throws Exception {
    final String input = "monster";
    MockTokenizer wt = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    CommonGramsFilter cgf = new CommonGramsFilter(wt, commonWords);
    TokenFilter nsf = new CommonGramsQueryFilter(cgf);
    assertTokenStreamContents(nsf, new String[] { "monster" });
  }


Examples of org.apache.lucene.analysis.TokenFilter

  public void testFirstAndLastStopWord() throws Exception {
    final String input = "the of";
    MockTokenizer wt = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    CommonGramsFilter cgf = new CommonGramsFilter(wt, commonWords);
    TokenFilter nsf = new CommonGramsQueryFilter(cgf);
    assertTokenStreamContents(nsf, new String[] { "the_of" });
  }
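
The query-side tests above are consistent with the following rule, applied to the index-time stream of words and bigrams: always keep a bigram, drop a single word that is immediately followed by its bigram, and drop the final word when the token before it is a bigram. The sketch below implements that rule with plain collections. It is an assumption that reproduces the five expected outputs shown, not a statement of how the library behaves on other inputs. CommonGramsQuerySketch is a hypothetical name, and the common-word set {"the", "of"} is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of query-time common grams; NOT Lucene's CommonGramsQueryFilter. */
final class CommonGramsQuerySketch {

    static List<String> queryGrams(List<String> words, Set<String> common) {
        // Step 1: build the index-time stream (each word, plus a bigram when either side is common).
        List<String> texts = new ArrayList<>();
        List<Boolean> gram = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            texts.add(words.get(i));
            gram.add(false);
            if (i + 1 < words.size()
                    && (common.contains(words.get(i)) || common.contains(words.get(i + 1)))) {
                texts.add(words.get(i) + "_" + words.get(i + 1));
                gram.add(true);
            }
        }

        // Step 2: keep bigrams; drop single words that a neighbouring bigram already covers.
        List<String> out = new ArrayList<>();
        int last = texts.size() - 1;
        for (int i = 0; i <= last; i++) {
            boolean followedByGram = i < last && gram.get(i + 1);
            boolean trailingAfterGram = i == last && i > 0 && gram.get(i - 1);
            if (gram.get(i) || !(followedByGram || trailingAfterGram)) {
                out.add(texts.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "of")); // assumed word list
        for (String query : Arrays.asList("dog the", "the dog", "the", "monster", "the of")) {
            System.out.println(query + " -> " + queryGrams(Arrays.asList(query.split(" ")), common));
        }
    }
}
```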


Examples of org.apache.lucene.analysis.TokenFilter


      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        MockTokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.SIMPLE, true);
        tokenizer.setEnableChecks(false); // disable workflow checking as we forcefully close() in exceptional cases.
        return new TokenFilter(tokenizer) {
          private int count = 0;


          @Override
          public boolean incrementToken() throws IOException {
            if (count++ == 5) {


Examples of org.apache.lucene.analysis.TokenFilter

    RAMDirectory dir = new MockRAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new Analyzer() {


      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenFilter(new StandardTokenizer(Version.LUCENE_CURRENT, reader)) {
          private int count = 0;


          @Override
          public boolean incrementToken() throws IOException {
            if (count++ == 5) {


Examples of org.apache.lucene.analysis.TokenFilter


  public void testTokenReuse() throws IOException {
    Analyzer analyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenFilter(new WhitespaceTokenizer(TEST_VERSION_CURRENT, reader)) {
          boolean first = true;
          AttributeSource.State state;


          @Override
          public boolean incrementToken() throws IOException {


Examples of org.apache.lucene.analysis.TokenFilter

    // vowel shortening
    check("आईऊॠॡऐऔीूॄॣैौ", "अइउऋऌएओिुृॢेो");
  }
  private void check(String input, String output) throws IOException {
    Tokenizer tokenizer = new MockTokenizer(new StringReader(input), MockTokenizer.WHITESPACE, false);
    TokenFilter tf = new HindiNormalizationFilter(tokenizer);
    assertTokenStreamContents(tf, new String[] { output });
  }

