A Tokenizer extends the Iterator interface, but provides a lookahead operation, peek(). An implementation of this interface is expected to have a constructor that takes a single argument, a Reader.
@author Teg Grenager (grenager@stanford.edu)
Title: Virtual Oscilloscope.
Description: An oscilloscope simulator.
Copyright (C) 2003 José Manuel Gómez Soriano
This file is part of Virtual Oscilloscope.

Virtual Oscilloscope is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

Virtual Oscilloscope is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Virtual Oscilloscope; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
This class implements the Locator interface. This is not an incidental implementation detail: users of this class are encouraged to make use of the Locator nature.

By default, the tokenizer may report data that XML 1.0 bans. The tokenizer can be configured to treat these conditions as fatal or to coerce the infoset to something that XML 1.0 allows.
@version $Id$
@author hsivonen
Tokenization is a necessary step before more complex NLP tasks can be applied; these tasks usually process text on the token level. The quality of tokenization is important because it influences the performance of the higher-level tasks applied to it.

In segmented languages like English, most words are delimited by white space, except for punctuation, which is attached directly to the word without intervening white space. It is not possible to simply split at every punctuation mark, because in abbreviations the dots are part of the token itself. A tokenizer is responsible for splitting such tokens correctly.

In non-segmented languages like Chinese, tokenization is more difficult, since words are not delimited by white space.

Tokenizers can also be used to segment already-identified tokens further into more atomic parts to gain a deeper understanding. This approach helps more complex tasks gain insight into tokens that do not represent words, such as numbers, units, or tokens that are part of a special notation.

For most downstream tasks it is desirable to over-tokenize rather than under-tokenize.
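As a rough illustration of the abbreviation problem described above, a minimal tokenizer might split on white space and then detach trailing punctuation unless the token appears in a known-abbreviation list. This is only a sketch; the class name and the abbreviation list are illustrative and not part of any of the libraries documented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch: split on white space, then detach a trailing period
// as its own token unless the piece is a known abbreviation.
public class SimpleTokenizer {
    // Illustrative abbreviation list; a real tokenizer would use a model
    // or a much larger dictionary.
    private static final Set<String> ABBREVIATIONS = Set.of("Dr.", "e.g.", "etc.");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!ABBREVIATIONS.contains(piece) && piece.length() > 1
                    && piece.endsWith(".")) {
                // Sentence-final period: split it off as a separate token.
                tokens.add(piece.substring(0, piece.length() - 1));
                tokens.add(".");
            } else {
                // Abbreviations keep their dot attached.
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Dr. Smith left early."));
    }
}
```

Note that this sketch over-tokenizes sentence-final periods into separate tokens, which, as noted above, is usually preferable to leaving them attached.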
Mainly used for turning Cell rowKeys into a trie, but also used for family and qualifier encoding.
This is an abstract class.
NOTE: Subclasses must override {@link #incrementToken()} if the new TokenStream API is used, and {@link #next(Token)} or {@link #next()} if the old TokenStream API is used.

NOTE: Subclasses overriding {@link #incrementToken()} must call {@link AttributeSource#clearAttributes()} before setting attributes. Subclasses overriding {@link #next(Token)} must call {@link Token#clear()} before setting Token attributes.