A fast, rule-based tokenizer implementation, which produces Penn Treebank style tokenization of English text. It was initially written to conform to Penn Treebank tokenization conventions over ASCII text, but now provides a range of tokenization options over a broader space of Unicode text. It reads raw text and outputs tokens of classes that implement edu.stanford.nlp.ling.HasWord (typically a Word or a CoreLabel). It can optionally return end-of-lines as tokens.
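For example, a minimal tokenization loop might look like the following (a sketch, assuming the CoreNLP jar is on the classpath; the sample sentence is invented):

```java
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeExample {
  public static void main(String[] args) {
    String text = "\"Oh no,\" she said, \"our $400 blender can't handle something this hard!\"";
    // An empty options String gives the traditional PTB3 normalization behaviour.
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
        new StringReader(text), new CoreLabelTokenFactory(), "");
    while (tokenizer.hasNext()) {
      CoreLabel token = tokenizer.next();
      System.out.println(token.word());
    }
  }
}
```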
New code is encouraged to use the {@link #PTBTokenizer(Reader,LexedTokenFactory,String)} constructor. The other constructors are historical. You specify the type of result tokens with a LexedTokenFactory, and can specify the treatment of tokens by mainly boolean options given in a comma-separated options String (e.g., "invertible,normalizeParentheses=true"). If the String is {@code null} or empty, you get the traditional PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you want no normalization, then you should pass in the String "ptb3Escaping=false". The known option names are:
- invertible: Store enough information about the original form of the token and the whitespace around it that a list of tokens can be faithfully converted back to the original String. Valid only if the LexedTokenFactory is an instance of CoreLabelTokenFactory. The keys used in it are: TextAnnotation for the tokenized form, OriginalTextAnnotation for the original string, BeforeAnnotation and AfterAnnotation for the whitespace before and after a token, and perhaps CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation to record the token's begin and after-end character offsets, if they were specified to be recorded in TokenFactory construction. (Like the String class, begin and end are done so end - begin gives the token length.) Default is false.
- tokenizeNLs: Whether end-of-lines should become tokens (or just be treated as part of whitespace). Default is false.
- ptb3Escaping: Enable all traditional PTB3 token transforms (like parentheses becoming -LRB-, -RRB-). This is a macro flag that sets or clears all the options below. (Default setting of the various properties below that this flag controls is equivalent to it being set to true.)
- americanize: Whether to rewrite common British English spellings as American English spellings. (This is useful if your training material uses American English spelling, such as the Penn Treebank.) Default is true.
- normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens. Default is true.
- normalizeAmpersandEntity: Whether to map the XML entity &amp;amp; to an ampersand character. Default is true.
- normalizeCurrency: Whether to do some awful lossy currency mappings to turn common currency characters into $, #, or "cents", reflecting the fact that nothing else appears in the old PTB3 WSJ. (No Euro!) Default is true.
- normalizeFractions: Whether to map certain common composed fraction characters to spelled out letter forms like "1/2". Default is true.
- normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank. Default is true.
- normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank. Default is true.
- asciiQuotes: Whether to map all quote characters to the traditional ' and ". Default is false.
- latexQuotes: Whether to map quotes to ``, `, ', '', as in LaTeX and the PTB3 WSJ (though this is now heavily frowned on in Unicode). If true, this takes precedence over the setting of unicodeQuotes; if both are false, no mapping is done. Default is true.
- unicodeQuotes: Whether to map quotes to the range U+2018 to U+201D, the preferred unicode encoding of single and double quotes. Default is false.
- ptb3Ellipsis: Whether to map ellipses to three dots (...), the old PTB3 WSJ coding of an ellipsis. If true, this takes precedence over the setting of unicodeEllipsis; if both are false, no mapping is done. Default is true.
- unicodeEllipsis: Whether to map dot and optional space sequences to U+2026, the Unicode ellipsis character. Default is false.
- ptb3Dashes: Whether to turn various dash characters into "--", the dominant encoding of dashes in the PTB3 WSJ. Default is true.
- keepAssimilations: true to tokenize "gonna", false to tokenize "gon na". Default is true.
- escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??). Default is true.
- untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
- strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases. Setting this option restores strict PTB3 behaviour in those cases. They are: (i) When an acronym is followed by a sentence end, such as "Corp." at the end of a sentence, the PTB3 has tokens of "Corp" and ".", while by default PTBTokenizer duplicates the period, returning tokens of "Corp." and ".", and (ii) PTBTokenizer will return numbers with a whole number and a fractional part like "5 7/8" as a single token, with a non-breaking space in the middle, while the PTB3 separates them into two tokens "5" and "7/8". (Exception: for only "U.S." the treebank does have the two tokens "U.S." and "." like our default; strictTreebank3 now does that too.) The default is false.
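As a sketch of how the invertible option can be used to reconstruct the original text from the Before/OriginalText/After annotations described above (assuming the CoreNLP jar is on the classpath; the sample string is invented):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class InvertibleExample {
  public static void main(String[] args) {
    String original = "They  said (quietly) \"hi there\" ...";
    PTBTokenizer<CoreLabel> tok = new PTBTokenizer<>(
        new StringReader(original), new CoreLabelTokenFactory(), "invertible");
    List<CoreLabel> tokens = tok.tokenize();

    // Rebuild the input: preceding whitespace + original form of each token,
    // plus the whitespace after the final token.
    StringBuilder rebuilt = new StringBuilder();
    for (CoreLabel t : tokens) {
      rebuilt.append(t.get(CoreAnnotations.BeforeAnnotation.class));
      rebuilt.append(t.get(CoreAnnotations.OriginalTextAnnotation.class));
    }
    if (!tokens.isEmpty()) {
      rebuilt.append(tokens.get(tokens.size() - 1)
          .get(CoreAnnotations.AfterAnnotation.class));
    }
    // With "invertible", rebuilt should equal the original String.
    System.out.println(rebuilt.toString().equals(original));
  }
}
```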
A single instance of a PTBTokenizer is not thread safe, as it uses a non-threadsafe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a PTBTokenizerFactory is also not thread safe, as it keeps its options in a local variable.
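Given these constraints, one safe pattern is simply to construct a fresh tokenizer per piece of input, so that concurrent callers never share state. A sketch (the helper method name is illustrative, not part of the API):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class ThreadSafeTokenizing {
  /** Hypothetical helper: builds a new PTBTokenizer per call, so it is
   *  safe to invoke from multiple threads concurrently. */
  static List<CoreLabel> tokenize(String text) {
    PTBTokenizer<CoreLabel> tok = new PTBTokenizer<>(
        new StringReader(text), new CoreLabelTokenFactory(), "");
    return tok.tokenize();
  }
}
```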
@author Tim Grow (his tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l)
@author Teg Grenager (grenager@stanford.edu)
@author Jenny Finkel (integrating in invertible PTB tokenizer)
@author Christopher Manning (redid API, added many options, maintenance)