A fast, rule-based tokenizer implementation, which produces Penn Treebank style tokenization of English text. It was initially written to conform to Penn Treebank tokenization conventions over ASCII text, but now provides a range of tokenization options over a broader space of Unicode text. It reads raw text and outputs tokens of classes that implement edu.stanford.nlp.ling.HasWord (typically a Word or a CoreLabel). It can optionally return end-of-lines as tokens.
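For example, a minimal tokenization loop might look like the following (a sketch, assuming the CoreNLP jar is on the classpath; the sample sentence is invented):

```java
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeExample {
  public static void main(String[] args) {
    String text = "\"Oh no,\" she said, \"our $400 blender can't handle something this hard!\"";
    // An empty options String gives the traditional PTB3 normalization behaviour.
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
        new StringReader(text), new CoreLabelTokenFactory(), "");
    while (tokenizer.hasNext()) {
      CoreLabel token = tokenizer.next();
      System.out.println(token.word());
    }
  }
}
```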
New code is encouraged to use the {@link #PTBTokenizer(Reader,LexedTokenFactory,String)} constructor. The other constructors are historical. You specify the type of result tokens with a LexedTokenFactory, and can specify the treatment of tokens by mainly boolean options given in a comma-separated options String (e.g., "invertible,normalizeParentheses=true"). If the String is {@code null} or empty, you get the traditional PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you want no normalization, then you should pass in the String "ptb3Escaping=false". The known option names are:
- invertible: Store enough information about the original form of the token and the whitespace around it that a list of tokens can be faithfully converted back to the original String. Valid only if the LexedTokenFactory is an instance of CoreLabelTokenFactory. The keys used in it are: TextAnnotation for the tokenized form, OriginalTextAnnotation for the original string, BeforeAnnotation and AfterAnnotation for the whitespace before and after a token, and perhaps CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation to record the token's begin and after-end character offsets, if they were specified to be recorded in TokenFactory construction. (Like the String class, begin and end are done so end - begin gives the token length.) Default is false.
- tokenizeNLs: Whether end-of-lines should become tokens (or just be treated as part of whitespace). Default is false.
- ptb3Escaping: Enable all traditional PTB3 token transforms (like parentheses becoming -LRB-, -RRB-). This is a macro flag that sets or clears all the options below. (Default setting of the various properties below that this flag controls is equivalent to it being set to true.)
- americanize: Whether to rewrite common British English spellings as American English spellings. (This is useful if your training material uses American English spelling, such as the Penn Treebank.) Default is true.
- normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens. Default is true.
- normalizeAmpersandEntity: Whether to map the XML entity &amp;amp; to an ampersand character. Default is true.
- normalizeCurrency: Whether to do some awful lossy currency mappings to turn common currency characters into $, #, or "cents", reflecting the fact that nothing else appears in the old PTB3 WSJ. (No Euro!) Default is true.
- normalizeFractions: Whether to map certain common composed fraction characters to spelled out letter forms like "1/2". Default is true.
- normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank. Default is true.
- normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank. Default is true.
- asciiQuotes: Whether to map all quote characters to the traditional ' and ". Default is false.
- latexQuotes: Whether to map quotes to ``, `, ', '', as in LaTeX and the PTB3 WSJ (though this is now heavily frowned on in Unicode). If true, this takes precedence over the setting of unicodeQuotes; if both are false, no mapping is done. Default is true.
- unicodeQuotes: Whether to map quotes to the range U+2018 to U+201D, the preferred unicode encoding of single and double quotes. Default is false.
- ptb3Ellipsis: Whether to map ellipses to three dots (...), the old PTB3 WSJ coding of an ellipsis. If true, this takes precedence over the setting of unicodeEllipsis; if both are false, no mapping is done. Default is true.
- unicodeEllipsis: Whether to map dot and optional space sequences to U+2026, the Unicode ellipsis character. Default is false.
- ptb3Dashes: Whether to turn various dash characters into "--", the dominant encoding of dashes in the PTB3 WSJ. Default is true.
- keepAssimilations: true to tokenize "gonna", false to tokenize "gon na". Default is true.
- escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??). Default is true.
- untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
- strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases. Setting this option restores strict PTB3 behaviour in those cases. They are: (i) When an acronym is followed by a sentence end, such as "Corp." at the end of a sentence, the PTB3 has tokens of "Corp" and ".", while by default PTBTokenizer duplicates the period, returning tokens of "Corp." and ".", and (ii) PTBTokenizer will return numbers with a whole number and a fractional part like "5 7/8" as a single token, with a non-breaking space in the middle, while the PTB3 separates them into two tokens "5" and "7/8". (Exception: for only "U.S." the treebank does have the two tokens "U.S." and "." like our default; strictTreebank3 now does that too.) The default is false.
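As a sketch of how the invertible option can be used to reconstruct the original text from the Before/OriginalText/After annotations described above (assuming the CoreNLP jar is on the classpath; the sample string is invented):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class InvertibleExample {
  public static void main(String[] args) {
    String original = "They  said (quietly) \"hi there\" ...";
    PTBTokenizer<CoreLabel> tok = new PTBTokenizer<>(
        new StringReader(original), new CoreLabelTokenFactory(), "invertible");
    List<CoreLabel> tokens = tok.tokenize();

    // Rebuild the input: preceding whitespace + original form of each token,
    // plus the whitespace after the final token.
    StringBuilder rebuilt = new StringBuilder();
    for (CoreLabel t : tokens) {
      rebuilt.append(t.get(CoreAnnotations.BeforeAnnotation.class));
      rebuilt.append(t.get(CoreAnnotations.OriginalTextAnnotation.class));
    }
    if (!tokens.isEmpty()) {
      rebuilt.append(tokens.get(tokens.size() - 1)
          .get(CoreAnnotations.AfterAnnotation.class));
    }
    // With "invertible", rebuilt should equal the original String.
    System.out.println(rebuilt.toString().equals(original));
  }
}
```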
A single instance of a PTBTokenizer is not thread safe, as it uses a non-threadsafe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a PTBTokenizerFactory is also not thread safe, as it keeps its options in a local variable.
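Given these constraints, one safe pattern is simply to construct a fresh tokenizer per piece of input, so that concurrent callers never share state. A sketch (the helper method name is illustrative, not part of the API):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class ThreadSafeTokenizing {
  /** Hypothetical helper: builds a new PTBTokenizer per call, so it is
   *  safe to invoke from multiple threads concurrently. */
  static List<CoreLabel> tokenize(String text) {
    PTBTokenizer<CoreLabel> tok = new PTBTokenizer<>(
        new StringReader(text), new CoreLabelTokenFactory(), "");
    return tok.tokenize();
  }
}
```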
@author Tim Grow (his tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l)
@author Teg Grenager (grenager@stanford.edu)
@author Jenny Finkel (integrating in invertible PTB tokenizer)
@author Christopher Manning (redid API, added many options, maintenance)