Examples of edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper

edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper

An Escaper for Chinese normalization to match Treebank. Currently normalizes "ASCII" characters into the full-width range used inside the Penn Chinese Treebank.
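
As a rough illustration of what "normalizing ASCII into the full-width range" means, the sketch below shifts each printable ASCII character (U+0021 to U+007E) by 0xFEE0 into the Unicode Halfwidth and Fullwidth Forms block (U+FF01 to U+FF5E). The class name `FullWidthSketch` and the method `toFullWidth` are hypothetical and are not part of Stanford NLP; this is not ChineseEscaper's actual implementation, which may handle spaces and other characters differently. Here the space character and all non-ASCII characters (including smart quotes) are assumed to pass through unchanged.

```java
// Hypothetical stand-alone sketch; NOT the Stanford NLP implementation.
class FullWidthSketch {

    // Distance between printable ASCII (U+0021..U+007E)
    // and its full-width counterpart (U+FF01..U+FF5E).
    private static final int OFFSET = 0xFEE0;

    // Returns a copy of the input with printable ASCII mapped to full-width forms.
    // Assumption: spaces and non-ASCII characters are left as they are.
    static String toFullWidth(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x21 && c <= 0x7E) {
                out.append((char) (c + OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the full-width rendering of an ASCII token.
        System.out.println(toFullWidth("CTB-5"));
    }
}
```

A real escaper would apply a mapping like this to every word in a tokenized sentence before parsing, so that input tokens match the character forms seen in the treebank.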

Notes: Smart quotes appear in CTB and are left unchanged. I think you get various hyphen types from the U+2000 range too; certainly, Roger lists them in LanguagePack.

@author Christopher Manning


    // variables needed to process the files to be parsed
    TokenizerFactory<Word> tokenizerFactory = null;
    // DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor();
    boolean tokenized = false; // whether or not the input file has already been tokenized
    Function<List<HasWord>, List<HasWord>> escaper = new ChineseEscaper();
    // int tagDelimiter = -1;
    // String sentenceDelimiter = "\n";
    // boolean fromXML = false;
    int argIndex = 0;
    if (args.length < 1) {


