ColumnDocumentReaderAndWriter
training data is three-column input, with the columns containing a word, its POS tag, and its gold class, but this layout can be specified via the map property. When run on a file with -textFile, the file is assumed to be plain English text (or perhaps simple HTML/XML), and a reasonable attempt at English tokenization is made by {@link PlainTextDocumentReaderAndWriter}. The class used to read the text can be changed with -plainTextDocumentReaderAndWriter. Extra options can be supplied to the tokenizer using the -tokenizeOptions flag.
To read from stdin, use the flag -readStdin. The same reader/writer will be used as for -textFile.
Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
To train with multiple files:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFileList file1,file2,... -testFile testFile -macro > output
To test on multiple files, use the -testFiles option and a comma-separated list.
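For example (a sketch in the same style as the commands above; the classifier and file names are placeholders):

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -testFiles testFile1,testFile2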
Features are defined by a {@link edu.stanford.nlp.sequences.FeatureFactory}. {@link NERFeatureFactory} is used by default, and you should look there for feature templates and for the properties or flags that will cause certain features to be used when training an NER classifier. There are also various feature factories for Chinese word segmentation, such as {@link edu.stanford.nlp.wordseg.ChineseSegmenterFeatureFactory}. Features are specified either by a Properties file (the recommended method) or by flags on the command line. The flags are read into a {@link SeqClassifierFlags} object, which the user need not be concerned with unless wishing to add new features.

CRFClassifier may also be used programmatically. When creating a new instance, you must specify a Properties object. You may then call train methods to train a classifier, or load a classifier. The other way to get a CRFClassifier is to deserialize one via the static {@link CRFClassifier#getClassifier(String)} methods, which return a deserialized classifier. You may then tag (classify the items of) documents using the assorted classify() methods in {@link AbstractSequenceClassifier}. Probabilities assigned by the CRF can be interrogated using either the printProbsDocument() or getCliqueTrees() methods.
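The programmatic usage described above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the model file name and the input sentence are assumptions, and the CoreNLP jar and a serialized CRF model must be on the classpath.

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class CRFClassifierDemo {
  public static void main(String[] args) throws Exception {
    // Deserialize a trained classifier via the static getClassifier() method.
    // The model path below is an assumption; substitute any serialized CRF model.
    CRFClassifier<CoreLabel> classifier =
        CRFClassifier.getClassifier("english.all.3class.distsim.crf.ser.gz");

    // Tag plain text with one of the inherited classify methods;
    // classifyToString() returns the text with word/LABEL annotations.
    String tagged = classifier.classifyToString("Jenny Finkel works at Stanford.");
    System.out.println(tagged);
  }
}
```

The same classifier instance can be reused across many documents, so loading the (comparatively expensive) serialized model once and calling the classify methods repeatedly is the usual pattern.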
@author Jenny Finkel
@author Sonal Gupta (made the class generic)
@author Mengqiu Wang (LOP implementation and non-linear CRF implementation)

TODO(mengqiu) need to move the embedding lookup and capitalization features into a FeatureFactory