ColumnDocumentReaderAndWriter
training data is three-column input, with the columns containing a word, its POS tag, and its gold class, but this layout can be specified via the map property. When run on a file with -textFile, the file is assumed to be plain English text (or perhaps simple HTML/XML), and a reasonable attempt at English tokenization is made by {@link PlainTextDocumentReaderAndWriter}. The class used to read the text can be changed with -plainTextDocumentReaderAndWriter. Extra options can be supplied to the tokenizer using the -tokenizeOptions flag.
To read from stdin, use the flag -readStdin. The same reader/writer will be used as for -textFile.
Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
To train with multiple files:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFileList file1,file2,... -testFile testFile -macro > output
To test on multiple files, use the -testFiles option and a comma-separated list.
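For example (a sketch in the same style as the commands above; the classifier and file names are placeholders):

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -testFiles testFile1,testFile2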
Features are defined by a {@link edu.stanford.nlp.sequences.FeatureFactory}. {@link NERFeatureFactory} is used by default, and you should look there for feature templates and for the properties or flags that will cause certain features to be used when training an NER classifier. There are also various feature factories for Chinese word segmentation, such as {@link edu.stanford.nlp.wordseg.ChineseSegmenterFeatureFactory}. Features are specified either by a Properties file (the recommended method) or by flags on the command line. The flags are read into a {@link SeqClassifierFlags} object, which the user need not be concerned with unless wishing to add new features.

CRFClassifier may also be used programmatically. When creating a new instance, you must specify a Properties object. You may then call train methods to train a classifier, or load a classifier. The other way to get a CRFClassifier is to deserialize one via the static {@link CRFClassifier#getClassifier(String)} methods, which return a deserialized classifier. You may then tag (classify the items of) documents using the assorted classify() methods in {@link AbstractSequenceClassifier}. Probabilities assigned by the CRF can be interrogated using either the printProbsDocument() or getCliqueTrees() methods.
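The programmatic usage described above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the model file name and the input sentence are assumptions, and the CoreNLP jar and a serialized CRF model must be on the classpath.

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class CRFClassifierDemo {
  public static void main(String[] args) throws Exception {
    // Deserialize a trained classifier via the static getClassifier() method.
    // The model path below is an assumption; substitute any serialized CRF model.
    CRFClassifier<CoreLabel> classifier =
        CRFClassifier.getClassifier("english.all.3class.distsim.crf.ser.gz");

    // Tag plain text with one of the inherited classify methods;
    // classifyToString() returns the text with word/LABEL annotations.
    String tagged = classifier.classifyToString("Jenny Finkel works at Stanford.");
    System.out.println(tagged);
  }
}
```

The same classifier instance can be reused across many documents, so loading the (comparatively expensive) serialized model once and calling the classify methods repeatedly is the usual pattern.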
@author Jenny Finkel
@author Sonal Gupta (made the class generic)
@author Mengqiu Wang (LOP implementation and non-linear CRF implementation)

TODO(mengqiu) need to move the embedding lookup and capitalization features into a FeatureFactory