Examples of edu.stanford.nlp.tagger.maxent.MaxentTagger

edu.stanford.nlp.tagger.maxent.MaxentTagger

The main class for users to run, train, and test the part of speech tagger. You can tag things through the Java API or from the command line. The two English taggers included in this distribution are:

A bi-directional dependency network tagger in {@code edu/stanford/nlp/models/pos-tagger/english-left3words/english-bidirectional-distsim.tagger}. Its accuracy was 97.32% on Penn Treebank WSJ secs. 22-24.
A model using only left second-order sequence information and similar but fewer unknown-word and lexical features than the previous model, in {@code edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger}. This tagger runs a lot faster and is recommended for general use. Its accuracy was 96.92% on Penn Treebank WSJ secs. 22-24.

Using the Java API

A MaxentTagger can be made with a constructor taking as argument the location of the parameter files for a trained tagger:

MaxentTagger tagger = new MaxentTagger("models/left3words-wsj-0-18.tagger");

A default path is provided for the location of the tagger on the Stanford NLP machines:

MaxentTagger tagger = new MaxentTagger(DEFAULT_NLP_GROUP_MODEL_PATH);

If you set the NLP_DATA_HOME environment variable, DEFAULT_NLP_GROUP_MODEL_PATH will instead point to the directory given in NLP_DATA_HOME.

To tag a List of HasWord and get a List of TaggedWord, you can use one of:

List<TaggedWord> taggedSentence = tagger.tagSentence(List<? extends HasWord> sentence);
List<TaggedWord> taggedSentence = tagger.apply(List<? extends HasWord> sentence);

To tag a list of sentences and get back a list of tagged sentences:

List taggedList = tagger.process(List sentences);

To tag a String of text and get back a String with tagged words:

String taggedString = tagger.tagString("Here's a tagged string.");

To tag a string of correctly tokenized, whitespace-separated words and get a string of tagged words back:

String taggedString = tagger.tagTokenizedString("Here 's a tagged string .");

The tagString method uses the default tokenizer (PTBTokenizer). If you wish to control tokenization, call {@link #tokenizeText(Reader,TokenizerFactory)} and then call process() on the result.
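The strings returned by tagString and tagTokenizedString join each word to its tag with the tagSeparator character. As a minimal sketch of how such output could be split back into pairs, the following uses only the Java standard library. The class and method names (TaggedOutputParser, WordTag, parse) are hypothetical and not part of the tagger API, and the sample input is hand-written rather than real tagger output. The separator is passed in because it depends on the tagSeparator setting.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits "word/TAG word/TAG ..." text into word/tag pairs. */
class TaggedOutputParser {

  /** One token of tagged output: the surface word and its part-of-speech tag. */
  static final class WordTag {
    final String word;
    final String tag;

    WordTag(String word, String tag) {
      this.word = word;
      this.tag = tag;
    }

    @Override
    public String toString() {
      return word + " -> " + tag;
    }
  }

  /** Parses whitespace-separated tokens, each of the form word + separator + tag. */
  static List<WordTag> parse(String tagged, char separator) {
    List<WordTag> result = new ArrayList<>();
    for (String token : tagged.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // empty input produces a single empty token
      }
      // Split on the LAST separator, since the word itself may contain it (e.g. "1/2/CD").
      int cut = token.lastIndexOf(separator);
      if (cut < 1 || cut == token.length() - 1) {
        throw new IllegalArgumentException("No word/tag pair in token: " + token);
      }
      result.add(new WordTag(token.substring(0, cut), token.substring(cut + 1)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Hand-written sample in slash-tag form, not actual tagger output.
    for (WordTag wt : parse("Here/RB 's/VBZ a/DT tagged/VBN string/NN ./.", '/')) {
      System.out.println(wt);
    }
  }
}
```

Splitting on the last separator rather than the first keeps tokens such as fractions or URLs intact when the separator also occurs inside the word.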

Using the command line

Tagging, testing, and training can all also be done via the command line.

Training from the command line

To train a model from the command line, first generate a property file:

java edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops

This gets you a default properties file with descriptions of each parameter you can set in your trained model. You can modify the properties file, or use the default options. To train, run:

java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props

with the appropriate properties file specified. Any argument you give in the properties file can also be specified on the command line. You must have specified a model using -model, either in the properties file or on the command line, as well as a file containing tagged words using -trainFile. Useful flags for controlling the amount of output are -verbose, which prints extra debugging information, and -verboseResults, which prints full information about intermediate results. -verbose defaults to false and -verboseResults defaults to true.
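A properties file can also be produced programmatically instead of editing the -genprops output by hand. The sketch below uses only java.util.Properties and sets a few of the property names listed in the table further down. The class name TrainingPropsSketch and the file names my-model.tagger and train.txt are hypothetical placeholders, and the file written this way lacks the explanatory comments that -genprops generates.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: builds a small training properties file as text. */
class TrainingPropsSketch {

  /** Collects the two required training settings plus two common optional ones. */
  static Properties trainingProps(String modelPath, String trainFile) {
    Properties props = new Properties();
    props.setProperty("model", modelPath);      // where the trained model is saved
    props.setProperty("trainFile", trainFile);  // tagged training data
    props.setProperty("tagSeparator", "/");     // separator used in the training file
    props.setProperty("encoding", "UTF-8");     // encoding of the training file
    return props;
  }

  /** Renders the properties in the standard java.util.Properties text format. */
  static String render(Properties props) {
    StringWriter out = new StringWriter();
    try {
      props.store(out, "Hypothetical tagger training properties");
    } catch (IOException e) {
      throw new UncheckedIOException(e); // not expected for an in-memory writer
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Redirect this output to a file and pass that file with -props.
    System.out.print(render(trainingProps("my-model.tagger", "train.txt")));
  }
}
```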

Tagging and Testing from the command line

Usage: For tagging (plain text):

java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -textFile <textfile>

For testing (evaluating against tagged text):

java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -testFile <testfile>

You can use the same properties file as for training if you pass it in with the "-props" argument. The most important arguments for tagging (besides "model" and "file") are "tokenize" and "tokenizerFactory". See below for more details.
Note that the tagger assumes input has not yet been tokenized and by default tokenizes it using a default English tokenizer. If your input has already been tokenized, use the flag "-tokenize false".
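To make the "-tokenize false" contract concrete: the input is then read as one sentence per line, with whitespace separating the tokens. The following standard-library sketch shows that reading of the input. It is only an illustration of the stated convention, not the tagger's own reader, and the class name PretokenizedInput is hypothetical. Skipping blank lines is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reads pre-tokenized text as one sentence per line. */
class PretokenizedInput {

  /** Returns one token list per non-blank line; tokens are split on whitespace only. */
  static List<List<String>> sentences(String text) {
    List<List<String>> result = new ArrayList<>();
    for (String line : text.split("\\R")) { // \R matches any line break
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // assumption of this sketch: blank lines carry no sentence
      }
      result.add(Arrays.asList(trimmed.split("\\s+")));
    }
    return result;
  }

  public static void main(String[] args) {
    // "Here 's" is already split, so no further tokenization is applied.
    System.out.println(sentences("Here 's a tagged string .\nIt works ."));
  }
}
```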

Parameters can be defined using a Properties file (specified on the command line with -props propFile), or directly on the command line (by preceding their name with a minus sign ("-") to turn them into a flag). The following properties are recognized:

Property Name	Type	Default Value	Relevant Phase(s)	Description
model	String	N/A	All	Path and filename where you would like to save the model (training) or where the model should be loaded from (testing, tagging).
trainFile	String	N/A	Train	Path to the file holding the training data; specifying this option puts the tagger in training mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. There are three formats possible. The first is a text file of tagged data. Each line is considered a separate sentence. In each sentence, words are separated by whitespace. Each word must have a tag, which is separated using the specified tagSeparator. This format, called TEXT, is the default format. The second format is a file of Penn Treebank formatted tree files. Trees are loaded one at a time and the tagged words in a tree are used as a training sentence. To specify this format, preface the filename with "{@code format=TREES,}". The final possible format is TSV files (tab-separated columns). To specify a TSV file, set trainFile to "{@code format=TSV,wordColumn=x,tagColumn=y,filename}". Column numbers are indexed from 0, and sentences are separated with blank lines. The default wordColumn is 0 and default tagColumn is 1. A file can be in a different encoding than the tagger's default encoding by prefacing the filename with "encoding=ENC". You can specify the tagSeparator character in a TEXT file by prefacing the filename with "tagSeparator=c". Tree files can be fed through TreeTransformers and TreeNormalizers. To specify a transformer, preface the filename with "treeTransformer=CLASSNAME". To specify a normalizer, preface the filename with "treeNormalizer=CLASSNAME". You can also filter trees using a Filter<Tree>, which can be specified with "treeFilter=CLASSNAME". A specific range of trees to be used can be specified with treeRange=X-Y. Multiple parts of the range can be separated by : as opposed to the normal separator of ,. For example, one could use the argument "-treeRange=25-50:75-100". You can specify a TreeReaderFactory by prefacing the filename with "trf=CLASSNAME". Multiple files can be specified by making a semicolon separated list of files. Each file can have its own format specifiers as above. Note that none of , ; or = can be in filenames.
testFile	String	N/A	Test	Path to the file holding the test data; specifying this option puts the tagger in testing mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. The same format as trainFile applies, but only one file can be specified.
textFile	String	N/A	Tag	Path to the file holding the text to tag; specifying this option puts the tagger in tagging mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. No file reading options may be specified for textFile.
dump	String	N/A	Dump	Path to the file holding the model to dump; specifying this option puts the tagger in dumping mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified.
genprops	boolean	N/A	N/A	Use this option to output a default properties file, containing information about each of the possible configuration options.
tagSeparator	char	/	All	Separator character that separates word and part of speech tags, such as out/IN or out_IN. For training and testing, this is the separator used in the train/test files. For tagging, this is the character that will be inserted between words and tags in the output.
encoding	String	UTF-8	All	Encoding of the read files (training, testing) and the output text files.
tokenize	boolean	true	Tag,Test	Whether or not the file needs to be tokenized. If this is false, the tagger assumes that white space separates words if and only if they should be tagged as separate tokens, and that the input is strictly one sentence per line.
tokenizerFactory	String	edu.stanford.nlp.process.PTBTokenizer	Tag,Test	Fully qualified class name of the tokenizer to use. edu.stanford.nlp.process.PTBTokenizer does basic English tokenization.
tokenizerOptions	String		Tag,Test	Known options for the particular tokenizer used. A comma-separated list. For PTBTokenizer, options of interest include `americanize=false` and `asciiQuotes` (for German). Note that any choice of tokenizer options that conflicts with the tokenization used in the tagger training data will likely degrade tagger performance.
arch	String	generic	Train	Architecture of the model, as a comma-separated list of options, some with a parenthesized integer argument written k here: this determines what features are used to build your model. See {@link ExtractorFrames} and {@link ExtractorFramesRare} for more information.
wordFunction	String	(none)	Train	A function to apply to the text before training or testing. Must inherit from edu.stanford.nlp.util.Function<String, String>. Can be blank.
lang	String	english	Train	Language from which the part of speech tags are drawn. This option determines which tags are considered closed-class (only fixed set of words can be tagged with a closed-class tag, such as prepositions). Defined languages are 'english' (Penn tagset), 'polish' (very rudimentary), 'french', 'chinese', 'arabic', 'german', and 'medline'.
openClassTags	String	N/A	Train	Space separated list of tags that should be considered open-class. All tags encountered that are not in this list are considered closed-class. E.g. format: "NN VB"
closedClassTags	String	N/A	Train	Space separated list of tags that should be considered closed-class. All tags encountered that are not in this list are considered open-class.
learnClosedClassTags	boolean	false	Train	If true, induce which tags are closed-class by counting as closed-class tags all those tags which have fewer unique word tokens than closedClassTagThreshold.
closedClassTagThreshold	int	int	Train	Number of unique word tokens that a tag may have and still be considered closed-class; relevant only if learnClosedClassTags is true.
sgml	boolean	false	Tag, Test	Very basic tagging of the contents of all sgml fields; for more complex mark-up, consider using the xmlInput option.
xmlInput	String		Tag, Test	Give a space separated list of tags in an XML file whose content you would like tagged. Any internal tags that appear in the content of fields you would like tagged will be discarded; the rest of the XML will be preserved and the original text of specified fields will be replaced with the tagged text.
outputFile	String	""	Tag	Path to write output to. If blank, stdout is used.
outputFormat	String	""	Tag	Output format. One of: slashTags (default), xml, or tsv
outputFormatOptions	String	""	Tag	Output format options.
tagInside	String	""	Tag	Tags inside elements that match the regular expression given in the String.
search	String	cg	Train	Specify the search method to be used in the optimization method for training. Options are 'cg' (conjugate gradient), 'iis' (improved iterative scaling), or 'qn' (quasi-newton).
sigmaSquared	double	0.5	Train	Sigma-squared smoothing/regularization parameter to be used for conjugate gradient search. Default usually works reasonably well.
iterations	int	100	Train	Number of iterations to be used for improved iterative scaling.
rareWordThresh	int	5	Train	Words that appear fewer than this number of times during training are considered rare words and use extra rare word features.
minFeatureThreshold	int	5	Train	Features whose history appears fewer than this number of times are discarded.
curWordMinFeatureThreshold	int	2	Train	Words that occur more than this number of times will generate features with all of the tags they've been seen with.
rareWordMinFeatureThresh	int	10	Train	Features of rare words whose histories occur fewer than this number of times are discarded.
veryCommonWordThresh	int	250	Train	Words that occur more than this number of times form an equivalence class by themselves. Ignored unless you are using ambiguity classes.
debug	boolean	boolean	All	Whether to write debugging information (words, top words, unknown words, confusion matrix). Useful for error analysis.
debugPrefix	String	N/A	All	File (path) prefix for where to write out the debugging information (relevant only if debug=true).
nthreads	int	1	Test,Tag	Number of threads to use when processing text.
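The trainFile syntax described in the table packs several things into one string: files separated by semicolons, and within each file a comma-separated list of key=value specifiers followed by the filename. The sketch below decomposes such a string using only the standard library. It follows the description in the table and is not the tagger's actual parser. The names TrainFileSpecSketch and FileSpec and the sample filenames are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: splits a trainFile value into per-file specifiers. */
class TrainFileSpecSketch {

  /** One file entry: its name plus the key=value specifiers given before it. */
  static final class FileSpec {
    final String filename;
    final Map<String, String> options;

    FileSpec(String filename, Map<String, String> options) {
      this.filename = filename;
      this.options = options;
    }

    @Override
    public String toString() {
      return filename + " " + options;
    }
  }

  static List<FileSpec> parse(String trainFile) {
    List<FileSpec> specs = new ArrayList<>();
    // Files are separated by ';', which is why ';' cannot occur in a filename.
    for (String entry : trainFile.split(";")) {
      String[] pieces = entry.split(",");
      String filename = pieces[pieces.length - 1];
      if (filename.isEmpty()) {
        throw new IllegalArgumentException("Missing filename in: " + entry);
      }
      Map<String, String> options = new LinkedHashMap<>();
      options.put("format", "TEXT"); // TEXT is the documented default format
      // Everything before the last comma-separated piece is a key=value specifier.
      for (int i = 0; i < pieces.length - 1; i++) {
        String[] kv = pieces[i].split("=", 2);
        if (kv.length != 2) {
          throw new IllegalArgumentException("Expected key=value but got: " + pieces[i]);
        }
        options.put(kv[0], kv[1]);
      }
      specs.add(new FileSpec(filename, options));
    }
    return specs;
  }

  public static void main(String[] args) {
    String value = "format=TSV,wordColumn=0,tagColumn=1,train.tsv;"
        + "format=TREES,treeRange=25-50:75-100,wsj.mrg";
    for (FileSpec spec : parse(value)) {
      System.out.println(spec);
    }
  }
}
```

Because commas separate specifiers, a multi-part treeRange uses a colon between its parts, which is why the range value survives the comma split intact.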

@author Kristina Toutanova @author Miler Lee @author Joseph Smarr @author Anna Rafferty @author Michel Galley @author Christopher Manning @author John Bauer

  void setTagger() throws Exception
  {
    if(Dictionary.tagger==null)
    {
      if(this.language.equals("en"))
        Dictionary.tagger=new MaxentTagger("./data/taggermodels/"+this.language+"/english-left3words-distsim.tagger");
      if(this.language.equals("es"))
      {
        //TODO
      }


    preprocessor.setSentenceDelimiter(config.sentenceDelimiter);
    preprocessor.setTokenizerFactory(config.tlp.getTokenizerFactory());

    Timing timer = new Timing();

    MaxentTagger tagger = new MaxentTagger(config.tagger);
    List<List<TaggedWord>> tagged = new ArrayList<>();
    for (List<HasWord> sentence : preprocessor) {
      tagged.add(tagger.tagSentence(sentence));
    }

    System.err.printf("Tagging completed in %.2f sec.%n",
        timer.stop() / 1000.0);


      }
    }


    String text = "I can almost always tell when movies use fake dinosaurs.";

    MaxentTagger tagger = new MaxentTagger(taggerPath);
    DependencyParser parser = DependencyParser.loadFromModelFile(modelPath);

    DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : tokenizer) {
      List<TaggedWord> tagged = tagger.tagSentence(sentence);
      GrammaticalStructure gs = parser.predict(tagged);

      // Print typed dependencies
      System.err.println(gs);
    }


      }
    }


    String text = "My dog likes to shake his stuffed chickadee toy.";

    MaxentTagger tagger = new MaxentTagger(taggerPath);
    ShiftReduceParser model = ShiftReduceParser.loadModel(modelPath);

    DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : tokenizer) {
      List<TaggedWord> tagged = tagger.tagSentence(sentence);
      Tree tree = model.apply(tagged);
      System.err.println(tree);
    }
  }


        throw new IllegalArgumentException("Unknown argument: " + args[argIndex]);
      }
    }


    LexicalizedParser parser = LexicalizedParser.loadModel(inputFile);
    MaxentTagger tagger = new MaxentTagger(taggerFile);
    parser.reranker = new TaggerReranker(tagger, parser.getOp());
    parser.saveParserToSerialized(outputFile);
  }


    Timing timer = null;
    if (verbose) {
      timer = new Timing();
      timer.doing("Loading POS Model [" + loc + ']');
    }
    MaxentTagger tagger = new MaxentTagger(loc);
    if (verbose) {
      timer.done();
    }
    return tagger;
  }


      System.exit(-1);
    }
    try {
      // Load MaxentTagger, which is threadsafe
      String modelFile = args[0];
      final MaxentTagger tagger = new MaxentTagger(modelFile);

      // Configure to run with 4 worker threads
      int nThreads = 4;
      MulticoreWrapper<String,String> wrapper =
          new MulticoreWrapper<String,String>(nThreads,
              new ThreadsafeProcessor<String,String>() {
                @Override
                public String process(String input) {
                  return tagger.tagString(input);
                }
                @Override
                public ThreadsafeProcessor<String, String> newInstance() {
                  // MaxentTagger is threadsafe
                  return this;

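The excerpt above shares one tagger across worker threads through the library's MulticoreWrapper. The same idea can be sketched with java.util.concurrent alone, as below. This is a stand-in and not how the library implements it: ParallelTaggingSketch and tagAll are hypothetical names, and tagFn represents any thread-safe function such as tagger::tagString. Results are collected in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: applies one shared, thread-safe function to many inputs. */
class ParallelTaggingSketch {

  /** Runs tagFn over all inputs on nThreads workers and returns results in input order. */
  static List<String> tagAll(List<String> inputs, Function<String, String> tagFn, int nThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    try {
      List<Future<String>> pending = new ArrayList<>();
      for (String input : inputs) {
        // Every task calls the same function object, so it must be thread-safe.
        Callable<String> task = () -> tagFn.apply(input);
        pending.add(pool.submit(task));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : pending) {
        results.add(future.get()); // blocks until that task finishes
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while tagging", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Tagging task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // Upper-casing stands in for a real tagging function in this demo.
    List<String> out = tagAll(Arrays.asList("first line", "second line"), String::toUpperCase, 4);
    System.out.println(out);
  }
}
```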

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("usage: java TaggerDemo2 modelFile fileToTag");
      return;
    }
    MaxentTagger tagger = new MaxentTagger(args[0]);
    TokenizerFactory<CoreLabel> ptbTokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                     "untokenizable=noneKeep");
    BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(args[1]), "utf-8"));
    PrintWriter pw = new PrintWriter(new OutputStreamWriter(System.out, "utf-8"));
    DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(r);
    documentPreprocessor.setTokenizerFactory(ptbTokenizerFactory);
    for (List<HasWord> sentence : documentPreprocessor) {
      List<TaggedWord> tSentence = tagger.tagSentence(sentence);
      pw.println(Sentence.listToString(tSentence, false));
    }

    // Print the adjectives in one more sentence. This shows how to get at words and tags in a tagged sentence.
    List<HasWord> sent = Sentence.toWordList("The", "slimy", "slug", "crawled", "over", "the", "long", ",", "green", "grass", ".");
    List<TaggedWord> taggedSent = tagger.tagSentence(sent);
    for (TaggedWord tw : taggedSent) {
      if (tw.tag().startsWith("JJ")) {
        pw.println(tw.word());
      }
    }


  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("usage: java TaggerDemo modelFile fileToTag");
      return;
    }
    MaxentTagger tagger = new MaxentTagger(args[0]);
    List<List<HasWord>> sentences = MaxentTagger.tokenizeText(new BufferedReader(new FileReader(args[1])));
    for (List<HasWord> sentence : sentences) {
      List<TaggedWord> tSentence = tagger.tagSentence(sentence);
      System.out.println(Sentence.listToString(tSentence, false));
    }
  }



  public void testEnglishTagSet() {
    LexicalizedParser lp = LexicalizedParser.loadModel(englishParsers[0]);
    Set<String> tagSet = lp.getLexicon().tagSet(lp.treebankLanguagePack().getBasicCategoryFunction());
    for (String name : englishTaggers) {
      MaxentTagger tagger = new MaxentTagger(name);
      assertEquals("English PCFG parser/" + name + " tag set mismatch", tagSet, tagger.tagSet());
    }
    for (String name : englishParsers) {
      LexicalizedParser lp2 = LexicalizedParser.loadModel(name);
      assertEquals("English PCFG parser/" + name + " tag set mismatch",
                   tagSet, lp2.getLexicon().tagSet(lp.treebankLanguagePack().getBasicCategoryFunction()));



