Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
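A minimal usage sketch of the filter follows; the input file name, the chosen option values, and the printed summary are illustrative only, not part of this class:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class StringToWordVectorExample {
  public static void main(String[] args) throws Exception {
    // Load the first batch (typically training data); its strings determine the dictionary.
    Instances data = new Instances(new BufferedReader(new FileReader("reviews.arff")));

    StringToWordVector filter = new StringToWordVector();
    // Options taken from the list documented below:
    // word counts (-C), keep roughly 500 words (-W), lowercase tokens (-L),
    // TF transform (-T) and IDF transform (-I).
    filter.setOptions(new String[] {"-C", "-W", "500", "-L", "-T", "-I"});
    filter.setInputFormat(data);

    Instances vectorized = Filter.useFilter(data, filter);
    System.out.println(vectorized.numAttributes() + " attributes created");
  }
}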
Valid options are:
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij), where fij is the frequency of word i in the jth document (instance).
-I Transform each word frequency into fij*log(num of documents/num of documents containing word i), where fij is the frequency of word i in the jth document (instance). A worked sketch of both transforms follows this option list.
-N <0|1|2> Whether to normalize word vectors to the average length of the training documents: 0 = do not normalize, 1 = normalize all data, 2 = normalize test data only. (default: 0, do not normalize)
-L Convert all tokens to lowercase before adding to the dictionary.
-S Ignore words that are in the stoplist.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-stopwords <file> A file containing stopwords to override the default ones. Using this option automatically sets the flag ('-S') to use the stoplist if the file exists. Format: one stopword per line, lines starting with '#' are interpreted as comments and ignored.
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
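For the -T and -I options above, a worked sketch of the two weighting formulas; the class name, helper methods, and sample values are hypothetical and chosen only to illustrate the arithmetic:

public class TfIdfSketch {
  // -T: log(1 + fij), where fij is the frequency of word i in document j.
  static double tfTransform(double fij) {
    return Math.log(1.0 + fij);
  }

  // -I: fij * log(numDocs / docsContainingWord), the IDF weighting.
  static double idfTransform(double fij, int numDocs, int docsContainingWord) {
    return fij * Math.log((double) numDocs / docsContainingWord);
  }

  public static void main(String[] args) {
    // A word occurring 3 times in one document, in a corpus of 100 documents
    // of which 10 contain the word.
    System.out.println(tfTransform(3));           // log(4)  ~ 1.386
    System.out.println(idfTransform(3, 100, 10)); // 3*log(10) ~ 6.908
  }
}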
@author Len Trigg (len@reeltwo.com)
@author Stuart Inglis (stuart@reeltwo.com)
@author Gordon Paynter (gordon.paynter@ucr.edu)
@author Ashraf M. Kibriya (amk14@cs.waikato.ac.nz)
@version $Revision: 5547 $
@see Stopwords