TrainSpellChecker

An instance of TrainSpellChecker provides a mechanism for collecting training data for a compiled spell checker. Training instances are simply character sequences representing likely user queries. After training, a model is written out through the Compilable interface using {@link #compileTo(ObjectOutput)}. When this model is read back in, it will be an instance of {@link CompiledSpellChecker}. Compiled spell checkers allow many runtime parameters to be tuned; see that class's documentation for full details.
In training the source language model, all training data is whitespace-normalized: initial whitespace, final whitespace, and every internal whitespace sequence are each converted to a single space character.
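The normalization just described can be illustrated with a short standalone sketch. This is not LingPipe's actual implementation, only plain Java reproducing the stated behavior:

```java
// Illustrative sketch of the whitespace normalization described above:
// leading and trailing whitespace each become a single space, and every
// internal run of whitespace collapses to a single space.
public class WhitespaceNorm {
    public static String normalize(CharSequence cs) {
        // Collapse internal whitespace runs, then add the single
        // boundary spaces on each side.
        String trimmed = cs.toString().trim().replaceAll("\\s+", " ");
        return " " + trimmed + " ";
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("John   ran\n home") + "]");
        // prints "[ John ran home ]"
    }
}
```

Note that "John&nbsp;&nbsp;&nbsp;ran\n home" normalizes to " John ran home ", with single spaces at both boundaries regardless of the input's original boundary whitespace.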
A tokenization factory may optionally be specified for training token-sensitive spell checkers. With tokenization, input is further normalized so that a single space separates every pair of adjacent tokens not already separated by a space in the input. The tokens are then output during compilation and read back into the compiled spell checker. The set of tokens output may be pruned to remove any whose count falls below a given threshold. The resulting token set constrains the alternative spellings suggested during correction to tokens observed in training.
In constructing a spell checker trainer, a compilable weighted edit distance must be specified. This edit distance model will be compiled along with the language model and token set and used as the channel model in the compiled spell checker.
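Putting these pieces together, here is a hedged sketch of constructing, training, and compiling a trainer. The class names come from LingPipe's published API (com.aliasi.spell, com.aliasi.lm, com.aliasi.tokenizer, com.aliasi.util), but exact method names, constructor signatures, and the tokenizer-factory singleton vary across LingPipe versions, so treat the details as illustrative rather than definitive; the edit-distance weights shown are arbitrary:

```java
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.spell.CompiledSpellChecker;
import com.aliasi.spell.FixedWeightEditDistance;
import com.aliasi.spell.TrainSpellChecker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class SpellTrainSketch {
    public static void main(String[] args) throws Exception {
        // Source model: a character n-gram process language model.
        NGramProcessLM lm = new NGramProcessLM(5);

        // Channel model: a compilable weighted edit distance with
        // illustrative log-scale weights in the order
        // (match, delete, insert, substitute, transpose).
        FixedWeightEditDistance dist =
            new FixedWeightEditDistance(0.0, -4.0, -4.0, -4.0, -4.0);

        // Optional tokenizer factory for a token-sensitive checker.
        TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;

        TrainSpellChecker trainer = new TrainSpellChecker(lm, dist, tf);

        // Training instances are just likely user queries.
        trainer.handle("john smith");
        trainer.handle("jon smith md");

        // Compile and read back in; AbstractExternalizable.compile
        // round-trips the model through compileTo(ObjectOutput) and
        // yields a CompiledSpellChecker.
        CompiledSpellChecker checker =
            (CompiledSpellChecker) AbstractExternalizable.compile(trainer);

        System.out.println(checker.didYouMean("jhon smth"));
    }
}
```

In production one would typically write the model to a file with compileTo(ObjectOutput) and read it back separately, rather than compiling in memory as above.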
As an alternative to using the spell checker trainer, a language model may be trained directly and supplied in compiled form, along with a weighted edit distance, to the public constructors for compiled spell checkers.

@author Bob Carpenter
@version 2.0
@since LingPipe2.0