Tokenizer for Japanese that uses morphological analysis.
This tokenizer sets a number of additional attributes:
- {@link BaseFormAttribute} containing base form for inflected adjectives and verbs.
- {@link PartOfSpeechAttribute} containing part-of-speech.
- {@link ReadingAttribute} containing reading and pronunciation.
- {@link InflectionAttribute} containing additional part-of-speech information for inflected forms.
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), it checks whether a second-best segmentation of that token exists after penalties are applied to long tokens. If so, and the Mode is {@link Mode#SEARCH}, the alternate segmentation is output as well.
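
The least-cost search and the long-token penalty described above can be illustrated with a small standalone sketch. This is not the tokenizer's actual implementation: the class name ViterbiSketch, the toy dictionary, the cost values, and the per-character penalty formula are all assumptions made for illustration. The sketch also processes the whole string at once rather than rolling over a stream, uses only per-word costs, and treats every word as Kanji for the length-2 threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy least-cost segmentation; a hypothetical illustration, not the tokenizer's code. */
public class ViterbiSketch {

    /** Assumed threshold: words longer than this many characters are penalized. */
    private static final int LONG_WORD_THRESHOLD = 2;

    /**
     * Returns the least-cost segmentation of text, or an empty list if the
     * dictionary cannot cover it. penaltyPerExtraChar is added once for every
     * character beyond LONG_WORD_THRESHOLD (an assumed penalty formula).
     */
    public static List<String> segment(String text, Map<String, Integer> wordCosts,
                                       int maxWordLength, int penaltyPerExtraChar) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: minimum cost to segment text[0, i)
        int[] back = new int[n + 1]; // back[i]: start index of the last word on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null || best[start] == Integer.MAX_VALUE) {
                    continue; // unknown word, or no path reaches 'start'
                }
                int extra = Math.max(0, (end - start) - LONG_WORD_THRESHOLD);
                int total = best[start] + cost + extra * penaltyPerExtraChar;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        if (best[n] == Integer.MAX_VALUE) {
            return Collections.emptyList();
        }
        // Walk the back pointers from the end to recover the best path.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up costs: the compound is cheaper than its three parts combined.
        Map<String, Integer> costs = Map.of(
            "関西", 300, "国際", 300, "空港", 300, "関西国際空港", 700);

        // No penalty: the single compound token wins (700 < 900).
        System.out.println(segment("関西国際空港", costs, 6, 0));
        // With a penalty on long tokens, the decompounded path wins (900 < 1500).
        System.out.println(segment("関西国際空港", costs, 6, 200));
    }
}
```

Running the search twice, once without and once with the penalty, mirrors the idea of comparing the best path against an alternate segmentation; in a search-oriented mode both results would be candidates for output.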