CJK types are set by these tokenizers, but you can also use {@link #CJKBigramFilter(TokenStream,int)} to explicitly control which of the CJK scripts are turned into bigrams.
By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams
flag in {@link CJKBigramFilter#CJKBigramFilter(TokenStream,int,boolean)}. This can be used for a combined unigram+bigram approach.
In all cases, all non-CJK input is passed through unmodified.
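The unigram and bigram behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the Lucene API: the class BigramSketch and its bigrams method are hypothetical names introduced only for illustration, the input is assumed to be a single run of adjacent CJK characters, and the order in which unigrams and bigrams are emitted is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the Lucene implementation.
class BigramSketch {

    // Takes one run of adjacent CJK characters and returns the terms a
    // bigramming pass would plausibly produce for it.
    static List<String> bigrams(String run, boolean outputUnigrams) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();

        // A lone character has no neighbor to pair with, so it is
        // emitted as a unigram regardless of the flag.
        if (cps.length == 1) {
            out.add(new String(cps, 0, 1));
            return out;
        }

        for (int i = 0; i < cps.length; i++) {
            if (outputUnigrams) {
                out.add(new String(cps, i, 1)); // the single character
            }
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2)); // overlapping pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("多くの", false)); // bigrams only
        System.out.println(bigrams("多くの", true));  // unigrams and bigrams
        System.out.println(bigrams("多", false));     // lone character
    }
}
```

With the flag off, a three-character run yields its two overlapping pairs; with the flag on, each single character is emitted as well, which is the combined unigram+bigram approach mentioned above.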