This should be a good tokenizer for most European-language documents:

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, {@link StandardTokenizer} implements Unicode text segmentation, as specified by UAX#29.
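As an illustration of the token rules listed above, the minimal sketch below runs ClassicTokenizer over a short sample string and prints each token. It assumes a Lucene release (roughly 5.x through 8.x) where ClassicTokenizer lives in org.apache.lucene.analysis.standard and has a no-argument constructor plus setReader; in Lucene 9 the class moved to org.apache.lucene.analysis.classic. The input text and class name ClassicTokenizerDemo are made up for this example.

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.ClassicTokenizer; // org.apache.lucene.analysis.classic in Lucene 9+
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ClassicTokenizerDemo {
      public static void main(String[] args) throws Exception {
        // Sample input exercising the dot, hyphen, and hostname/email rules above.
        String text = "Visit lucene.apache.org, email dev@lucene.apache.org, or try model AB-3.14-X7.";

        try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
          tokenizer.setReader(new StringReader(text));
          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

          tokenizer.reset();                      // required before the first incrementToken()
          while (tokenizer.incrementToken()) {
            System.out.println(term.toString());  // one token per line
          }
          tokenizer.end();                        // finalize end-of-stream offsets
        }                                         // try-with-resources closes the stream
      }
    }

Under these assumptions, the hostname and email address each come out as a single token, and the digit-bearing "AB-3.14-X7" is kept whole as a product number rather than split at its hyphens.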