This example shows how to cluster non-English content. By default, Carrot2 assumes that the documents provided for clustering are written in English. When clustering content written in another language, it is important to indicate that language to Carrot2, so that it can use the lexical resources (stop words, tokenizer, stemmer) appropriate for that language.
There are two ways to indicate the desired clustering language to Carrot2:
- By setting the language of each document in its {@link org.carrot2.core.Document#LANGUAGE} field. The language does not necessarily have to be the same for all input documents; Carrot2 can also handle multiple languages in one document set. Please see the {@link org.carrot2.text.clustering.MultilingualClustering#languageAggregationStrategy} attribute for more details.
- By setting the fallback language. For documents with an undefined {@link org.carrot2.core.Document#LANGUAGE} field, Carrot2 will assume a fallback language, which is English by default. You can change the fallback language by setting the {@link org.carrot2.text.clustering.MultilingualClustering#defaultLanguage} attribute.
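The interplay between the two options above can be illustrated with a minimal, self-contained sketch. It does not call the Carrot2 API: the `Doc` class, the `effectiveLanguage` and `groupByLanguage` helpers, and the use of plain strings as language codes are all hypothetical stand-ins, chosen only to show the rule that a document's own language wins and the fallback applies when none is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LanguageFallbackSketch {
    /** Hypothetical stand-in for a document; not the Carrot2 Document class. */
    static final class Doc {
        final String title;
        final String language; // null means the language was not set

        Doc(String title, String language) {
            this.title = title;
            this.language = language;
        }
    }

    /** Returns the document's own language if set, otherwise the fallback language. */
    static String effectiveLanguage(Doc doc, String fallbackLanguage) {
        return doc.language != null ? doc.language : fallbackLanguage;
    }

    /** Groups document titles by their effective language, preserving input order. */
    static Map<String, List<String>> groupByLanguage(List<Doc> docs, String fallbackLanguage) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Doc doc : docs) {
            String language = effectiveLanguage(doc, fallbackLanguage);
            groups.computeIfAbsent(language, key -> new ArrayList<>()).add(doc.title);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("Data mining", null),               // language undefined
                new Doc("Eksploracja danych", "POLISH"),    // language set explicitly
                new Doc("Text clustering", null));          // language undefined

        // With the default fallback, documents without a language are treated as English.
        System.out.println(groupByLanguage(docs, "ENGLISH"));

        // Changing the fallback affects only the documents without an explicit language.
        System.out.println(groupByLanguage(docs, "GERMAN"));
    }
}
```

In this sketch, changing the fallback never overrides a language that was set explicitly on a document, which mirrors the behavior described for the two options.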
Additionally, a number of document sources automatically set the {@link org.carrot2.core.Document#LANGUAGE} of the documents they produce based on their specific language-related attributes. Currently, the following document sources support this scenario:
- {@link org.carrot2.source.microsoft.Bing3WebDocumentSource} through the {@link org.carrot2.source.microsoft.Bing3WebDocumentSource#market} attribute,
- {@link org.carrot2.source.etools.EToolsDocumentSource} through the {@link org.carrot2.source.etools.EToolsDocumentSource#language} attribute.
For the document sources that do not set the documents' language automatically, the easiest way to set the clustering language is through the {@link org.carrot2.text.clustering.MultilingualClustering#defaultLanguage} attribute.