Examples of edu.ucla.sspace.ri.RandomIndexing

A co-occurrence based approach to statistical semantics that uses a randomized projection of a full co-occurrence matrix to perform dimensionality reduction. This implementation is based on three papers:

M. Sahlgren, "Vector-based semantic analysis: Representing word meanings based on random labels," in Proceedings of the ESSLLI 2001 Workshop on Semantic Knowledge Acquisition and Categorisation, Helsinki, Finland, 2001.
M. Sahlgren, "An introduction to random indexing," in Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, 2005.
M. Sahlgren, A. Holst, and P. Kanerva, "Permutations as a means to encode order in word space," in Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci’08), 2008.

Random Indexing (RI) is an efficient way of capturing word co-occurrence. In most co-occurrence models, a word-by-word matrix is constructed, where each value denotes how many times the column's word occurred in the context of the row's word. RI instead represents co-occurrence through index vectors. Each word is assigned a high-dimensional random vector known as its index vector. These index vectors are very sparse, typically 7 ± 2 non-zero elements for a vector of length 2048, which ensures that the chance of any two arbitrary index vectors overlapping (i.e., having a non-zero cosine similarity) is very low. The semantics of each word are calculated by keeping a running sum of the index vectors of all the words that co-occur with it.
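The running-sum idea can be illustrated with a minimal, self-contained sketch. It is not the code of this class: the name RandomIndexingSketch, the choice of 8 non-zero entries, the +1/-1 values, and the fixed random seed are all assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only; all names and constants here are assumptions.
class RandomIndexingSketch {
    static final int VECTOR_LENGTH = 2048; // vector length used in the text's example
    static final int NON_ZERO = 8;         // assumed value within the 7 ± 2 range

    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> semantics = new HashMap<>();
    private final Random random = new Random(42); // fixed seed for repeatability
    private final int windowSize;

    RandomIndexingSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    // Returns the word's sparse index vector, creating it on first use.
    int[] indexVector(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[VECTOR_LENGTH];
            int placed = 0;
            while (placed < NON_ZERO) {
                int pos = random.nextInt(VECTOR_LENGTH);
                if (v[pos] == 0) {
                    v[pos] = (placed % 2 == 0) ? 1 : -1; // alternate +1 and -1
                    placed++;
                }
            }
            return v;
        });
    }

    // For each token, adds the index vectors of its window neighbors
    // to that token's semantic vector.
    void process(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            int[] sem = semantics.computeIfAbsent(tokens.get(i), w -> new int[VECTOR_LENGTH]);
            int start = Math.max(0, i - windowSize);
            int end = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = start; j <= end; j++) {
                if (j == i) continue; // a word is not its own context
                int[] index = indexVector(tokens.get(j));
                for (int d = 0; d < VECTOR_LENGTH; d++) {
                    sem[d] += index[d];
                }
            }
        }
    }

    // Returns the current semantic vector, or null if the word was never seen.
    int[] semanticVector(String word) {
        return semantics.get(word);
    }

    public static void main(String[] args) {
        RandomIndexingSketch ri = new RandomIndexingSketch(2);
        ri.process(Arrays.asList("the quick brown fox jumps over the lazy dog".split(" ")));
        long nonZero = Arrays.stream(ri.semanticVector("fox")).filter(x -> x != 0).count();
        System.out.println("non-zero dimensions for fox: " + nonZero);
    }
}
```

Because no word-by-word matrix is ever built, memory grows with the number of distinct words times the vector length rather than with the square of the vocabulary size.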

Sahlgren et al. (2008) introduced another variation on RI, where the semantics also capture word order by using a permutation function. For each occurrence of a word, rather than simply summing the index vectors of the co-occurring words, the permutation function is used to transform the index vector of each co-occurring word based on its position relative to the focus word. For example, consider the sentence, "the quick brown fox jumps over the lazy dog." With a window size of 2, the following is added to the semantic vector for "fox": Π^-2(quick_index) + Π^-1(brown_index) + Π^1(jumps_index) + Π^2(over_index), where Π^k denotes the {@code k}th permutation of the specified index vector.
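A small sketch may make the position-dependent permutation concrete. It assumes that one application of Π is a circular shift by one position and that a negative exponent shifts in the opposite direction; this is one simple choice of permutation and is not necessarily what DefaultPermutationFunction does. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only; the rotation-based permutation is an assumption.
class PermutationSketch {

    // Applies the permutation k times. Here a single application is a
    // circular shift by one position; negative k shifts the other way.
    static int[] permute(int[] v, int k) {
        int n = v.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[Math.floorMod(i + k, n)] = v[i];
        }
        return out;
    }

    // Adds the permuted index vector of every neighbor within the window to
    // the semantic vector of the token at position 'focus'. The neighbor at
    // relative offset k contributes permute(index, k).
    static void addOrdered(int[] semantic, int[][] tokenIndexVectors, int focus, int windowSize) {
        for (int offset = -windowSize; offset <= windowSize; offset++) {
            int j = focus + offset;
            if (offset == 0 || j < 0 || j >= tokenIndexVectors.length) continue;
            int[] permuted = permute(tokenIndexVectors[j], offset);
            for (int d = 0; d < semantic.length; d++) {
                semantic[d] += permuted[d];
            }
        }
    }

    public static void main(String[] args) {
        // Toy index vectors of length 4 for a three-token sequence.
        int[][] tokens = { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0} };
        int[] semantic = new int[4];
        addOrdered(semantic, tokens, 1, 1);
        System.out.println(Arrays.toString(semantic));
    }
}
```

Since the same index vector lands in different dimensions depending on whether the word appeared before or after the focus word, two words with the same neighbors in a different order end up with different semantic vectors.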

This class defines the following configurable properties, which may be set using either the System properties or the {@link RandomIndexing#RandomIndexing(Properties)} constructor.

Property: {@value #WINDOW_SIZE_PROPERTY} Default: {@value #DEFAULT_WINDOW_SIZE}: This property sets the number of words before and after the focus word that are counted as co-occurring. With the default value, {@value #DEFAULT_WINDOW_SIZE} words are counted before and {@value #DEFAULT_WINDOW_SIZE} words are counted after. This class always uses a symmetric window.
Property: {@value #VECTOR_LENGTH_PROPERTY} Default: {@value #DEFAULT_VECTOR_LENGTH}: This property sets the number of dimensions to be used for the index and semantic vectors.
Property: {@value #USE_PERMUTATIONS_PROPERTY} Default: {@code false}: This property specifies whether to enable permuting the index vectors of co-occurring words. Enabling this option will cause the word semantics to include word-ordering information. However, this option is best used with a larger corpus.
Property: {@value #PERMUTATION_FUNCTION_PROPERTY} Default: {@link edu.ucla.sspace.index.DefaultPermutationFunction DefaultPermutationFunction}: This property specifies the fully qualified class name of a {@link PermutationFunction} instance that will be used to permute index vectors. If the {@value #USE_PERMUTATIONS_PROPERTY} property is set to {@code false}, the value of this property has no effect.
Property: {@value #USE_SPARSE_SEMANTICS_PROPERTY} Default: {@code true}: This property specifies whether to use a sparse encoding for each word's semantics. Using a sparse encoding can result in a large saving in memory, while requiring more time to process each document.
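The two configuration routes mentioned above (System properties or an explicit Properties object) can be sketched with plain java.util.Properties. The keys below are hypothetical placeholders, since the real keys are the values of the constants listed above, and the fallback window size is likewise a placeholder rather than this class's default. The class RiPropertiesSketch and its helpers are not part of the library.

```java
import java.util.Properties;

// Illustrative sketch only; keys and fallback values are placeholders.
class RiPropertiesSketch {
    // Hypothetical keys; the real ones are the values of
    // WINDOW_SIZE_PROPERTY and USE_PERMUTATIONS_PROPERTY.
    static final String WINDOW_SIZE_KEY = "sketch.ri.windowSize";
    static final String USE_PERMUTATIONS_KEY = "sketch.ri.usePermutations";

    // Reads an integer property, falling back to the given value when unset.
    static int intProperty(Properties props, String key, int fallback) {
        String value = props.getProperty(key);
        return (value == null) ? fallback : Integer.parseInt(value.trim());
    }

    // Reads a boolean property, falling back to the given value when unset.
    static boolean booleanProperty(Properties props, String key, boolean fallback) {
        String value = props.getProperty(key);
        return (value == null) ? fallback : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        // Route 1: an explicit Properties object, as would be passed to a constructor.
        Properties explicit = new Properties();
        explicit.setProperty(WINDOW_SIZE_KEY, "3");
        explicit.setProperty(USE_PERMUTATIONS_KEY, "true");

        // Route 2: JVM-wide system properties, e.g. set with -Dkey=value.
        Properties system = System.getProperties();

        int placeholderWindow = 2; // placeholder fallback, not the library's default
        System.out.println(intProperty(explicit, WINDOW_SIZE_KEY, placeholderWindow));
        System.out.println(booleanProperty(system, USE_PERMUTATIONS_KEY, false));
    }
}
```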

This class implements {@link Filterable}, which allows for fine-grained control of which semantics are retained. The {@link #setSemanticFilter(Set)} method can be used to specify which words should have their semantics retained. Note that the words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with a large corpus where retaining the semantics of all words in memory is infeasible.
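The filtering behavior can be sketched as follows. To keep the example short, neighbor counts stand in for summed index vectors, which is a simplification. SemanticFilterSketch and its methods are hypothetical and only mirror the idea behind {@code setSemanticFilter}.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; neighbor counts stand in for index-vector sums.
class SemanticFilterSketch {
    private final Map<String, Map<String, Integer>> semantics = new HashMap<>();
    private Set<String> retained = null; // null means "retain every word"

    // Restricts which words have their semantics stored.
    void setSemanticFilter(Set<String> wordsToRetain) {
        retained = new HashSet<>(wordsToRetain);
    }

    void process(List<String> tokens, int windowSize) {
        for (int i = 0; i < tokens.size(); i++) {
            String focus = tokens.get(i);
            // Words outside the filter get no stored semantics of their own...
            if (retained != null && !retained.contains(focus)) continue;
            Map<String, Integer> counts = semantics.computeIfAbsent(focus, w -> new HashMap<>());
            int start = Math.max(0, i - windowSize);
            int end = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = start; j <= end; j++) {
                // ...but they still contribute as context for retained words.
                if (j != i) counts.merge(tokens.get(j), 1, Integer::sum);
            }
        }
    }

    Set<String> wordsWithSemantics() {
        return semantics.keySet();
    }

    Map<String, Integer> semanticsOf(String word) {
        return semantics.get(word);
    }

    public static void main(String[] args) {
        SemanticFilterSketch sketch = new SemanticFilterSketch();
        sketch.setSemanticFilter(new HashSet<>(Arrays.asList("fox")));
        sketch.process(Arrays.asList("the quick brown fox jumps over the lazy dog".split(" ")), 2);
        System.out.println(sketch.wordsWithSemantics());
        System.out.println(sketch.semanticsOf("fox"));
    }
}
```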

This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. At any given point in processing, the {@link #getVectorFor(String) getVectorFor} method may be used to access the current semantics of a word. This allows callers to track incremental changes to the semantics as the corpus is processed.

The {@link #processSpace(Properties) processSpace} method does nothing for this class, and calls to it will not affect the results of {@code getVectorFor}.

@see PermutationFunction
@see IndexVectorGenerator
@author David Jurgens
