Random Indexing (RI) is an efficient way of capturing word co-occurrence. In most co-occurrence models, a word-by-word matrix is constructed, where each value denotes how many times the column's word occurred in the context of the row's word. RI instead represents co-occurrence through index vectors. Each word is assigned a high-dimensional, random vector that is known as its index vector. These index vectors are very sparse, typically 7 ± 2 non-zero elements for a vector of length 2048, which ensures that the chance of any two arbitrary index vectors overlapping (i.e. having a non-zero cosine similarity) is very low. The semantics of each word are calculated by keeping a running sum of the index vectors of all the words that co-occur with it.
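The running-sum scheme described above can be sketched in a few lines of Java. This is a minimal illustration and not the implementation used by this class: the class name {@code RandomIndexingSketch}, its method names, the fixed seed, the choice of exactly 8 non-zero elements, and the use of +1/-1 values are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Illustrative sketch only; every name here is hypothetical. */
class RandomIndexingSketch {
    static final int VECTOR_LENGTH = 2048; // length quoted in the text
    static final int NON_ZERO = 8;         // assumed; the text gives 7 ± 2

    // Fixed seed is an assumption, used only to make runs repeatable.
    private final Random rng = new Random(42L);
    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> semantics = new HashMap<>();

    /** Returns the word's sparse index vector, creating it on first use. */
    int[] indexVectorFor(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[VECTOR_LENGTH];
            int placed = 0;
            while (placed < NON_ZERO) {
                int pos = rng.nextInt(VECTOR_LENGTH);
                if (v[pos] == 0) {
                    // +1 or -1 values are an assumption of this sketch.
                    v[pos] = rng.nextBoolean() ? 1 : -1;
                    placed++;
                }
            }
            return v;
        });
    }

    /** Adds the index vector of every word in the window to the focus word's sum. */
    void process(String[] tokens, int windowSize) {
        for (int i = 0; i < tokens.length; i++) {
            int[] sum = semantics.computeIfAbsent(
                tokens[i], w -> new int[VECTOR_LENGTH]);
            int start = Math.max(0, i - windowSize);
            int end = Math.min(tokens.length - 1, i + windowSize);
            for (int j = start; j <= end; j++) {
                if (j == i) {
                    continue; // a word is not its own context
                }
                int[] index = indexVectorFor(tokens[j]);
                for (int d = 0; d < VECTOR_LENGTH; d++) {
                    sum[d] += index[d];
                }
            }
        }
    }

    /** Returns the running sum for the word, or null if it was never seen. */
    int[] semanticVectorFor(String word) {
        return semantics.get(word);
    }
}
```

Because each index vector touches only a handful of dimensions, two words end up with similar sums only when they repeatedly share context words, which is the property the sparsity is meant to protect.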
Sahlgren et al. (2008) introduced a variation on RI in which the semantics also capture word order by using a permutation function. For each occurrence of a word, rather than simply summing the index vectors of the co-occurring words, the permutation function first transforms each co-occurring word's index vector based on its position relative to the focus word. For example, consider the sentence, "the quick brown fox jumps over the lazy dog." With a window size of 2, the semantic vector for "fox" is incremented by Π^-2(quick_index) + Π^-1(brown_index) + Π^1(jumps_index) + Π^2(over_index), where Π^{@code k} denotes the {@code k}th permutation of the specified index vector.
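The position-dependent sum can be illustrated with a small Java sketch. It assumes that the {@code k}th permutation is a rotation of the vector by {@code k} positions, which is only one possible choice; the class {@code PermutationSketch} and its methods are hypothetical names used for illustration and are not part of this class's API.

```java
/** Illustrative sketch only; rotation stands in for the permutation function. */
class PermutationSketch {

    /** Rotates v by k positions; a negative k rotates in the other direction. */
    static int[] permute(int[] v, int k) {
        int n = v.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            // floorMod keeps the target index non-negative for negative k.
            out[Math.floorMod(i + k, n)] = v[i];
        }
        return out;
    }

    /**
     * Sums the permuted index vectors of the words around the focus position.
     * indexVectors[j] is the index vector of the j-th token in the sentence.
     */
    static int[] orderedContext(int[][] indexVectors, int focus, int windowSize) {
        int length = indexVectors[focus].length;
        int[] sum = new int[length];
        for (int offset = -windowSize; offset <= windowSize; offset++) {
            int j = focus + offset;
            if (offset == 0 || j < 0 || j >= indexVectors.length) {
                continue; // skip the focus word and positions outside the sentence
            }
            int[] permuted = permute(indexVectors[j], offset);
            for (int d = 0; d < length; d++) {
                sum[d] += permuted[d];
            }
        }
        return sum;
    }
}
```

With this scheme, swapping the words on either side of the focus word generally changes the resulting sum, which is how word order ends up encoded in the semantics.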
This class defines the following configurable properties, which may be set using either the System properties or the {@link RandomIndexing#RandomIndexing(Properties)} constructor.
{@value #WINDOW_SIZE_PROPERTY}
{@value #VECTOR_LENGTH_PROPERTY}
{@value #USE_PERMUTATIONS_PROPERTY}
{@value #PERMUTATION_FUNCTION_PROPERTY}
{@value #USE_SPARSE_SEMANTICS_PROPERTY}
This class implements {@link Filterable}, which allows for fine-grained control of which semantics are retained. The {@link #setSemanticFilter(Set)} method can be used to specify which words should have their semantics retained. Note that the words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with large corpora, where retaining the semantics of all words in memory is infeasible.
This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. At any given point in processing, the {@link #getVectorFor(String) getVectorFor} method may be used to access the current semantics of a word. This allows callers to track incremental changes to the semantics as the corpus is processed.
The {@link #processSpace(Properties) processSpace} method does nothing for this class, and calls to it will not affect the results of {@code getVectorFor}.
@see PermutationFunction
@see IndexVectorGenerator
@author David Jurgens