A co-occurrence based approach to statistical semantics that uses dependency parse trees and approximates a full co-occurrence matrix by using a randomized projection. This implementation is an extension of {@link edu.ucla.sspace.ri.RandomIndexing}, which is based on three papers:
- M. Sahlgren, "Vector-based semantic analysis: Representing word meanings based on random labels," in Proceedings of the ESSLLI 2001 Workshop on Semantic Knowledge Acquisition and Categorisation, Helsinki, Finland, 2001.
- M. Sahlgren, "An introduction to random indexing," in Proceedings of the Methods and Applicatons of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, 2005.
- M. Sahlgren, A. Holst, and P. Kanerva, "Permutations as a means to encode order in word space," in Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci’08), 2008.
The technique for incorporating dependency parse trees is based on the following paper:
- S. Pado and M. Lapata, "Dependency-Based Construction of Semantic Space Models," Computational Linguistics, 2007.

Dependency Random Indexing (DRI) extends Random Indexing by restricting a word's context to the set of words with which it has a syntactic relationship. Full word co-occurrence models have shown that this restricted interpretation of a context can improve the semantic representations. DRI uses the same approximation technique as Random Indexing to project this full co-occurrence space into a significantly smaller dimensional space. This projection is done through the use of index vectors, each of which is sparse and nearly orthogonal to all other index vectors. A word's semantic vector is the summation of the index vectors of the words that occur in its contexts.
While Random Indexing uses permutations of these index vectors to encode lexical position, a shallow form of syntactic structure, DRI extends the notion of permutations to allow for the encoding of dependency relationships. Through this modification, the set of relationships between any two co-occurring words in a sentence can be encoded, as can the distance between the two words. Under this model, each possible dependency relationship could have its own permutation function, as could each possible distance between co-occurring words.
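To make the mechanism concrete, the following self-contained sketch illustrates the general idea in miniature: each word receives a sparse random index vector, a permutation keyed on the dependency relation and path length reorders a co-occurring word's index vector, and the permuted vector is summed into the focus word's semantic vector. All class, method, and variable names below (DriSketch, observe, permute, etc.) are illustrative assumptions, not the API of this class or of {@link edu.ucla.sspace.ri.RandomIndexing}; a simple rotation stands in for an arbitrary relation-specific permutation function.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class DriSketch {

    private static final int DIMENSIONS = 10;   // tiny, for readability
    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> semanticVectors = new HashMap<>();
    private final Random rand = new Random(42);

    // Returns (creating if needed) a sparse ternary index vector for a word.
    private int[] indexVectorFor(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[DIMENSIONS];
            // Two random +1/-1 entries; the rest stay 0, so index vectors are
            // sparse and nearly orthogonal to one another.
            for (int i = 0; i < 2; i++)
                v[rand.nextInt(DIMENSIONS)] = rand.nextBoolean() ? 1 : -1;
            return v;
        });
    }

    // A stand-in relation-specific permutation: rotate the vector by an amount
    // derived from the relation name and the path length.
    private int[] permute(int[] v, String relation, int pathLength) {
        int shift = Math.floorMod(relation.hashCode() + pathLength, DIMENSIONS);
        int[] out = new int[DIMENSIONS];
        for (int i = 0; i < DIMENSIONS; i++)
            out[(i + shift) % DIMENSIONS] = v[i];
        return out;
    }

    // Records that head co-occurs with dep via the given dependency relation.
    public void observe(String head, String dep, String relation, int pathLength) {
        int[] permuted = permute(indexVectorFor(dep), relation, pathLength);
        int[] semantics = semanticVectors.computeIfAbsent(
                head, w -> new int[DIMENSIONS]);
        for (int i = 0; i < DIMENSIONS; i++)
            semantics[i] += permuted[i];   // sum of permuted index vectors
    }

    public static void main(String[] args) {
        DriSketch sketch = new DriSketch();
        // "dog" occurs as the subject of "barks", one dependency edge away.
        sketch.observe("barks", "dog", "nsubj", 1);
        System.out.println(java.util.Arrays.toString(
                sketch.semanticVectors.get("barks")));
    }
}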
This class defines the following configurable properties that may be set using either the System properties or the {@link DependencyRandomIndexing#DependencyRandomIndexing(DependencyExtractor,DependencyPermutationFunction,Properties)} constructor.
- Property:
{@value #DEPENDENCY_ACCEPTOR_PROPERTY}
Default: {@link UniversalRelationAcceptor} - This property sets the {@link DependencyRelationAcceptor} to use for validating dependency paths. If a path is rejected it will not influence either the lemma vector or the selectional preference vectors.
- Property:
{@value #DEPENDENCY_PATH_LENGTH_PROPERTY}
Default: {@value #DEFAULT_DEPENDENCY_PATH_LENGTH} - This property sets the maximum length a dependency path may have in order to be accepted. Paths beyond this length will not contribute towards either the lemma vectors or selectional preference vectors.
- Property:
{@value #VECTOR_LENGTH_PROPERTY}
Default: {@value #DEFAULT_VECTOR_LENGTH} - This property sets the number of dimensions in the word space.
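As a hedged configuration sketch, the snippet below sets two of these properties programmatically and passes them to the constructor named above. The concrete {@code DependencyExtractor} and {@code DependencyPermutationFunction} implementations are taken as parameters because they depend on the corpus format; the method name {@code configure}, the chosen values, and the omitted imports are illustrative assumptions.

static DependencyRandomIndexing configure(DependencyExtractor extractor,
                                          DependencyPermutationFunction permFunc) {
    Properties props = new Properties(System.getProperties());
    // Use a 4,000-dimensional word space rather than the default length.
    props.setProperty(DependencyRandomIndexing.VECTOR_LENGTH_PROPERTY, "4000");
    // Reject any dependency path longer than two relations.
    props.setProperty(
        DependencyRandomIndexing.DEPENDENCY_PATH_LENGTH_PROPERTY, "2");
    return new DependencyRandomIndexing(extractor, permFunc, props);
}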
This class implements {@link Filterable}, which allows for fine-grained control of which semantics are retained. The {@link #setSemanticFilter(Set)} method can be used to specify which words should have their semantics retained. Note that words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with large corpora where retaining the semantics of all words in memory is infeasible.

This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. At any given point in processing, the {@link #getVectorFor(String) getVector} method may be used to access the current semantics of a word. This allows callers to track incremental changes to the semantics as the corpus is processed.

The {@link #processSpace(Properties) processSpace} method does nothing for this class and calls to it will not affect the results of {@code getVectorFor}.
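A brief usage sketch of the filtering and processing methods described above follows. The file name, the tracked words, and the enclosing method are illustrative; the standard {@code java.io} and {@code java.util} imports are omitted, and it is assumed that {@code processDocument} may throw an {@code IOException} and that the input file is already in the dependency-parsed format expected by the configured {@code DependencyExtractor}.

static void processCorpus(DependencyRandomIndexing dri) throws IOException {
    // Retain in-memory semantics for only a small set of words of interest;
    // filtered-out words still contribute context to these words' vectors.
    Set<String> wordsToTrack = new HashSet<>(Arrays.asList("dog", "cat"));
    dri.setSemanticFilter(wordsToTrack);

    // processDocument may be called concurrently from multiple threads.
    try (BufferedReader parsedDoc =
             new BufferedReader(new FileReader("parsed-corpus.conll"))) {
        dri.processDocument(parsedDoc);
    }

    // The current semantics may be inspected at any point during processing;
    // processSpace is a no-op for this class.
    System.out.println(dri.getVectorFor("dog"));
}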
@see RandomIndexing
@see DependencyPermutationFunction
@author Keith Stevens