This model requires a dependency-parsed corpus. During processing, it builds three types of vectors: word, which represents the co-occurrences that word has with all other tokens via a dependency chain; REL|word, which records the set of tokens that govern the REL relationship with word; and word|REL, which records the set of tokens that are governed by word in the REL relationship. The first vector is referred to as a lemma vector, and the latter two are called selectional preference vectors. In all cases, REL is a dependency relationship.
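The three vector types can be illustrated with a minimal, self-contained sketch. It is not this class's implementation: the class name `SelectionalPreferenceSketch`, the method `addLink`, and the sparse count maps are hypothetical, and only direct dependency links are counted here rather than longer dependency chains.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SelectionalPreferenceSketch {
    // Maps a vector key ("dog", "nsubj|dog", "barks|nsubj") to sparse feature counts.
    private final Map<String, Map<String, Integer>> vectors = new HashMap<>();

    private void count(String key, String feature) {
        vectors.computeIfAbsent(key, k -> new HashMap<>()).merge(feature, 1, Integer::sum);
    }

    /** Records one dependency link in which head governs dependent through rel. */
    public void addLink(String head, String rel, String dependent) {
        // Lemma vectors: each word co-occurs with the word at the other end of the link.
        // (Assumption: chains longer than one link are ignored in this sketch.)
        count(head, dependent);
        count(dependent, head);
        // REL|word: the tokens that govern word in the REL relationship.
        count(rel + "|" + dependent, head);
        // word|REL: the tokens that word governs in the REL relationship.
        count(head + "|" + rel, dependent);
    }

    /** Returns the counts stored under key, or an empty map if none exist. */
    public Map<String, Integer> get(String key) {
        return vectors.getOrDefault(key, Collections.emptyMap());
    }

    public static void main(String[] args) {
        SelectionalPreferenceSketch sketch = new SelectionalPreferenceSketch();
        sketch.addLink("barks", "nsubj", "dog");
        System.out.println(sketch.get("dog"));         // lemma vector for "dog"
        System.out.println(sketch.get("nsubj|dog"));   // what governs "dog" as nsubj
        System.out.println(sketch.get("barks|nsubj")); // what "barks" governs as nsubj
    }
}
```

For the single link above, "barks" is recorded under nsubj|dog and "dog" under barks|nsubj, which matches the two selectional preference definitions.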
This class implements {@link Filterable}, which allows for fine-grained control of which semantics are retained. The {@link #setSemanticFilter(Set)} method can be used to specify which words should have their semantics retained. Note that the words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with large corpora, where retaining the semantics of all words in memory is infeasible.
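The filtering behavior described above can be sketched as follows. This is a stand-alone illustration, not the library's code: `SemanticFilterSketch` and `addCooccurrence` are hypothetical names, the method names only mirror the ones mentioned here, and treating an empty filter as "retain everything" is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SemanticFilterSketch {
    private final Set<String> retained = new HashSet<>();
    private final Map<String, Map<String, Integer>> vectors = new HashMap<>();

    /** Restricts which words get a vector. Assumption: an empty set retains all words. */
    public void setSemanticFilter(Set<String> words) {
        retained.clear();
        retained.addAll(words);
    }

    private boolean accepts(String word) {
        return retained.isEmpty() || retained.contains(word);
    }

    /** Records that a and b co-occur; each side is updated only if it passes the filter. */
    public void addCooccurrence(String a, String b) {
        update(a, b);
        update(b, a);
    }

    private void update(String focus, String feature) {
        // The feature is never filtered: excluded words still shape other words' vectors.
        if (accepts(focus)) {
            vectors.computeIfAbsent(focus, k -> new HashMap<>()).merge(feature, 1, Integer::sum);
        }
    }

    /** Returns the words that currently have a vector. */
    public Set<String> getWords() {
        return Collections.unmodifiableSet(vectors.keySet());
    }

    /** Returns the counts for word, or an empty map if the word has no vector. */
    public Map<String, Integer> getVectorFor(String word) {
        return vectors.getOrDefault(word, Collections.emptyMap());
    }

    public static void main(String[] args) {
        SemanticFilterSketch space = new SemanticFilterSketch();
        space.setSemanticFilter(Collections.singleton("dog"));
        space.addCooccurrence("dog", "barks");
        System.out.println(space.getWords());            // only "dog" has a vector
        System.out.println(space.getVectorFor("dog"));   // "barks" still appears as a feature
        System.out.println(space.getVectorFor("barks")); // empty: "barks" was filtered out
    }
}
```

Memory use grows with the number of retained words rather than with the full vocabulary, which is the point of the filter on large corpora.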
This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. At any given point in processing, the {@link #getVectorFor(String) getVectorFor} method may be used to access the current semantics of a word. This allows callers to track incremental changes to the semantics as the corpus is processed. The {@link #processSpace(Properties) processSpace} method does nothing other than print the feature indexes in the space to standard output.

@author Keith Stevens
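The concurrency contract described above (parallel processDocument calls, with vectors readable while processing is under way) can be illustrated with a simplified stand-in. It is a sketch under stated assumptions, not this class: `ConcurrentSpaceSketch` and `processAll` are hypothetical, and each input line is assumed to hold one whitespace-separated "head dependent" pair instead of a real dependency parse.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class ConcurrentSpaceSketch {
    // word -> (co-occurring word -> count); both levels tolerate concurrent updates.
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, LongAdder>> vectors =
        new ConcurrentHashMap<>();

    /** Reads one "head dependent" pair per line; safe to call from many threads. */
    public void processDocument(BufferedReader document) throws IOException {
        String line;
        while ((line = document.readLine()) != null) {
            String[] pair = line.trim().split("\\s+");
            if (pair.length != 2) {
                continue; // skip blank or malformed lines
            }
            increment(pair[0], pair[1]);
            increment(pair[1], pair[0]);
        }
    }

    private void increment(String word, String feature) {
        vectors.computeIfAbsent(word, k -> new ConcurrentHashMap<>())
               .computeIfAbsent(feature, k -> new LongAdder())
               .increment();
    }

    /** Returns a snapshot of the counts gathered so far for word. */
    public Map<String, Long> getVectorFor(String word) {
        Map<String, Long> snapshot = new HashMap<>();
        ConcurrentHashMap<String, LongAdder> row = vectors.get(word);
        if (row != null) {
            row.forEach((feature, count) -> snapshot.put(feature, count.sum()));
        }
        return snapshot;
    }

    /** Processes each document on a pool thread and waits for the pool to drain. */
    public static ConcurrentSpaceSketch processAll(String[] documents, int threads) {
        ConcurrentSpaceSketch space = new ConcurrentSpaceSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : documents) {
            pool.submit(() -> {
                try {
                    space.processDocument(new BufferedReader(new StringReader(doc)));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return space;
    }

    public static void main(String[] args) {
        String[] docs = {"barks dog\nchases dog", "barks dog"};
        ConcurrentSpaceSketch space = processAll(docs, 2);
        System.out.println(space.getVectorFor("dog"));
    }
}
```

Because getVectorFor copies the counts into a fresh map, a caller polling it during processing sees a point-in-time view that later updates do not modify.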