cimec.unitn.it/marco/publications/acl2007/coglearningacl07.pdf">here
ISA is notable in that it builds semantics incrementally using both the co-occurrence of a word and the semantics of the co-occurring word. Similar to Random Indexing (RI), ISA uses index vectors to reduce the number of dimensions needed to represent the full co-occurrence matrix. In contrast to other semantic space algorithms such as RI, HAL, and BEAGLE, ISA uses the semantics of the co-occurring words to update the semantics of their neighbors. Formally, the semantics of a word wi are updated for the co-occurrence of another word wj as:
sem(wi) += i · (mc · sem(wj) + (1 - mc) · IV(wj))
where
sem is the semantics for a word, and
IV is the index vector for a word.
i defines the impact rate, which is how much the co-occurrence affects the semantics.
mc defines the degree to which the semantics of wj affect the semantics of the co-occurring word wi. This weighting factor is based on frequency of occurrence; the semantics of frequently occurring words have less impact.
mc is formally defined as 1 ÷ e^(freq(word) ÷ km), where
km is a weighting factor for determining how quickly the semantics of a word diminish in their effect on co-occurring words.
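The update rule above can be sketched in a few lines of plain Java. This is a minimal illustration, not the code of this class: the class name `IsaUpdateSketch`, its method names, the dense `double[]` vectors, and the numeric values in `main` are all assumptions made for the example. It also assumes that `freq(word)` refers to the frequency of the co-occurring word wj.

```java
import java.util.Arrays;

/** Hypothetical sketch of a single ISA update step; not the S-Space implementation. */
final class IsaUpdateSketch {

    /** Computes mc = 1 / e^(freq / km), the weight given to the co-occurring word's semantics. */
    static double semanticWeight(long frequency, double km) {
        return 1.0 / Math.exp(frequency / km);
    }

    /** Applies sem(wi) += i * (mc * sem(wj) + (1 - mc) * IV(wj)) in place on semWi. */
    static void update(double[] semWi, double[] semWj, double[] ivWj,
                       double impactRate, double mc) {
        for (int d = 0; d < semWi.length; d++) {
            semWi[d] += impactRate * (mc * semWj[d] + (1.0 - mc) * ivWj[d]);
        }
    }

    public static void main(String[] args) {
        // Illustrative values only (assumptions, not defaults of this class).
        double[] semWi = {0.0, 0.0, 0.0, 0.0};
        double[] semWj = {0.5, 0.0, -0.5, 0.0};
        double[] ivWj = {1.0, 0.0, 0.0, -1.0}; // a small ternary index vector
        double mc = semanticWeight(10, 100.0);  // wj seen 10 times, km = 100
        update(semWi, semWj, ivWj, 0.003, mc);
        System.out.println("mc = " + mc);
        System.out.println("sem(wi) = " + Arrays.toString(semWi));
    }
}
```

In this sketch a word that has never been seen has mc = 1, so its (initially empty) semantics are weighted fully and its index vector not at all; as its frequency grows, mc shrinks toward 0 and the update is dominated by the index vector.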
This class defines the following configurable properties, which may be set using either the System properties or the {@link IncrementalSemanticAnalysis#IncrementalSemanticAnalysis(Properties)} constructor. The two most important properties for configuring ISA are {@value #IMPACT_RATE_PROPERTY} and {@value #HISTORY_DECAY_RATE_PROPERTY}. Their defaults are the values specified in Baroni et al.
- Property:
{@value #IMPACT_RATE_PROPERTY}
Default: {@value #DEFAULT_IMPACT_RATE} - This property specifies the impact rate of co-occurrence, i.e., the degree to which the co-occurrence of one word affects the semantics of the other. This rate scales both the impact of the index vector of a co-occurring word and the impact of its semantics.
- Property:
{@value #HISTORY_DECAY_RATE_PROPERTY}
Default: {@value #DEFAULT_HISTORY_DECAY_RATE} - This property specifies the decay rate at which the semantics of co-occurring words lessen their impact. A word's frequency of occurrence is combined with the history decay rate to indicate the degree to which the word's semantics will influence (i.e. be added to) the semantics of a co-occurring word. High values will cause the semantics of frequently occurring words to have minimal impact on other words' semantics.
- Property:
{@value #WINDOW_SIZE_PROPERTY}
Default: {@value #DEFAULT_WINDOW_SIZE} - This property sets the number of words before and after a word that are counted as co-occurring. With the default value, {@code 5} words are counted before and {@code 5} words are counted after. This class always uses a symmetric window.
- Property:
{@value #VECTOR_LENGTH_PROPERTY}
Default: {@value #DEFAULT_VECTOR_LENGTH} - This property sets the number of dimensions to be used for the index and semantic vectors.
- Property:
{@value #USE_PERMUTATIONS_PROPERTY}
Default: {@code false} - This property specifies whether to permute the index vectors of co-occurring words. Enabling this option will cause the word semantics to include word-ordering information. However, this option is best used with a larger corpus.
- Property:
{@value #PERMUTATION_FUNCTION_PROPERTY}
Default: {@link edu.ucla.sspace.index.DefaultPermutationFunction DefaultPermutationFunction} - This property specifies the fully qualified class name of a {@link PermutationFunction} instance that will be used to permute index vectors. If {@value #USE_PERMUTATIONS_PROPERTY} is set to {@code false}, the value of this property has no effect.
- Property:
{@value #USE_SPARSE_SEMANTICS_PROPERTY}
Default: {@code false} - This property specifies whether to use a sparse encoding for each word's semantics. Using a sparse encoding can result in a large saving in memory, while requiring more time to process each document.
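The general pattern of reading such settings from a {@code Properties} object, with a fallback default for each, might look like the sketch below. Everything named here is hypothetical: the class `IsaConfigSketch`, the `sketch.isa.*` keys, and the impact-rate fallback are placeholders for illustration and are not the property names or defaults of this class (the real keys are the constants listed above). Only the window-size fallback of 5 and the permutation fallback of false come from the text.

```java
import java.util.Properties;

/** Hypothetical configuration holder; keys are NOT the real S-Space property names. */
final class IsaConfigSketch {
    // Placeholder keys invented for this example.
    static final String IMPACT_RATE_KEY = "sketch.isa.impactRate";
    static final String WINDOW_SIZE_KEY = "sketch.isa.windowSize";
    static final String USE_PERMUTATIONS_KEY = "sketch.isa.usePermutations";

    final double impactRate;
    final int windowSize;
    final boolean usePermutations;

    /** Reads each setting from props, falling back to a default when the key is absent. */
    IsaConfigSketch(Properties props) {
        // "0.01" is a placeholder, not the default impact rate of the real class.
        impactRate = Double.parseDouble(props.getProperty(IMPACT_RATE_KEY, "0.01"));
        // A window of 5 words on each side, as described for the default window size.
        windowSize = Integer.parseInt(props.getProperty(WINDOW_SIZE_KEY, "5"));
        usePermutations = Boolean.parseBoolean(props.getProperty(USE_PERMUTATIONS_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(WINDOW_SIZE_KEY, "3"); // override one setting, keep the rest
        IsaConfigSketch config = new IsaConfigSketch(props);
        System.out.println("window = " + config.windowSize
                + ", permutations = " + config.usePermutations);
    }
}
```

Passing {@code System.getProperties()} to the same constructor would pick up values supplied with {@code -D} on the command line, which mirrors the two configuration routes described above.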
Due to the incremental nature of ISA, instances of this class are not designed to be used by multiple threads. Documents must be processed sequentially to properly model how the semantics of co-occurring words affect each other. Multi-threading would make the ordering of co-occurrences ambiguous.
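A small sketch can show why the ordering matters. It applies the update rule to two words, a and b, once with a updated first and once with b updated first, and the resulting semantics of a differ. The class `IsaOrderSketch`, the two-dimensional vectors, and the values of i and mc are assumptions chosen to keep the arithmetic simple; this is not the code of this class.

```java
import java.util.Arrays;

/** Hypothetical demonstration that ISA updates are order-dependent. */
final class IsaOrderSketch {

    /** Applies target += i * (mc * sem + (1 - mc) * iv) in place. */
    static void update(double[] target, double[] sem, double[] iv, double i, double mc) {
        for (int d = 0; d < target.length; d++) {
            target[d] += i * (mc * sem[d] + (1.0 - mc) * iv[d]);
        }
    }

    /** Runs both updates in the requested order and returns the final semantics of a. */
    static double[] run(boolean updateAFirst) {
        double[] semA = {0.0, 0.0};
        double[] semB = {0.0, 0.0};
        double[] ivA = {1.0, 0.0};
        double[] ivB = {0.0, 1.0};
        double i = 0.5;   // illustrative impact rate
        double mc = 0.5;  // illustrative semantic weight, held fixed for simplicity
        if (updateAFirst) {
            update(semA, semB, ivB, i, mc); // a absorbs b while b is still empty
            update(semB, semA, ivA, i, mc);
        } else {
            update(semB, semA, ivA, i, mc); // b absorbs a first
            update(semA, semB, ivB, i, mc); // a now also absorbs b's updated semantics
        }
        return semA;
    }

    public static void main(String[] args) {
        System.out.println("a first: " + Arrays.toString(run(true)));
        System.out.println("b first: " + Arrays.toString(run(false)));
    }
}
```

Because each update reads the current semantics of the other word, two threads interleaving these calls could produce either outcome, which is the ambiguity that sequential processing avoids.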
@author David Jurgens