HAL is based on recording the co-occurrence of words in a sparse matrix. HAL also incorporates word order information by treating the co-occurrence of two words x y as different from that of y x. Each word is assigned a unique index in the co-occurrence matrix. For some word x, when another word y co-occurs before it, matrix entry x,y is updated. Similarly, when y co-occurs after x, matrix entry y,x is updated. Therefore, the full semantic vector for any word is its row vector concatenated with its column vector.
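The bookkeeping described above can be illustrated with a minimal sketch. It is not this class's implementation: the class name {@code HalCooccurrenceSketch}, its method names, the dense {@code double[][]} storage (the real matrix is sparse), and the linear distance weighting are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of HAL-style ordered co-occurrence counting. */
final class HalCooccurrenceSketch {

    /** Assigns each distinct word a unique index, in order of first appearance. */
    static Map<String, Integer> indexWords(List<String> tokens) {
        Map<String, Integer> index = new HashMap<>();
        for (String token : tokens) {
            index.putIfAbsent(token, index.size());
        }
        return index;
    }

    /**
     * Builds the co-occurrence matrix. Entry [x][y] accumulates weight each
     * time word y occurs before word x within the window. Scanning only
     * backwards is enough: the same update also records, from y's point of
     * view, that x occurred after it (y's column).
     * Assumption: weight decreases linearly with distance.
     */
    static double[][] count(List<String> tokens, Map<String, Integer> index,
                            int windowSize) {
        int n = index.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < tokens.size(); i++) {
            int focus = index.get(tokens.get(i));
            for (int d = 1; d <= windowSize && i - d >= 0; d++) {
                int earlier = index.get(tokens.get(i - d));
                matrix[focus][earlier] += windowSize - d + 1;
            }
        }
        return matrix;
    }

    /** Full semantic vector: the word's row followed by its column (length 2*N). */
    static double[] semanticVector(double[][] matrix, int word) {
        int n = matrix.length;
        double[] vector = new double[2 * n];
        for (int j = 0; j < n; j++) {
            vector[j] = matrix[word][j];       // words seen before this word
            vector[n + j] = matrix[j][word];   // words seen after this word
        }
        return vector;
    }
}
```

For the token sequence "the cat sat" with a window of 2, the row for "cat" records "the" (which preceded it) and the column for "cat" records "sat" (which followed it), so the two halves of the concatenated vector carry the two directions of word order.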
Typically, the full vectors are used (for an N x N matrix, these are 2*N in length). However, HAL also offers two possibilities for dimensionality reduction. Not all columns provide an equal amount of information for distinguishing the meanings of the words. Specifically, the information-theoretic entropy of each column can be calculated as a way of ordering the columns by their importance. Using this ranking, either a fixed number of columns may be retained, or a threshold may be set to filter out low-entropy columns.
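The two reduction strategies can be sketched as follows. This is an illustrative sketch only: the class {@code ColumnEntropySketch} and its methods are hypothetical, and normalizing each column's counts into a probability distribution before computing Shannon entropy (in bits) is an assumption about how the entropy is obtained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of entropy-based column selection. */
final class ColumnEntropySketch {

    /** Shannon entropy (bits) of one column, treating its counts as a distribution. */
    static double columnEntropy(double[][] matrix, int col) {
        double total = 0;
        for (double[] row : matrix) {
            total += row[col];
        }
        if (total == 0) {
            return 0; // an empty column carries no information
        }
        double entropy = 0;
        for (double[] row : matrix) {
            if (row[col] > 0) {
                double p = row[col] / total;
                entropy -= p * Math.log(p) / Math.log(2);
            }
        }
        return entropy;
    }

    /** Column indices ordered from highest to lowest entropy (ties keep index order). */
    static List<Integer> rankColumns(double[][] matrix) {
        int cols = matrix.length == 0 ? 0 : matrix[0].length;
        double[] entropy = new double[cols];
        List<Integer> order = new ArrayList<>();
        for (int c = 0; c < cols; c++) {
            entropy[c] = columnEntropy(matrix, c);
            order.add(c);
        }
        order.sort((a, b) -> Double.compare(entropy[b], entropy[a]));
        return order;
    }

    /** Option 1: retain a fixed number of the highest-entropy columns. */
    static List<Integer> keepTop(double[][] matrix, int k) {
        List<Integer> order = rankColumns(matrix);
        return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
    }

    /** Option 2: retain every column whose entropy meets the threshold. */
    static List<Integer> keepAbove(double[][] matrix, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int c : rankColumns(matrix)) {
            if (columnEntropy(matrix, c) >= threshold) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

Either method returns the indices of the columns to retain; the reduced vectors are then formed by reading only those positions.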
A {@link HyperspaceAnalogueToLanguage} model is defined by four parameters. The default constructor uses reasonable parameters that match those mentioned in the original publication. For alternative models, appropriate values must be passed in through the full constructor. The four parameters are:
For models that require a non-symmetric window, a special {@link WeightingFunction} can be used which assigns a weight of {@code 0} to co-occurrences that match the non-symmetric window size.
@author Alex Nau
@author David Jurgens
@see SemanticSpace
@see WeightingFunction