LRA uses three main components to analyze a large corpus of text in order to measure relational similarity between pairs of words (i.e., analogies). LRA uses a search engine to find patterns connecting each word pair in the input set, as well as its corresponding alternate pairs (see {@link #loadAnalogiesFromFile(String)}). A sparse matrix is then generated, where each entry is the number of times the row's word pair occurs with the column's pattern between its two words.
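As a rough sketch of how such a pair-by-pattern count matrix might be accumulated (the class and method names below are illustrative only, not part of this class's API):

<pre>{@code
import java.util.HashMap;
import java.util.Map;

// Accumulates how often a word pair co-occurs with a given intervening pattern.
class PairPatternCounts {

    // row key: "word1:word2", column key: the pattern found between the words,
    // value: the number of observed occurrences in the corpus
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void increment(String pair, String pattern) {
        counts.computeIfAbsent(pair, p -> new HashMap<>())
              .merge(pattern, 1, Integer::sum);
    }

    int count(String pair, String pattern) {
        return counts.getOrDefault(pair, new HashMap<>())
                     .getOrDefault(pattern, 0);
    }
}
}</pre>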
After the matrix has been built, the Singular Value Decomposition (SVD) is used to reduce the dimensionality of the original pair-pattern matrix, denoted as A. The SVD factors any matrix A into three matrices U Σ V^T such that Σ is a diagonal matrix containing the singular values of A, ordered by decreasing magnitude so that the first singular value accounts for the most variance in A. The original matrix may be approximated by keeping only the k largest singular values and setting the rest to 0. The approximated matrix Â = U_k Σ_k V_k^T is the least-squares best-fit rank-k approximation of A. LRA reduces the dimensionality by keeping only the first k columns of U and the first k singular values of Σ. The projection matrix U_k Σ_k is then used to compute the relational similarity between word pairs as the cosine of the row vectors corresponding to those pairs.
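A minimal sketch of that final step, assuming U_k and the k largest singular values have already been produced by some SVD routine (the array layout and method name are assumptions for illustration, not this class's API):

<pre>{@code
// uk:    n x k matrix holding the first k columns of U (one row per word pair)
// sigma: the k largest singular values of A
// The i-th row of the projection U_k Sigma_k is uk[i] scaled element-wise by sigma.
static double relationalSimilarity(double[][] uk, double[] sigma, int pairA, int pairB) {
    double dot = 0, normA = 0, normB = 0;
    for (int j = 0; j < sigma.length; j++) {
        double a = uk[pairA][j] * sigma[j];
        double b = uk[pairB][j] * sigma[j];
        dot += a * b;
        normA += a * a;
        normB += b * b;
    }
    // Cosine of the two projected row vectors
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
}</pre>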
This class uses the Apache Lucene search engine to index the corpus and to search it efficiently for occurrences of word pairs. It also uses WordNet, through the JAWS interface, to find alternate word pairs for the given input pairs. @author Sky Lin
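For example, occurrences of a pattern such as "X ... Y" with a bounded number of intervening words could be located with a sloppy phrase query. The sketch below is only an illustration; the index path, field name, example words, and slop value are assumptions, and the exact classes available depend on the Lucene version:

<pre>{@code
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class PatternSearchExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
            DirectoryReader.open(FSDirectory.open(Paths.get("corpus-index"))));

        // Match "carpenter" followed by "wood" with at most 3 words between them.
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        builder.setSlop(3);
        builder.add(new Term("contents", "carpenter"));
        builder.add(new Term("contents", "wood"));

        TopDocs hits = searcher.search(builder.build(), 100);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("contents"));
        }
    }
}
}</pre>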