LSA first processes documents into a word-document matrix in which each unique word is assigned a row and each column represents a document. The values of this matrix are the number of times the row's word occurs in the column's document. After the matrix has been built, the Singular Value Decomposition (SVD) is used to reduce the dimensionality of the original word-document matrix, denoted A. The SVD factors any matrix A into three matrices U Σ V^T such that Σ is a diagonal matrix containing the singular values of A. The singular values in Σ are ordered from largest to smallest, so the leading values account for the most variance in the values of A. The original matrix may be approximated by recomputing it with only the first k of these singular values and setting the rest to 0. The approximated matrix Â = U_k Σ_k V_k^T is the least-squares best-fit rank-k approximation of A. LSA reduces the dimensions by keeping only the first k dimensions of the row vectors of U. These vectors form the semantic space of the words.
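The matrix-building step described above can be sketched in plain Java. This is only an illustration under stated assumptions, not this class's implementation: the class name {@code TermDocumentMatrix}, its methods, and the lowercase whitespace tokenizer are all hypothetical, and the SVD step is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of this library.
public class TermDocumentMatrix {

    /** Assigns each unique word a row index, in order of first appearance. */
    public static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : tokenize(doc)) {
                rows.putIfAbsent(word, rows.size());
            }
        }
        return rows;
    }

    /**
     * Builds the word-document count matrix: counts[r][c] is the number of
     * times the word in row r occurs in document c. The vocabulary must have
     * been built from the same documents.
     */
    public static int[][] countMatrix(List<String> documents,
                                      Map<String, Integer> rows) {
        int[][] counts = new int[rows.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String word : tokenize(documents.get(col))) {
                counts[rows.get(word)][col]++;
            }
        }
        return counts;
    }

    /** Assumed tokenizer: lowercases and splits on whitespace. */
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String token : doc.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the dog sat on the mat");
        Map<String, Integer> rows = buildVocabulary(docs);
        int[][] counts = countMatrix(docs, rows);
        // Print one row per word, one count per document.
        for (Map.Entry<String, Integer> e : rows.entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey());
            for (int count : counts[e.getValue()]) {
                line.append(' ').append(count);
            }
            System.out.println(line);
        }
    }
}
```

In this sketch each row of {@code counts} is the raw occurrence vector for one word; the SVD would then be applied to this matrix to obtain the reduced word vectors.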
This class offers configurable preprocessing and dimensionality reduction through the properties listed below. These properties should be specified in the {@code Properties} object passed to the {@link #processSpace(Properties) processSpace} method.
{@value #MATRIX_TRANSFORM_PROPERTY}
{@value #LSA_DIMENSIONS_PROPERTY}
{@value #LSA_SVD_ALGORITHM_PROPERTY}
{@value #RETAIN_DOCUMENT_SPACE_PROPERTY}
This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. Once {@link #processSpace(Properties) processSpace} has been called, no further calls to {@code processDocument} should be made. This implementation does not support access to the semantic vectors until after {@code processSpace} has been called.
@see Transform
@see SingularValueDecomposition
@author David Jurgens