Package edu.ucla.sspace.lsa

Examples of edu.ucla.sspace.lsa.LatentSemanticAnalysis

See the Wikipedia page on Latent Semantic Analysis for an executive summary.

LSA first processes documents into a word-document matrix where each unique word is assigned a row in the matrix, and each column represents a document. The values of this matrix correspond to the number of times the row's word occurs in the column's document. After the matrix has been built, the Singular Value Decomposition (SVD) is used to reduce the dimensionality of the original word-document matrix, denoted as A. The SVD is a way of factoring any matrix A into three matrices U Σ V^T such that Σ is a diagonal matrix containing the singular values of A. The singular values in Σ are ordered from largest to smallest, so the first ones account for the most variance in the values of A. The original matrix may be approximated by recomputing the matrix with only the first k of these singular values and setting the rest to 0. The approximated matrix Â = U_k Σ_k V_k^T is the least squares best-fit rank-k approximation of A. LSA reduces the dimensions by keeping only the first k dimensions from the row vectors of U. These vectors form the semantic space of the words.
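The first step described above, building the word-document count matrix A, can be sketched in plain Java. This is an illustrative sketch only, not the library's implementation: the class name `TermDocumentCounts`, its methods, and the whitespace tokenization are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch (hypothetical class, not part of S-Space). */
public class TermDocumentCounts {

    /** Assigns each unique word a row index, in order of first appearance. */
    public static Map<String, Integer> indexWords(List<String> documents) {
        Map<String, Integer> wordToRow = new LinkedHashMap<>();
        for (String doc : documents) {
            // Assumption: tokens are lowercased and split on whitespace.
            for (String word : doc.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    wordToRow.putIfAbsent(word, wordToRow.size());
                }
            }
        }
        return wordToRow;
    }

    /** Builds A, where a[i][j] is the number of times word i occurs in document j. */
    public static int[][] countMatrix(List<String> documents,
                                      Map<String, Integer> wordToRow) {
        int[][] a = new int[wordToRow.size()][documents.size()];
        for (int j = 0; j < documents.size(); j++) {
            for (String word : documents.get(j).toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    a[wordToRow.get(word)][j]++;
                }
            }
        }
        return a;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat on the cat");
        Map<String, Integer> rows = indexWords(docs);
        int[][] a = countMatrix(docs, rows);
        // Print one row per word: the word followed by its per-document counts.
        for (Map.Entry<String, Integer> e : rows.entrySet()) {
            System.out.println(e.getKey() + "\t" + Arrays.toString(a[e.getValue()]));
        }
    }
}
```

A matrix like this would then be transformed and factored by the SVD; the factorization itself is left out of the sketch.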

This class offers configurable preprocessing and dimensionality reduction through the following properties. These properties should be specified in the {@code Properties} object passed to the {@link #processSpace(Properties) processSpace} method.

Property: {@value #MATRIX_TRANSFORM_PROPERTY}
Default: {@link LogEntropyTransform}
This property sets the preprocessing algorithm to use on the term-document matrix prior to computing the SVD. The property value should be the fully qualified name of a class that implements {@link Transform}. The class should be public, not abstract, and should provide a public no-arg constructor.

Property: {@value #LSA_DIMENSIONS_PROPERTY}
Default: {@code 300}
The number of dimensions to use for the semantic space. This value is used as input to the SVD.

Property: {@value #LSA_SVD_ALGORITHM_PROPERTY}
Default: {@link edu.ucla.sspace.matrix.SVD.Algorithm#ANY}
This property sets the specific SVD algorithm that LSA will use to reduce the dimensionality of the word-document matrix. In general, users should not need to set this property, as the default behavior will choose the fastest algorithm available on the system.

Property: {@value #RETAIN_DOCUMENT_SPACE_PROPERTY}
Default: {@code false}
This property indicates whether the document space should be retained after {@code processSpace}. Setting this property to {@code true} will enable the {@link #getDocumentVector(int) getDocumentVector} method.
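As a rough illustration of how the four properties above might be collected into a {@code Properties} object, the sketch below uses only `java.util.Properties`. The key strings and the package of `LogEntropyTransform` are assumptions written from memory and are not confirmed by this page; real code should reference the class constants (`MATRIX_TRANSFORM_PROPERTY`, `LSA_DIMENSIONS_PROPERTY`, `LSA_SVD_ALGORITHM_PROPERTY`, `RETAIN_DOCUMENT_SPACE_PROPERTY`) rather than literal strings. The helper class `LsaPropertiesSketch` is hypothetical.

```java
import java.util.Properties;

/** Illustrative sketch (hypothetical helper, not part of S-Space). */
public class LsaPropertiesSketch {

    // ASSUMPTION: these key strings are placeholders; prefer the constants
    // defined on LatentSemanticAnalysis in real code.
    static final String PREFIX = "edu.ucla.sspace.lsa.LatentSemanticAnalysis";
    public static final String TRANSFORM_KEY = PREFIX + ".transform";
    public static final String DIMENSIONS_KEY = PREFIX + ".dimensions";
    public static final String SVD_ALGORITHM_KEY = PREFIX + ".svd.algorithm";
    public static final String RETAIN_DOC_SPACE_KEY = PREFIX + ".retainDocSpace";

    /** Collects the four documented settings as string-valued properties. */
    public static Properties build(String transformClassName,
                                   int dimensions,
                                   String svdAlgorithm,
                                   boolean retainDocumentSpace) {
        if (dimensions <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        Properties props = new Properties();
        // Fully qualified name of a public, concrete Transform implementation.
        props.setProperty(TRANSFORM_KEY, transformClassName);
        // Number of dimensions k kept for the semantic space.
        props.setProperty(DIMENSIONS_KEY, Integer.toString(dimensions));
        // Name of an SVD algorithm constant, e.g. "ANY".
        props.setProperty(SVD_ALGORITHM_KEY, svdAlgorithm);
        // Whether document vectors stay available after processSpace.
        props.setProperty(RETAIN_DOC_SPACE_KEY, Boolean.toString(retainDocumentSpace));
        return props;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the package name of LogEntropyTransform is not stated on this page.
        Properties props = build(
            "edu.ucla.sspace.matrix.LogEntropyTransform", 300, "ANY", true);
        props.list(System.out);
    }
}
```

The resulting object is what would be handed to {@code processSpace}; any property left unset falls back to the default listed above.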

This class is thread-safe for concurrent calls of {@link #processDocument(BufferedReader) processDocument}. Once {@link #processSpace(Properties) processSpace} has been called, no further calls to {@code processDocument} should be made. This implementation does not support access to the semantic vectors until after {@code processSpace} has been called.

@see Transform
@see SingularValueDecomposition
@author David Jurgens
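The two-phase contract described above (concurrent {@code processDocument} calls, then a single {@code processSpace}, then reads) can be mimicked with a small stand-in class. Everything in this sketch is hypothetical: `TwoPhaseSketch` is not the S-Space class, it only counts words instead of building a semantic space, and throwing `IllegalStateException` on out-of-order calls is a choice made for the example rather than documented library behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in that mimics the two-phase usage contract. */
public class TwoPhaseSketch {
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();
    private volatile boolean processed = false;

    /** Safe to call from many threads, but only before processSpace(). */
    public void processDocument(BufferedReader document) throws IOException {
        if (processed) {
            throw new IllegalStateException("processSpace has already been called");
        }
        String line;
        while ((line = document.readLine()) != null) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.computeIfAbsent(word, w -> new AtomicInteger()).incrementAndGet();
                }
            }
        }
    }

    /** Ends the document phase; the sketch performs no real reduction. */
    public void processSpace() {
        processed = true;
    }

    /** Readable only after processSpace(), mirroring the documented restriction. */
    public int getCount(String word) {
        if (!processed) {
            throw new IllegalStateException("call processSpace first");
        }
        AtomicInteger c = counts.get(word);
        return c == null ? 0 : c.get();
    }

    /** Feeds documents from a thread pool, waits for all of them, then finalizes once. */
    public static TwoPhaseSketch build(List<String> documents) throws InterruptedException {
        TwoPhaseSketch space = new TwoPhaseSketch();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String text : documents) {
            pool.execute(() -> {
                try {
                    space.processDocument(new BufferedReader(new StringReader(text)));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("document processing timed out");
        }
        space.processSpace();
        return space;
    }

    public static void main(String[] args) throws InterruptedException {
        TwoPhaseSketch space = build(Arrays.asList("the cat sat", "the dog sat on the cat"));
        System.out.println("the: " + space.getCount("the"));
    }
}
```

The caller is responsible for waiting until every document task has finished before finalizing, which is why the sketch shuts the pool down and awaits termination before calling `processSpace`.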


            // Resolve the requested SVD algorithm by name, defaulting to ANY.
            String algName = argOptions.getStringOption("svdAlgorithm", "ANY");
            SingularValueDecomposition factorization = SVD.getFactorization(
                    Algorithm.valueOf(algName.toUpperCase()));
            // Use a string-keyed basis mapping for the terms.
            basis = new StringBasisMapping();

            return new LatentSemanticAnalysis(
                false, dimensions, transform, factorization, false, basis);
        } catch (IOException ioe) {
            // Rethrow I/O failures as an unchecked error.
            throw new IOError(ioe);
        }
    }