Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.
LSA is an information retrieval technique that analyzes an unstructured collection of text, identifies the patterns within it, and uncovers the relationships between documents and the terms they contain.
LSA itself is an unsupervised way of uncovering synonyms in a collection of documents.
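As a minimal sketch of how LSA surfaces synonym-like relationships, the toy example below (hypothetical data, plain NumPy) applies a truncated SVD to a small term-document count matrix. "car" and "auto" never appear in the same pattern of raw counts as each other's exact copies, yet they land close together in the latent space because they occur in the same documents:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Terms: "car", "auto", "flower", "petal"; the three documents
# split into a "vehicles" topic and a "plants" topic.
terms = ["car", "auto", "flower", "petal"]
X = np.array([
    [2, 1, 0],   # car
    [1, 2, 0],   # auto
    [0, 0, 2],   # flower
    [0, 0, 1],   # petal
], dtype=float)

# LSA: keep only the k largest singular values, projecting terms
# and documents into a shared low-dimensional latent space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]        # latent term representations
doc_vecs = Vt[:k, :].T * s[:k]      # latent document representations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(term_vecs[0], term_vecs[1]))  # car vs. auto: close to 1
print(cosine(term_vecs[0], term_vecs[2]))  # car vs. flower: close to 0
```

In a real pipeline the counts would typically be TF-IDF-weighted and the SVD truncated to a few hundred dimensions, but the mechanism is the same.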
To start, we take a look at how Latent Semantic Analysis is used in Natural Language Processing to analyze relationships between a set of documents and the terms they contain. Then we go a few steps further to analyze and classify sentiment. Along the way, we will review the chi-squared test for feature selection.
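To preview the chi-squared idea with hypothetical data, the sketch below computes the chi-squared statistic for a single binary word feature against binary sentiment labels, using a 2x2 contingency table and plain NumPy. A word that co-varies strongly with the class gets a high score; an uninformative word scores near zero:

```python
import numpy as np

def chi2_feature_score(presence, labels):
    """Chi-squared statistic for one binary feature vs. binary class labels."""
    presence = np.asarray(presence, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    # Observed 2x2 contingency table: feature present/absent x class.
    obs = np.array([
        [np.sum(presence & labels), np.sum(presence & ~labels)],
        [np.sum(~presence & labels), np.sum(~presence & ~labels)],
    ], dtype=float)
    # Expected counts if the feature were independent of the class.
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row @ col / obs.sum()
    return float(np.sum((obs - exp) ** 2 / exp))

# Hypothetical toy corpus: 1 = positive review, 0 = negative.
labels = [1, 1, 1, 0, 0, 0]
great = [1, 1, 1, 0, 0, 0]   # appears only in positive reviews
the = [1, 0, 1, 1, 0, 1]     # appears regardless of sentiment

print(chi2_feature_score(great, labels))  # 6.0 -> highly informative
print(chi2_feature_score(the, labels))    # 0.0 -> independent of the class
```

Ranking features by this score and keeping the top k is the standard chi-squared feature-selection recipe for text classification.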
In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning — from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling.
In this post, we will explore topic modeling through four of the most popular techniques in use today: LSA, pLSA, LDA, and the newer, deep-learning-based lda2vec.
The basic assumption behind using PCA for cluster analysis and dimensionality reduction is: the directions with the greatest spread (variance) contain the most information.
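This assumption can be illustrated with a small sketch on synthetic data: we generate points stretched strongly along one axis, run PCA via an eigendecomposition of the covariance matrix, and check how much of the total variance the first principal component explains. All names and data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along one direction: per the assumption
# above, most of the "information" lies along that high-variance axis.
base = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])
X = base - base.mean(axis=0)  # center the data

# PCA via eigendecomposition of the sample covariance matrix.
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # largest variance first
explained = eigvals[order] / eigvals.sum()

print(explained)  # first component explains the bulk of the variance
```

Projecting onto the top components then discards the low-variance directions, which is exactly the dimensionality reduction LSA performs on the term-document matrix via SVD.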
Because a Bayesian probability calculation applied to a simple co-occurrence frequency table built from the same data has similar disambiguation capabilities, the paper also includes a comparison of LSA with this Bayesian model.