Latent semantic indexing - Definition and Overview

Latent semantic analysis (LSA) is a technique in information retrieval invented in 1990 [1] (http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf). It is sometimes called latent semantic indexing (LSI).

LSA is a preprocessing step, used before the classification or search of documents. The purpose of LSA is to make documents easier to classify and search. LSA is meant to solve two fundamental problems in natural language processing: synonymy and polysemy. In synonymy, different writers use different words to describe the same idea. Thus, a person issuing a query in a search engine may use a different word than appears in a document, and may not retrieve the document. In polysemy, the same word can have multiple meanings, so a searcher can get unwanted documents with the alternate meanings.

LSA starts with a document-term matrix, a sparse matrix whose rows correspond to documents and whose columns correspond to terms (typically stemmed words that appear in the documents). The values of the matrix are typically tf-idf weights: each value is proportional to the number of times the term appears in the document, and rare terms are upweighted to reflect their relative importance.
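As a rough illustration of such a matrix, the sketch below builds tf-idf weights for a three-document toy corpus. The corpus, the function name `tfidf_matrix`, and the exact weighting (raw count times the logarithm of inverse document frequency) are assumptions chosen for brevity; many tf-idf variants exist, and a real pipeline would also stem the words and store the matrix in a sparse format.

```python
import math
from collections import Counter

import numpy as np

# Toy corpus (hypothetical); words are split on whitespace, no stemming.
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "the chef cooked the pasta",
]

def tfidf_matrix(docs):
    """Return (matrix, vocabulary): one row per document, one column per term."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    col = {t: j for j, t in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    X = np.zeros((n_docs, len(vocab)))
    for i, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            # Term count times log inverse document frequency (one common variant).
            X[i, col[term]] = count * math.log(n_docs / df[term])
    return X, vocab

X, vocab = tfidf_matrix(docs)
```

With this weighting, a term that occurs in every document (here "the") gets weight zero, while a term confined to one document (such as "pasta") gets the largest weight per occurrence.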

LSA then finds a low-rank approximation to the document-term matrix by means of singular value decomposition (SVD). In LSA, this SVD is truncated, so that each document and each term is represented by a vector of much lower dimensionality than the total number of words in the vocabulary. When a user issues a query, it is mapped into this low-dimensional space and compared with the documents in that same space.
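The truncation and the query mapping can be sketched with NumPy as follows. The matrix `X`, the rank `k = 2`, and the helper names `fold_in` and `cosine` are illustrative assumptions. Scaling document vectors by the singular values and projecting queries onto the right singular vectors is one common convention, not the only one.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents (rows) x 5 terms (columns).
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 2.0],
])

k = 2  # number of latent dimensions kept (assumed here; tuned in practice)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

X_k = U_k @ np.diag(s_k) @ Vt_k   # rank-k approximation of X
doc_vecs = U_k * s_k              # each row: one document in k dimensions

def fold_in(term_weights):
    """Project a vector over the vocabulary (e.g. a query) into the latent space."""
    return term_weights @ Vt_k.T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # a query containing only term 0
q_vec = fold_in(query)
scores = [cosine(q_vec, d) for d in doc_vecs]
```

Under this convention a query whose term weights equal a document's row lands exactly on that document's latent vector, so queries and documents are directly comparable. In this toy matrix the first two documents score higher for the query than the last two.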

Because LSA uses a low-dimensional representation for terms and documents, it cannot simply record which terms occur and must instead capture the meaning of documents. Thus, documents and terms with similar meaning are close in the low-dimensional space. This can mitigate polysemy (by using more than one word in the query to disambiguate in the low-dimensional space) and synonymy (because synonymous words map to similar positions in the low-dimensional space).
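The synonymy effect can be seen in a deliberately small example. In the hypothetical matrix below, "car" and "automobile" never appear in the same document, so their raw columns are orthogonal, but both co-occur with "engine". The vocabulary, the counts, and the choice `k = 2` are assumptions made so that the effect is easy to see; real corpora give far noisier results.

```python
import numpy as np

terms = ["car", "automobile", "engine", "recipe", "oven"]

# Hypothetical counts: rows are documents, columns follow `terms`.
X = np.array([
    [1.0, 0.0, 1.0, 0.0, 0.0],   # "car engine"
    [0.0, 1.0, 1.0, 0.0, 0.0],   # "automobile engine"
    [0.0, 0.0, 0.0, 1.0, 1.0],   # "recipe oven"
    [0.0, 0.0, 0.0, 1.0, 1.0],   # "recipe oven"
])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

car, auto, recipe = (terms.index(t) for t in ("car", "automobile", "recipe"))

# In the raw matrix the two synonyms share no document.
raw_sim = cosine(X[:, car], X[:, auto])

# Truncated SVD: keep the k largest singular values and their vectors.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = Vt[:k, :].T * s[:k]   # each row: one term in k dimensions

latent_sim = cosine(term_vecs[car], term_vecs[auto])
unrelated_sim = cosine(term_vecs[car], term_vecs[recipe])
```

Here `raw_sim` is zero, `latent_sim` is close to one because the two words share the context "engine", and `unrelated_sim` stays close to zero.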

LSA has come under criticism because its probabilistic model does not match the observed data. LSA assumes that words and documents form a joint Gaussian model. However, Gaussian models can generate negative values, and it is impossible to have a negative number of words in a document. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA. Nevertheless, LSA remains a standard algorithm in information retrieval.
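The negative-value problem is easy to reproduce. A truncated SVD gives the best low-rank fit in the least-squares sense, and nothing in that criterion keeps the fitted values non-negative. The count matrix below is a hypothetical example chosen so that the rank-2 reconstruction has a negative entry; it is a sketch of the issue, not a measurement on real data.

```python
import numpy as np

# Hypothetical term counts: 3 documents x 3 terms, all entries non-negative.
X = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 1.0],
    [0.0, 1.0, 2.0],
])

k = 2
U, s, Vt = np.linalg.svd(X)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-2 least-squares fit

# Document 0 never contains term 2, yet the fitted "count" is below zero
# (about -0.146 for this matrix).
fitted = X_k[0, 2]
has_negative = bool((X_k < 0).any())
```

A multinomial model of the kind used by probabilistic latent semantic analysis assigns probabilities to word occurrences, so it cannot produce negative expected counts.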

Copyright 2009 WordIQ.com
This article is licensed under the GNU Free Documentation License. It uses material from this Wikipedia article.