A stemming algorithm is a method of reducing words to their stem, base, or root form. Stemming has been a long-standing problem in computer science; the first paper on the subject was published in 1968. The process of stemming, often called conflation, is useful in search engines, natural language processing, and other text processing problems.
For example, a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
Methods
There are several types of stemming algorithms. Common techniques include suffix stripping and lookup table replacement. Lemmatization first determines a word's part of speech before attempting to find the root, since in some languages the stemming rules depend on a word's part of speech.
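The two techniques named above can be combined in a few lines. The following Python code is only a minimal sketch, not the Porter or Lovins algorithm: the suffix list, the lookup table entries, the minimum stem length, and the function names are all hypothetical choices made for illustration.

```python
# Hypothetical, deliberately tiny rule set. Real stemmers use many more
# rules, plus conditions on what the remaining stem must look like.
SUFFIXES = ("ing", "ed", "er", "s")

# Hypothetical lookup table for irregular forms that suffix rules miss.
LOOKUP = {"ran": "run", "geese": "goose"}

# Assumed guard: never strip a suffix if fewer than this many letters remain.
MIN_STEM_LENGTH = 3


def strip_suffix(word):
    """Remove the first listed suffix that leaves a long enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Try lookup table replacement first, then fall back to suffix stripping."""
    word = word.lower()
    if word in LOOKUP:
        return LOOKUP[word]
    return strip_suffix(word)


if __name__ == "__main__":
    for w in ("fishing", "fished", "fish", "fisher", "ran", "sing"):
        print(w, "->", stem(w))
```

With these assumed rules, "fishing", "fished", "fish", and "fisher" all reduce to "fish", while "sing" is left alone because stripping "ing" would leave too short a stem. A rule set this small also over-strips many ordinary words, which is one reason practical stemmers attach conditions to each rule.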
While much of the work in this area has focused on the English language (with significant use of the Porter stemmer), other languages have also been investigated, including at least German, French, Italian, Spanish, Portuguese, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hebrew, and Arabic. Hebrew and Arabic are still considered difficult languages for stemming research.
Further reading
- Frakes, W. B. "Stemming Algorithms." In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Upper Saddle River, NJ, 1992.
- Lovins, J. B. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics 11, 1968, 22--31.
- Porter, M. F. "An Algorithm for Suffix Stripping." Program 14, 1980, 130--137.