Text normalization - Definition and Overview

Text normalization is the process of transforming text into a single consistent form that it may not have had before. It is often performed before the text is processed further, for example before generating synthesized speech, performing automated language translation, or storing the text in a database.

Examples of text normalization:

  • Unicode NFC (Normalization Form C, Canonical Composition), in which a base character and its combining marks are canonically composed into a single precomposed code point where one exists.
  • Unicode NFD (Normalization Form D, Canonical Decomposition), in which precomposed characters are canonically decomposed, usually into a base character followed by separate combining code points.
  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations
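
The two Unicode forms listed above can be illustrated with a minimal Python sketch. It uses the standard library's unicodedata.normalize function; the sample strings and variable names are chosen here purely for illustration.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# Both strings render the same, but they differ as code point sequences.
print(composed == decomposed)   # False

# NFC composes the base letter and the accent into one code point.
nfc = unicodedata.normalize("NFC", decomposed)

# NFD splits the precomposed letter into base letter plus combining mark.
nfd = unicodedata.normalize("NFD", composed)

print(len(nfc), len(nfd))       # 1 2
print(nfc == composed, nfd == decomposed)   # True True
```

Normalizing both sides of a comparison to the same form is what makes visually identical strings compare equal.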

While text normalization may be done manually, and usually is for ad hoc and personal documents, many programming languages provide mechanisms that support it.
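
As a rough sketch of such mechanisms, the following Python code chains several of the example operations (lower-casing, abbreviation expansion, diacritic removal, punctuation removal) using only the standard library. The function names and the abbreviation table are hypothetical and exist only for this illustration; a real application would choose its own steps, order, and vocabulary.

```python
import string
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "approx.": "approximately"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    # Replace whitespace-separated tokens that appear in the table.
    return " ".join(table.get(word, word) for word in text.split())


def strip_diacritics(text):
    # Decompose to NFD, drop combining marks, then recompose to NFC.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def remove_punctuation(text):
    # string.punctuation covers ASCII punctuation only.
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize_text(text):
    text = text.lower()
    # Expand before removing punctuation, since the table keys end in ".".
    text = expand_abbreviations(text)
    text = strip_diacritics(text)
    text = remove_punctuation(text)
    # Collapse any runs of whitespace left behind.
    return " ".join(text.split())


print(normalize_text("Dr. Müller lives at 5 Café St."))
# doctor muller lives at 5 cafe street
```

The order of the steps matters: removing punctuation first would turn "dr." into "dr" and the table lookup would no longer match. A simple lookup like this also cannot resolve ambiguous abbreviations, such as "st." meaning either "street" or "saint".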

Copyright 2009 WordIQ.com
This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Text normalization".