UTF-7 - Definition and Overview
UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode-encoded text using a stream of ASCII characters, for example for use in MIME messages.
MIME requires that the encoding used to send e-mail be ASCII, so any e-mail that directly uses an 8-bit or 16-bit Unicode encoding such as UTF-16 is invalid. Unicode text encoded in UTF-7 can be sent in e-mail without a separate transfer encoding, but UTF-7 must still be explicitly identified as the text character set. In addition, when used within e-mail headers such as "Subject:", UTF-7 text must be contained in MIME encoded words that identify the character set. For these and other reasons, UTF-7 has been largely deprecated for e-mail in favor of UTF-8.
A modified form of UTF-7 is currently used in the IMAP e-mail retrieval protocol.
Description
UTF-7 was first standardized as RFC 1642, A Mail-Safe Transformation Format of Unicode. This RFC has been obsoleted by RFC 2152.
Characters below 0x80 (hexadecimal), that is, those within the ASCII range, are encoded as-is, except for the + character. Any character at or above 0x80 is encoded as an escape sequence: a + byte, followed by the character's big-endian UTF-16 representation encoded in Modified Base64 (Base64 without the trailing = padding), and terminated by a - byte (which is consumed) or by a carriage return or line feed (which are not consumed). A literal + character is encoded as +-.
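The rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete RFC 2152 implementation: the function name utf7_encode is hypothetical, every ASCII character other than + is passed through unchanged (the full RFC is stricter about which ASCII characters may appear directly), and the closing - is always written even where it could be omitted.

```python
import base64


def utf7_encode(text: str) -> str:
    """Simplified UTF-7 encoder (hypothetical helper, not the full RFC 2152 rules)."""
    out = []
    run = []  # pending non-ASCII characters waiting to be Base64-encoded

    def flush() -> None:
        # Emit the pending run as "+<Modified Base64>-".
        if run:
            raw = "".join(run).encode("utf-16-be")  # big-endian UTF-16 bytes
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")  # drop "=" padding
            out.append("+" + b64 + "-")  # assumption: always write the closing "-"
            run.clear()

    for ch in text:
        if ord(ch) >= 0x80:
            run.append(ch)  # collect consecutive non-ASCII characters
        else:
            flush()
            out.append("+-" if ch == "+" else ch)  # a literal "+" becomes "+-"
    flush()
    return "".join(out)


print(utf7_encode("Hello, World!"))
print(utf7_encode("1 + 1 = 2"))
print(utf7_encode("£1"))
```

Python also ships a built-in "utf-7" codec, so the output of this sketch can be checked by decoding it with bytes.decode("utf-7").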
Examples
- "Hello, World!" is encoded as "Hello, World!"
- "1 + 1 = 2" is encoded as "1 +- 1 = 2"
- "£1" is encoded as "+AKM-1". The pound sign (£) is code point U+00A3, which UTF-16 represents as the single 16-bit code unit 0x00A3 (binary 0000 0000 1010 0011). Split into 6-bit groups, this converts into Modified Base64 as:
  - 0b000000 = 0 = 'A',
  - 0b001010 = 10 = 'K', and
  - 0b0011[00] = 12 = 'M', where the last two bits of the final 6-bit group are padding.
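The 6-bit grouping in the £ example above can be reproduced step by step. The following is a small sketch for illustration only. The variable names are arbitrary, and the alphabet is the standard Base64 one (A-Z, a-z, 0-9, +, /).

```python
import string

# Standard Base64 alphabet: index 0 is 'A', 10 is 'K', 12 is 'M', and so on.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

code_unit = 0x00A3                  # UTF-16 code unit for the pound sign
bits = format(code_unit, "016b")    # 16 bits as a string of 0s and 1s
bits += "0" * (-len(bits) % 6)      # pad with zero bits up to a multiple of 6
sextets = [bits[i:i + 6] for i in range(0, len(bits), 6)]
encoded = "".join(ALPHABET[int(s, 2)] for s in sextets)

print(sextets)   # the three 6-bit groups
print(encoded)   # the Modified Base64 text placed between "+" and "-"
```

Wrapping the resulting "AKM" in + and - gives the escape sequence "+AKM-" used in the example.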
Example Usage of UTF-7
aankThuklun: BEGIN:VCARD VERSION:2.1 FN;CHARSET=UTF-7:Bod0 N;CHARSET=UTF-7:Bod0 TEL;CELL:+6285291710681 END:VCARD
hg_: @BruceHoult Wiki says 160 7 bit chars; seems unlikely they'd be using UTF-7. 70 UTF-16 chars the limit for a single Unicode SMS message?