Big5 - Definition and Overview

Big-5 or Big5 is a character encoding method used in Taiwan (Republic of China) and Hong Kong for Traditional Chinese characters. Its counterpart in mainland China is the GB family of encodings.

Organization

The original Big5 character set is sorted first by usage frequency, then by stroke count, and finally by Kangxi radical.

The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.

The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure:

First byte ("lead byte") 0xa1 to 0xfe
Second byte 0x40 to 0x7e, 0xa1 to 0xfe
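The byte ranges above can be turned into a small scanner. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and it assumes that any byte outside the lead-byte range is a single-byte character (for example ASCII), which the ranges above do not themselves specify.

```python
def is_lead_byte(b: int) -> bool:
    """True if b falls in the original Big5 lead-byte range."""
    return 0xA1 <= b <= 0xFE


def is_trail_byte(b: int) -> bool:
    """True if b falls in either of the two second-byte ranges."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def split_big5(data: bytes) -> list[bytes]:
    """Split a byte string into one- and two-byte units.

    Assumption: bytes that are not lead bytes are passed through
    as single-byte units.
    """
    units = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            # A lead byte must be followed by a valid second byte.
            if i + 1 >= len(data) or not is_trail_byte(data[i + 1]):
                raise ValueError(f"invalid or truncated sequence at offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units
```

Because second bytes in the 0x40 to 0x7e range overlap printable ASCII, a scanner of this kind has to walk forward from a known character boundary; a byte cannot be classified in isolation.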

Certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte that includes values in the 0x80 to 0xA0 range (similar to Shift JIS).

The numerical values of individual Big5 codes are frequently given as 4-digit hexadecimal numbers, which describe the two bytes that make up the Big5 code as if they were a big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which consists of the bytes 0xa1 0x40, is usually written as 0xa140 or just A140.
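As a small illustration of this notation, the two bytes can be combined into a 16-bit number and back. The helper names below are hypothetical; the last line relies on the big5 codec that ships with Python.

```python
def big5_code(lead: int, trail: int) -> int:
    """Combine two bytes into the conventional 16-bit code (big-endian)."""
    return (lead << 8) | trail


def big5_bytes(code: int) -> bytes:
    """Split a 16-bit code back into its lead and second byte."""
    return code.to_bytes(2, "big")


space = bytes([0xA1, 0x40])           # full-width space
print(f"{big5_code(*space):04X}")     # prints A140
print(big5_bytes(0xA140) == space)    # prints True
print(space.decode("big5") == "\u3000")  # Python's codec maps it to U+3000
```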

A more detailed look at the organization

In the original Big5, the encoding is compartmentalized into different zones:

0xa140 to 0xa3bf "Graphical characters" 圖形碼
0xa3c0 to 0xa3fe Reserved for user-defined characters 造字
0xa440 to 0xc67e Frequently used characters 常用字
0xc6a1 to 0xc8fe Reserved for user-defined characters
0xc940 to 0xf9d5 Less frequently used characters 次常用字
0xf9d6 to 0xfefe Reserved for user-defined characters
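The zone table lends itself to a simple lookup. The sketch below uses hypothetical names and compares only the numeric 16-bit code against the ranges listed above; it does not check that the second byte is itself valid.

```python
# (first code, last code, zone name), taken from the list above
ZONES = [
    (0xA140, 0xA3BF, "graphical characters"),
    (0xA3C0, 0xA3FE, "reserved for user-defined characters"),
    (0xA440, 0xC67E, "frequently used characters"),
    (0xC6A1, 0xC8FE, "reserved for user-defined characters"),
    (0xC940, 0xF9D5, "less frequently used characters"),
    (0xF9D6, 0xFEFE, "reserved for user-defined characters"),
]


def big5_zone(code: int):
    """Return the zone name for a 16-bit Big5 code, or None if outside all zones."""
    for first, last, name in ZONES:
        if first <= code <= last:
            return name
    return None


print(big5_zone(0xA140))  # graphical characters
print(big5_zone(0xA440))  # frequently used characters
```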

The "graphical characters" actually comprise punctuation marks, partial punctuation marks (e.g., half of a dash, half of an ellipsis; see below), dingbats, foreign characters, and other special characters (e.g., presentational "full width" forms, digits for Suzhou numerals, and zhuyin fuhao).

In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters. The various zones for user-defined characters are normally regarded as associated with the preceding zone; for example, additional "graphical characters" (e.g., punctuation marks) would be placed in the 0xa3c0–0xa3fe range, and additional ideograms would be placed in either the 0xc6a1–0xc8fe or the 0xf9d6–0xfefe range.

What a Big5 code actually encodes

Contrary to popular belief, an individual Big5 code does not always represent a complete semantic unit. The Big5 codes of ideograms are always ideograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters, or parts of characters, that happen to fit in the space taken by two monospaced ASCII characters.

To illustrate this point, consider the Big5 code 0xa14b (…). To English speakers this looks like an ellipsis; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters (……), so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code 0xa14b just represents half of an ellipsis.
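The ellipsis case can be checked in code. This is a sketch that assumes Python's built-in big5 codec, which maps 0xa14b to U+2026 (HORIZONTAL ELLIPSIS); other converters may use a different mapping table.

```python
half = bytes.fromhex("a14b")   # one Big5 code: three dots, one cell wide
full = half * 2                # Chinese ellipsis: six dots, two cells wide

text = full.decode("big5")
print(len(full))   # 4 bytes, i.e. two Big5 codes
print(len(text))   # 2 characters after decoding
```

A full Chinese ellipsis therefore occupies two Big5 codes, and a program that treats one code as one punctuation mark will count it twice.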

A more striking illustration can be made by considering the Big5 codes 0xa1ca (﹋) and 0xa1cb (﹌). The Unicode Standard names these characters "wavy overline" and "double wavy overline". However, in a Big5 font the two are in fact halves of a longer wavy pattern (﹋﹌) that forms the Chinese "citation mark"; two codes are required only because the wavy pattern that forms this variable-length punctuation mark cannot be cut neatly into identical halves, given the practical restriction that each half had to be legible in a 16-by-16 grid.

Characters encoded in Big5 do not always represent things that can be readily used in plain text files. One example is the above-mentioned "citation mark", which is required to be typeset under the citation of literary works. Another is the "Suzhou numerals", a form of scientific notation that requires the number to be laid out in a two-dimensional form consisting of at least two rows.

Name

Big5's Chinese name, 五大碼 (pinyin: wǔdà mǎ), means "Big Five Encoding". The name is said to refer either to the original design goal of supporting the five major software packages used in Taiwan at the time, or to the five leading computer companies in Taiwan that collaborated to develop the code: 宏碁 (hóng qí; Acer, http://www.acer.com.tw), 神通 (shén tōng; MiTAC, http://www.mitac.com.tw), 佳佳 (jiā jiā; English name unknown), 零壹 (líng yī; Zero One, http://www.zerone.com.tw), and 大眾 (dà zhòng; FIC, http://www.fic.com.tw).

The English name of the encoding, "Big5", was subsequently (mistakenly) translated back to Chinese from English as 大五碼 (dàwǔ mǎ). Both Chinese names are now in use.

History

The Big5 encoding was defined by the Institute for Information Industry of Taiwan in 1984. According to some accounts, Big5 was popularized by its adoption in several commercial software packages, especially the ET Chinese system which ran on MS-DOS.

The Republic of China government declared it a national standard in the mid-1980s, since Big5 was already the de facto standard by that time.

Hong Kong also adopted Big5 for character encoding. However, Cantonese uses many archaic Chinese characters that were not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created two Big5 extensions: the Government Chinese Character Set in 1995 and the Hong Kong Supplementary Character Set (HKSCS) in 1999. The Hong Kong extensions are commonly distributed as a patch.

References

  • Lunde, Ken (1999). CJKV Information Processing. First Edition. O'Reilly and Associates, Inc. ISBN 1565922247.

This article is licensed under the GNU Free Documentation License. It uses material from the corresponding Wikipedia article.