Standard Compression Scheme for Unicode - Definition and Overview
The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent text, especially text that draws most of its characters from a small number of Unicode blocks. It does so by dynamically mapping the byte values 128-255 to windows of 128 consecutive characters. Since most alphabets fit within 128 contiguous Unicode code points, this allows many text files to be encoded at 1 byte per character (plus overhead). SCSU can also switch to UTF-16 internally to handle non-alphabetic scripts.
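The window mapping described above can be illustrated with a toy sketch. This is not a conformant SCSU encoder: it assumes a single, fixed window and leaves out the tag bytes that real SCSU uses to define and switch windows, as well as the UTF-16 mode. The function names and the `window_start` parameter are made up for this illustration.

```python
WINDOW_SIZE = 128  # each window covers 128 consecutive code points


def encode_with_window(text, window_start):
    """Toy single-window encoder (hypothetical helper, not real SCSU).

    ASCII characters are written as themselves; characters inside the
    window are written as one byte in the range 128-255.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)  # ASCII passes through unchanged
        elif window_start <= cp < window_start + WINDOW_SIZE:
            out.append(0x80 + (cp - window_start))  # offset into the window
        else:
            # Real SCSU would switch or redefine a window here.
            raise ValueError(f"U+{cp:04X} is outside the active window")
    return bytes(out)


def decode_with_window(data, window_start):
    """Inverse of encode_with_window for the same window."""
    return "".join(
        chr(b) if b < 0x80 else chr(window_start + (b - 0x80)) for b in data
    )


sample = "Привет, мир"          # Cyrillic text with ASCII punctuation
cyrillic_window = 0x0400        # start of the Cyrillic block
encoded = encode_with_window(sample, cyrillic_window)
print(len(encoded), len(sample.encode("utf-8")))
```

With the window placed at U+0400, every character in the sample costs one byte, whereas UTF-8 spends two bytes on each Cyrillic letter.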
SCSU is not a resounding success. Few applications need to compress enough Unicode text to justify a poorly supported compression scheme. Treated purely as a compression format, it is inferior to most commonly used compression programs for texts longer than a few kilobytes. It can be used as a text encoding, but it is very hard to handle internally, and its percentage savings over UTF-16 or UTF-8 drop after external compression, dramatically so with bzip2 and other modern compression schemes. SCSU does have one advantage: it can compress texts that are only a few characters long, whereas most full-scale compressors need a few kilobytes of data to overcome their overhead.
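The overhead of a general-purpose compressor on very short input can be seen with a small experiment. The sketch below uses Python's `zlib` module as a stand-in for "full-scale compressors"; the sample strings are invented, and the long one is deliberately repetitive, so it overstates what a compressor achieves on ordinary prose.

```python
import zlib


def sizes(text):
    """Return (UTF-8 size, zlib-compressed size) in bytes for text."""
    raw = text.encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


# A six-letter word: the zlib header and checksum outweigh any savings.
short_raw, short_packed = sizes("Привет")

# Several kilobytes of (artificially repetitive) text: compression pays off.
long_raw, long_packed = sizes("Привет, мир. " * 500)

print("short:", short_raw, "->", short_packed)
print("long: ", long_raw, "->", long_packed)
```

For the six-letter input the compressed form is larger than the original, which is the situation where a one-byte-per-character scheme such as SCSU still helps.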
Reuters, the organization that floated the first draft of SCSU, is believed to use SCSU internally.