Wednesday, February 11, 2004

Character Conversion: Explanations and Definitions

Transfer Encoding Syntax (TES)
A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes.

Typically TES’s are engineered either to:

Avoid particular byte values that would confuse one or more Internet or other transmission/storage protocols: base64, uuencode, BinHex, quoted-printable, etc., or to:
Formally apply various data compressions to minimize the number of bits to be passed down a communication channel: pkzip, gzip, winzip, etc.
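The two kinds of TES listed above can be sketched with Python's standard library. This is only an illustration: the sample string and the choice of UTF-8 as the character encoding scheme are assumptions, not part of the original text.

```python
import base64
import gzip

# Assumed sample text, serialized with an assumed CES (UTF-8).
text = "Grüße, 世界"
ces_bytes = text.encode("utf-8")

# First kind of TES: avoid byte values that could confuse a transport.
# base64 output uses only ASCII letters, digits, '+', '/' and '='.
b64 = base64.b64encode(ces_bytes)
assert all(b < 0x80 for b in b64)

# Second kind of TES: compress the data for the channel.
packed = gzip.compress(ces_bytes)

# Both transforms are reversible: decoding restores the exact CES bytes,
# which can then be decoded back to the original text.
restored_from_b64 = base64.b64decode(b64)
restored_from_gzip = gzip.decompress(packed)
assert restored_from_b64 == ces_bytes
assert restored_from_gzip == ces_bytes
assert restored_from_gzip.decode("utf-8") == text
```

In both cases the TES operates on bytes and knows nothing about the characters they represent; the character encoding scheme is applied before the transform and after its reversal.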
SCSU (and RCSU: see UTR #6: A Standard Compression Scheme for Unicode) should also be conceived of as transfer encoding syntaxes. They should not be considered CES's, in part because the compressed forms are not unique, but depend on the sophistication of the compression algorithm.
The Internet Content-Transfer-Encoding tags "7bit" and "8bit" are special cases. They are data-width specifications, relevant mainly to mail protocols, that appear to predate true TES’s like quoted-printable. Encountering a "7bit" tag doesn’t imply any actual transform of the data; it merely indicates that the charset of the data can be represented in 7 bits and will pass through 7-bit channels, so it is really an indication of the encoding form. In contrast, quoted-printable actually converts various characters (including some ASCII) to forms like "=2D", "=20", etc., and must be reversed on receipt to regenerate legible text in the designated character encoding scheme.
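The contrast between a "7bit" tag and quoted-printable can be illustrated with Python's standard quopri module. This is a minimal sketch; the sample strings and the choice of ISO Latin 1 as the charset are assumptions made for the example.

```python
import quopri

# "7bit": no transform at all. Pure ASCII data already fits in 7 bits.
ascii_bytes = "plain text".encode("ascii")
assert all(b < 0x80 for b in ascii_bytes)

# quoted-printable: an actual transform. Assumed sample text in ISO Latin 1.
text = "café = coffee"
raw = text.encode("iso-8859-1")        # contains the 8-bit byte 0xE9
encoded = quopri.encodestring(raw)

# The 8-bit byte and the ASCII '=' are both rewritten as '=XX' escapes,
# so the encoded form is safe for a 7-bit channel.
assert b"=E9" in encoded               # escaped e-acute
assert b"=3D" in encoded               # escaped '=' (an ASCII character)
assert all(b < 0x80 for b in encoded)

# Reversing the TES on receipt regenerates the bytes in the original charset.
decoded = quopri.decodestring(encoded)
assert decoded == raw
assert decoded.decode("iso-8859-1") == text
```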

8 API Binding
Most APIs are specified in terms of either code units or serialized bytes. Examples of the first are the Java String and char APIs, which use UTF-16 code units, and the C and C++ wchar_t interfaces used with DBCS processing codes. For code units, the byte order of the platform is generally not relevant to the API; the same API can be compiled on platforms with any byte polarity, and will simply expect character data (as for any integral-based data) to be passed to it in the byte polarity of that platform.
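The distinction between code units and their byte serialization can be sketched as follows. The sample text is an assumption; struct is used here only to make the byte order explicit, since Python strings do not expose UTF-16 code units directly.

```python
import struct
import sys

# Assumed sample: two characters, each a single UTF-16 code unit.
text = "A\u4e2d"
code_units = [0x0041, 0x4E2D]   # what a code-unit API works with: integers

# Byte order only appears when the code units are serialized to bytes.
big_endian = struct.pack(">2H", *code_units)
little_endian = struct.pack("<2H", *code_units)
assert big_endian == b"\x00\x41\x4e\x2d"
assert little_endian == b"\x41\x00\x2d\x4e"

# These match the serialized encoding schemes UTF-16BE and UTF-16LE.
assert big_endian == text.encode("utf-16-be")
assert little_endian == text.encode("utf-16-le")

# In memory, a code-unit API simply sees the platform's native polarity.
native = struct.pack("=2H", *code_units)
expected_native = little_endian if sys.byteorder == "little" else big_endian
assert native == expected_native
```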

C and C++ char* APIs use serialized bytes, which could represent a variety of different character maps, including ISO Latin 1, UTF-8, Windows 1252, as well as compound character maps such as Shift-JIS or 2022-JP. A byte API could also handle UTF-16BE or UTF-16LE, which are serialized forms of Unicode. However, such APIs must allow for the existence of any byte value, including zero bytes, and therefore typically use memcpy plus a length instead of strcpy for manipulating strings.
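The reason for preferring memcpy plus a length can be illustrated from Python with ctypes, whose character buffers offer both a NUL-terminated view (.value) and a length-based view (.raw). This is a sketch of the C behaviour rather than actual C code, and the sample string is an assumption.

```python
import ctypes

# Assumed sample: "Hi" serialized as UTF-16LE contains zero bytes.
data = "Hi".encode("utf-16-le")
assert data == b"H\x00i\x00"

# Place the serialized bytes in a C char buffer of exactly that length.
buf = ctypes.create_string_buffer(len(data))
buf.raw = data

# NUL-terminated view, as strcpy or strlen would see it:
# processing stops at the first zero byte and the rest is lost.
as_c_string = buf.value
assert as_c_string == b"H"

# Length-based view, as memcpy plus an explicit length would see it:
# every byte survives, including the zero bytes.
as_counted = buf.raw
assert as_counted == data
assert as_counted.decode("utf-16-le") == "Hi"
```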