Wednesday, February 11, 2004

About character encodings

A character encoding maps each character in a character set to a numeric value that can be represented by a computer. These numbers can be represented by a single byte or multiple bytes. For example, the ASCII encoding uses seven bits to represent the Latin alphabet, punctuation, and control characters.
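As a rough illustration of this mapping, the following Python sketch converts a short string to its numeric values and then to ASCII bytes. The choice of Python and the sample string are assumptions made for this example; they do not come from the original text.

```python
# Sketch: map characters to numeric values, then to ASCII bytes.
text = "Hi!"  # sample string chosen for illustration

# ord() returns the numeric value assigned to each character.
code_points = [ord(ch) for ch in text]

# Encoding as ASCII stores each value in a single byte.
encoded = text.encode("ascii")

print(code_points)    # [72, 105, 33]
print(list(encoded))  # [72, 105, 33]

# Every ASCII value fits in seven bits (0-127).
fits_in_seven_bits = all(value < 128 for value in encoded)
print(fits_in_seven_bits)  # True
```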

You use Japanese encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP, to represent Japanese text. These encodings can vary slightly, but they include a common set of approximately 10,000 characters used in Japanese.
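To see how these encodings differ in practice, the sketch below encodes the same Japanese text with each of them. It assumes Python's standard codec names (`shift_jis`, `euc_jp`, `iso2022_jp`); the sample text is also an assumption chosen for illustration.

```python
# Sketch: encode the same Japanese text with three encodings.
text = "日本語"  # sample text ("Japanese language"), an assumption for this example

encoded = {
    name: text.encode(name)
    for name in ("shift_jis", "euc_jp", "iso2022_jp")
}

for name, data in encoded.items():
    # Print the byte count and the raw bytes in hexadecimal.
    print(f"{name}: {len(data)} bytes, {data.hex()}")

# Each byte sequence decodes back to the same characters.
round_trips = all(data.decode(name) == text for name, data in encoded.items())
print(round_trips)
```

The characters are identical in all three cases, but the bytes are not. ISO-2022-JP additionally wraps the text in escape sequences, so its output is longer than the other two.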

The following terms apply to character encodings:

SBCS: Single-byte character set; a character set encoded in one byte per character, such as ASCII or ISO 8859-1.
DBCS: Double-byte character set; a method of encoding a character set in no more than two bytes per character, such as Shift-JIS. Many character encoding schemes that are referred to as double-byte, including Shift-JIS, allow mixing of single-byte and double-byte encoded characters. Others, such as UCS-2, use two bytes for all characters.
MBCS: Multiple-byte character set; a character set encoded with a variable number of bytes per character, such as UTF-8.
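
The three terms above can be checked by counting the bytes each character occupies. This is a minimal sketch: `bytes_per_char` is a hypothetical helper written for this example, and because Python has no codec named UCS-2, `utf-16-be` stands in for it here (the two coincide for the characters used).

```python
def bytes_per_char(text, encoding):
    """Return the number of bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]

# Sample characters, written as escapes so the source file's own encoding
# does not matter: "A", U+00E9 (e with acute accent), U+3042 (hiragana "a").
sbcs = bytes_per_char("A\u00e9", "iso-8859-1")       # one byte per character
dbcs_mixed = bytes_per_char("A\u3042", "shift_jis")  # one or two bytes
dbcs_fixed = bytes_per_char("A\u3042", "utf-16-be")  # always two bytes (UCS-2 stand-in)
mbcs = bytes_per_char("A\u00e9\u3042", "utf-8")      # variable number of bytes

print(sbcs)        # [1, 1]
print(dbcs_mixed)  # [1, 2]
print(dbcs_fixed)  # [2, 2]
print(mbcs)        # [1, 2, 3]
```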
The following table lists some common character encodings; browsers and web servers support many others:

| Encoding | Type | Description |
|---|---|---|
| ASCII | SBCS | 7-bit encoding used for English and Indonesian (Bahasa Indonesia) |
| Latin-1 (ISO 8859-1) | SBCS | 8-bit encoding used for many Western European languages |
| Shift_JIS | DBCS | Japanese encoding that uses up to 16 bits per character. (In CFML attributes, the name must use an underscore character (_), not a hyphen (-).) |
| EUC-KR | DBCS | Korean encoding that uses up to 16 bits per character |
| UCS-2 | DBCS | Two-byte Unicode encoding |
| UTF-8 | MBCS | Multibyte Unicode encoding. ASCII characters take one byte; non-ASCII characters used in European and many Middle Eastern languages take two bytes; most Asian characters take three bytes |


The World Wide Web Consortium maintains a list of the character encodings used on the Internet. You can find this information at www.w3.org/International/O-charset.html.

Computers often must convert between character encodings. In particular, the character encodings most commonly used on the Internet are not the ones that Java and Windows use internally. Character sets used on the Internet are typically single-byte or multiple-byte (including DBCS character sets that allow single-byte characters). These character sets are efficient for transmitting data, because each character takes up only as many bytes as it needs. Currently, Latin characters are the most frequently used characters on the web, and most character encodings used on the web represent those characters in a single byte.
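
A conversion of this kind can be sketched in two steps: decode the incoming bytes to text, then encode the text in the target encoding. `transcode` is a hypothetical helper written for this example, and the sample text and encodings are assumptions.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded bytes from one character encoding to another."""
    # Decode to Python's internal text representation, then re-encode.
    return data.decode(source).encode(target)

# Bytes as they might arrive from a Shift_JIS web page.
shift_jis_bytes = "日本語".encode("shift_jis")

utf8_bytes = transcode(shift_jis_bytes, "shift_jis", "utf-8")

# The same three characters occupy a different number of bytes in each encoding.
print(len(shift_jis_bytes), len(utf8_bytes))  # 6 9
```

Decoding with the wrong source encoding either raises an error or silently produces the wrong characters, so the source encoding must be known before converting.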

Computers, however, process data most efficiently if each character occupies the same number of bytes. Therefore, Windows and Java both use a double-byte Unicode encoding for internal processing.
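
One reason fixed-width storage is convenient is that the position of the nth character can be computed directly instead of being found by scanning. The sketch below uses `utf-16-le` as an example of a two-byte encoding; `char_at` is a hypothetical helper, and the approach assumes every character fits in two bytes, which does not hold for supplementary Unicode characters.

```python
# Sample text written as escapes: "A", U+00E9, U+65E5.
text = "A\u00e9\u65e5"

fixed = text.encode("utf-16-le")  # two bytes for each of these characters
variable = text.encode("utf-8")   # one, two, and three bytes respectively

def char_at(data: bytes, index: int) -> str:
    """Return the character at the given index of two-byte-per-character data."""
    start = 2 * index  # the offset is computed directly, with no scanning
    return data[start:start + 2].decode("utf-16-le")

third = char_at(fixed, 2)
print(third == "\u65e5")  # True

# In UTF-8 the widths vary, so the offset of a character depends on
# every character that precedes it.
utf8_widths = [len(ch.encode("utf-8")) for ch in text]
print(utf8_widths)  # [1, 2, 3]
```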