Wednesday, February 11, 2004

GH Character Sets

Character sets are standards that map specific real-world characters (code elements) to specific numbers (code points) and then encode those code points as bytes (character encoding or encoding schemes). EG:

The Yen code element is ¥.
The Yen's code point in different code point systems:
ASCII: out of range.
Windows ANSI 1252: 165 = 0xA5.
ISO Latin-1 (iso-8859-1): 165 = 0xA5.
Unicode/UCS-2: 165 = 0xA5 = U+00A5.
Unicode/UCS-4: 165 = 0xA5 = U+000000A5.
The Yen is character encoded this way in different encoding schemes:
ASCII: out of range.
Windows ANSI 1252: 0xA5 = 10100101.
ISO Latin-1 (iso-8859-1): 0xA5 = 10100101.
UTF-8: 11000010 10100101.
UTF-16 (big-endian): 00000000 10100101.
UTF-32 (big-endian): 00000000 00000000 00000000 10100101.
You may also want to see my articles on Typography.
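
The Yen encodings listed above can be checked with a few lines of Python. This is only a minimal sketch using Python's built-in codecs: the helper name `to_bits` is made up for illustration, and the big-endian UTF-16 and UTF-32 codecs (without a byte order mark) are an assumption chosen because they match the byte order shown in the list.

```python
# Encode the Yen sign (U+00A5) under several encoding schemes and
# show each result as groups of 8 bits.
YEN = "\u00a5"


def to_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit groups (illustrative helper)."""
    return " ".join(f"{byte:08b}" for byte in data)


# Python codec names; the "-be" variants are an assumption (big-endian, no BOM).
ENCODINGS = ["cp1252", "iso-8859-1", "utf-8", "utf-16-be", "utf-32-be"]

results = {name: to_bits(YEN.encode(name)) for name in ENCODINGS}
for name, bits in results.items():
    print(f"{name}: {bits}")

# ASCII has no code point for the Yen sign, so encoding must fail.
try:
    YEN.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ascii:", "encodable" if ascii_ok else "out of range")
```

The single-byte code pages produce one byte, UTF-8 needs two bytes for this code point, and the UTF-16 and UTF-32 forms simply pad the same value with leading zero bytes.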

Once a set of bits has been mapped to a code point, that code point can be mapped to its code element character, and software can then present that character and apply different fonts, styles, and sizes. EG:

In the ASCII character map the code element J has a code point of 74 (x4A). This code point is encoded as the binary number 1001010 (which is x4A). The binary code can then be interpreted by programs such as browsers and presented to users:

Formatting Applied Example
Font Comic J
Font Wingdings J (normally a smiley face)
Style Underlined J
Style Italic J
Size HTML Size 6 J
Size HTML Size 1 J
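
The chain from code element to code point to bits for the letter J can be walked in both directions. This is a small sketch rather than anything a browser literally runs, and the variable names are my own:

```python
# Code element -> code point -> bits, and back again, for the letter J.
char = "J"

code_point = ord(char)          # code element to code point
as_hex = f"{code_point:X}"      # the same number in hexadecimal
as_bits = f"{code_point:07b}"   # the same number as 7 binary digits

encoded = char.encode("ascii")      # code point to a stored byte
decoded = encoded.decode("ascii")   # stored byte back to the character

print(code_point, as_hex, as_bits, encoded, decoded)
```

Fonts, styles, and sizes are applied later by the presenting program; none of them change the stored byte.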


Choosing a character set is important, especially if you have to deal with international code, multiple platforms, or databases. EG:

When SQL Server is installed, a Sort Order ID must be set, and that ID is based on a character set. If you try to restore a database to that installation, the database must have the same Sort Order ID. Otherwise you may have to rebuild the database, using something like rebuildm.exe.

Here are some of the major character sets:
ASCII (American Standard Code for Information Interchange, aka Standard ASCII; plain text, ISO-646). An SBCS (Single Byte Character Set) that uses 7 bits of a byte (0-127; x0-x7F) to encode 95 printable characters (including the space) and 33 control characters. The 8th bit was often used for parity checking. Most of the prevalent character sets are based on ASCII. Used by Linux/Unix.
SBCSs that use the 8th bit are called "High ASCII" since they encode additional characters by using the bit above the ASCII range. Here are the SBCSs that use all 8 bits of a byte (0-255; x0-xFF) to encode up to 256 characters. There are different character sets (aka code pages) for different uses, languages, or language groups. Many of these character sets are supersets of ASCII.
OEM character sets (Original Equipment Manufacturer). The upper half (the codes that use the 8th bit) often contained line-drawing characters as a carryover from pre-GUI (Graphical User Interface) days. Used by DOS, OS/2, floppy disks, and the FAT file system (File Allocation Table).
ISO character sets (International Organization for Standardization). ISO Latin 1 (iso-8859-1) is good for most Western languages, but I discuss it on my ANSI page because it is closely tied to Windows ANSI 1252. Used by the Mac OS.
ANSI character sets (American National Standards Institute). Used by the Windows 3.x/9x OS.
Unicode, aka UCS (Universal Character Set). An MBCS (Multi-Byte Character Set) with 2-4 bytes' worth of possible code points. The code points may be encoded using 1-6 bytes per character (the original UTF-8 design allowed up to 6 bytes; RFC 3629 restricts it to 4). Unicode is the sensible international and cross-platform character set. Used by Windows NT/2000 and Linux/Unix.
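
To see why the choice of code page matters, here is a rough sketch that decodes one and the same byte under several single-byte character sets, then shows how UTF-8 spends a variable number of bytes per character. The codec names are Python's own, and the particular byte and sample characters are simply ones I picked for illustration:

```python
# One byte, several single-byte code pages: the byte stays the same,
# the character it stands for does not.
RAW = bytes([0xA5])
CODE_PAGES = ["cp437", "cp850", "cp1252", "iso-8859-1"]

decoded = {cp: RAW.decode(cp) for cp in CODE_PAGES}
for cp, ch in decoded.items():
    print(f"{cp}: {ch!r} (U+{ord(ch):04X})")

# A multi-byte encoding such as UTF-8 uses more bytes for higher code points.
SAMPLES = ["J", "\u00a5", "\u20ac"]  # J, Yen sign, Euro sign
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in SAMPLES}
for ch, length in utf8_lengths.items():
    print(f"U+{ord(ch):04X}: {length} byte(s) in UTF-8")
```

Under the OEM code page 437 the byte is a capital N with tilde, while under Windows ANSI 1252 and ISO Latin 1 it is the Yen sign, which is exactly the kind of mismatch that bites when data moves between platforms.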
Here is a summary table of the major character sets:

Character Set Bits Max Decimal Max Hexadecimal
ASCII 7/8 127 7F
High ASCII 8 255 FF
Unicode, UCS-2 16 65,535 FFFF
UCS-4 31 2,147,483,647 7FFF FFFF
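
The limits for each bit width follow from simple arithmetic: n bits give 2^n distinct values, numbered 0 through 2^n - 1. A quick sketch (the dictionary names are mine):

```python
# n bits give 2**n distinct values, so the highest code point is 2**n - 1.
BIT_WIDTHS = {"ASCII": 7, "High ASCII": 8, "UCS-2": 16, "UCS-4": 31}

limits = {name: (1 << bits) - 1 for name, bits in BIT_WIDTHS.items()}

for name, top in limits.items():
    count = top + 1
    print(f"{name}: {count:,} values, highest is {top:,} = x{top:X}")
```

Note the off-by-one trap: 16 bits give 65,536 values, but the highest of them is 65,535 (xFFFF) because counting starts at zero.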

Here are some of the available code page identifiers used by Windows:

Identifier Meaning OEM/ANSI Comment
037 EBCDIC Used in mainframes, esp. IBM.
437 MS-DOS United States OEM IBM DOS and OS/2.
Aka: IBM PC Extended Character Set; Extended ASCII; High ASCII; 437 U.S. English.
500 EBCDIC "500V1"
708 Arabic (ASMO 708) OEM
709 Arabic (ASMO 449+, BCON V4) OEM
710 Arabic (Transparent Arabic) OEM
720 Arabic (Transparent ASMO) OEM
737 Greek (formerly 437G) OEM
775 Baltic OEM
850 MS-DOS Multilingual (Latin I) OEM Standard MS DOS.
Aka: 850 Multilingual.
852 MS-DOS Slavic (Latin II) OEM
855 IBM Cyrillic (primarily Russian) OEM
857 IBM Turkish OEM
860 MS-DOS Portuguese OEM
861 MS-DOS Icelandic OEM
862 Hebrew OEM
863 MS-DOS Canadian-French OEM
864 Arabic OEM
865 MS-DOS Nordic OEM
866 MS-DOS Russian OEM
869 IBM Modern Greek OEM
874 Thai OEM/ANSI
875 EBCDIC
932 Japanese OEM/ANSI
936 Chinese (PRC, Singapore; Simplified) OEM/ANSI
949 Korean OEM/ANSI
950 Chinese (Taiwan; Hong Kong SAR, PRC; Traditional) OEM/ANSI
1026 EBCDIC
1200 Unicode (BMP of ISO 10646) ANSI Windows NT/2000 and HTML.
Aka: ISO-10646; UCS;
Unicode (UTF-7): utf-7; csUnicode11UTF7, unicode-1-1-utf-7, x-unicode-2-0-utf-7; 65000.
Unicode (UTF-8): utf-8; unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8; 65001.

1250 Windows 3.1 Eastern European ANSI
1251 Windows 3.1 Cyrillic ANSI
1252 Windows 3.1 US (ANSI) ANSI Windows 3.x/9x, Macs, and HTML.
ANSI comes in two versions (the difference is found at decimal 128-159 (hexadecimal 80-9F)):
Windows ANSI. Aka: Western European (Windows); windows-1252; US/Western European; Western.
ISO Latin 1 ANSI. Aka: Western European (ISO); iso-8859-1; ANSI_X3.4-1968; ANSI_X3.4-1986; ascii; cp367; cp819; csASCII; IBM367; ibm819; iso-ir-100; iso-ir-6; ISO646-US; iso8859-1; ISO_646.irv:1991; iso_8859-1; iso_8859-1:1987; latin1; us; us-ascii; x-ansi; iso-latin-1.

1253 Windows 3.1 Greek ANSI
1254 Windows 3.1 Turkish ANSI
1255 Hebrew ANSI
1256 Arabic ANSI
1257 Baltic ANSI
1258 Vietnamese
1361 Korean (Johab) OEM
10000 Macintosh Roman
10001 Macintosh Japanese
10006 Macintosh Greek I
10007 Macintosh Cyrillic
10029 Macintosh Latin 2
10079 Macintosh Icelandic
10081 Macintosh Turkish
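
The alias lists and the 128-159 (x80-x9F) difference noted under code page 1252 can be explored from Python. Treat this as a sketch only: it relies on Python's own codec registry, whose alias list is not the same as the Windows one in the table, and the names `canonical`, `same_low`, and `same_high` are just mine.

```python
import codecs

# Several aliases resolve to the same underlying codec.
ALIASES = ["windows-1252", "cp1252", "iso-8859-1", "latin1", "us-ascii"]
canonical = {alias: codecs.lookup(alias).name for alias in ALIASES}
for alias, name in canonical.items():
    print(f"{alias} -> {name}")

# Windows ANSI 1252 and ISO Latin 1 agree on x00-x7F and xA0-xFF ...
low = bytes(range(0x00, 0x80))
high = bytes(range(0xA0, 0x100))
same_low = low.decode("cp1252") == low.decode("iso-8859-1")
same_high = high.decode("cp1252") == high.decode("iso-8859-1")

# ... and differ in x80-x9F: printable characters versus control codes.
euro_1252 = b"\x80".decode("cp1252")
c1_latin1 = b"\x80".decode("iso-8859-1")
print(same_low, same_high, repr(euro_1252), repr(c1_latin1))
```

In other words, a page labeled iso-8859-1 that actually contains Windows 1252 bytes will look fine until it hits a character such as the Euro sign or a curly quote.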

Note that the alias in bold is the preferred charset ID for the HTML tag:



Characters are often encoded in plain text documents such as HTML and XML using either an NCR (Numeric Character Reference) or a CER (Character Entity Reference). CERs use symbolic names so that authors need not remember code points. EG: For the Yen character (¥), the NCR is either &#165; (decimal) or &#xA5; (hexadecimal), while the CER is &yen;.
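
The two reference styles for the Yen sign can be built and decoded with Python's standard `html` module. This is a minimal sketch; the variable names are my own, and real pages would of course mix such references into surrounding markup.

```python
import html
from html.entities import codepoint2name

YEN = "\u00a5"
code_point = ord(YEN)

# Build the three equivalent references for the same code point.
decimal_ncr = f"&#{code_point};"            # numeric, decimal
hex_ncr = f"&#x{code_point:X};"             # numeric, hexadecimal
cer = f"&{codepoint2name[code_point]};"     # symbolic entity name

# All three decode back to the same character.
decoded = [html.unescape(ref) for ref in (decimal_ncr, hex_ncr, cer)]
print(decimal_ncr, hex_ncr, cer, decoded)
```

Because each reference is written in plain ASCII, it survives whatever character set the document itself is saved in.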