This file specifies a reference set of Label Generation Rules for Chinese using a limited repertoire as appropriate for a second level domain.
Unlike for the Japanese and Korean repertoires, there is no single, well-established standard on which to base an LGR repertoire for Chinese. Chinese repertoires have been created independently for Traditional and Simplified Chinese. However, the need to harmonize the processing of simplified and traditional variants has led to common repertoires in the context of IDNA.
The repertoires defined by the .cn registry [700] and the .tw registry [701] are identical and contain 19,520 Han ideographs from two Unicode blocks: CJK UNIFIED IDEOGRAPHS and CJK UNIFIED IDEOGRAPHS EXTENSION A. They include the core repertoires for both simplified and traditional Chinese expressed as follows:
G0: GB2312-80
G1: GB12345-90 (with one exception, see below)
HB0: Big-5: Computer Chinese Glyph and Character Code Mapping Table, Technical Report C-26, 1984
HB1: Big-5, Level-1
HB2: Big-5, Level-2
T1: TCA-CNS 11643-1992 1st plane
T2: TCA-CNS 11643-1992 2nd plane
Note: The repertoires of [700] and [701] do not include U+9DC0. It is part of the G1 set, where it should have been the traditional variant of U+9E5A, but it has commonly been replaced by U+9DBF in that role.
Because the repertoire defined by [700] and [701] does not fully cover the needs of some Chinese constituencies such as Hong Kong SAR and Singapore, a new IDNA repertoire was created by DotAsia (.asia registry) for Chinese use (ZH) [702]. It adds a further 163 Han ideographs: 156 are part of the Hong Kong Supplementary Character Set (HKSCS), a set that is also included in the [IICORE] collection; 4 are GS (Singapore Characters); and the remaining 3 are part of various other Chinese sources, but necessary to ensure full transitivity in variant processing.
Note: This LGR contains 62 code points from the block CJK UNIFIED IDEOGRAPHS EXTENSION B, included as part of the 156 characters from HKSCS.
Altogether, the Han repertoire of this LGR is made up of 19,684 code points: 19,683 (19,520+163) from .asia ZH [702], plus U+9DC0.
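The totals above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the constant and function names are hypothetical, the counts are copied from the text, and the block ranges are the inclusive Unicode block boundaries for the three blocks named in this description.

```python
from typing import Optional

# Counts quoted in the text (names are illustrative, not part of the LGR).
CN_TW_REPERTOIRE = 19_520  # shared .cn / .tw repertoire [700], [701]
ASIA_ADDITIONS = {"HKSCS": 156, "GS (Singapore)": 4, "other sources": 3}
EXTRA_CODE_POINTS = 1      # U+9DC0, added by this LGR

asia_total = CN_TW_REPERTOIRE + sum(ASIA_ADDITIONS.values())
lgr_han_total = asia_total + EXTRA_CODE_POINTS

# Inclusive code point ranges of the Unicode blocks named in the text.
BLOCKS = {
    "CJK UNIFIED IDEOGRAPHS EXTENSION A": (0x3400, 0x4DBF),
    "CJK UNIFIED IDEOGRAPHS": (0x4E00, 0x9FFF),
    "CJK UNIFIED IDEOGRAPHS EXTENSION B": (0x20000, 0x2A6DF),
}


def block_of(cp: int) -> Optional[str]:
    """Return the name of the listed block containing cp, or None."""
    for name, (first, last) in BLOCKS.items():
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    print(asia_total, lgr_han_total)  # 19683 19684
    print(block_of(0x9DC0))           # CJK UNIFIED IDEOGRAPHS
```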
Unlike many other non-Latin 2nd level reference LGRs, the Chinese LGR includes the basic ASCII Latin set (a to z) because it is common practice in Chinese text to mix Han and ASCII. Including it therefore creates no confusability or additional security risk in the context of a second level LGR for the Chinese language. It is also supported by current IDNA practice; see [700], [701], and [702].
None.
None.
The variant set is based on the set defined in the .asia ZH set with some minor adjustments:
Addition of blocked variant U+58DC to U+7F4E and blocked variant U+9771 to U+976D to ensure symmetry.
Addition of blocked variant mappings to U+9DC0 from both U+9E5A and U+9DBF.
Addition of a variant mapping from U+9DC0 to itself (traditional), to U+9E5A (simplified), and to U+9DBF (blocked).
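The symmetry requirement behind these adjustments can be illustrated with a small check. The sketch below encodes only the U+9DC0 mappings listed above; the table layout, the type labels ("traditional", "simplified", "blocked") and the function name are illustrative assumptions rather than the identifiers used in the LGR itself. Checking transitivity would additionally need the existing mappings between U+9E5A and U+9DBF, which are not listed in this section.

```python
# (source, target) -> type label, for the U+9DC0 mappings described above.
# The type labels are descriptive placeholders, not the LGR's type names.
MAPPINGS = {
    (0x9E5A, 0x9DC0): "blocked",
    (0x9DBF, 0x9DC0): "blocked",
    (0x9DC0, 0x9DC0): "traditional",
    (0x9DC0, 0x9E5A): "simplified",
    (0x9DC0, 0x9DBF): "blocked",
}


def missing_inverses(mappings):
    """Return the (source, target) pairs needed to make the table symmetric."""
    return sorted(
        (target, source)
        for (source, target) in mappings
        if (target, source) not in mappings
    )


if __name__ == "__main__":
    # An empty list means every mapping has an inverse.
    print(missing_inverses(MAPPINGS))
```

The types of a mapping and its inverse may differ, as they do here; symmetry only requires that a mapping exists in both directions.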
Each variant mapping has been assigned a type from the following LGR-specific set of types:
(See Actions below).
This LGR defines no named character classes.
Common rules only:
Hyphen Restrictions — restrictions on the allowable placement of hyphens (no leading or trailing hyphen and no hyphen in positions 3 and 4). These restrictions are described in section 4.2.3.1 of RFC5891 [120]. They are implemented here as a context rule on U+002D (-) HYPHEN-MINUS.
Leading Combining Marks — restrictions on the allowable placement of combining marks (no leading combining mark). This rule is described in section 4.2.3.2 of RFC5891 [120].
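The two common rules above can be sketched as simple label checks. This is a minimal illustration, not the context-rule implementation used in the LGR: the function names are hypothetical, the hyphen check follows the RFC 5891 wording (no "--" in the third and fourth character positions), and combining marks are approximated by Unicode general category M (Mn, Mc, Me).

```python
import unicodedata


def hyphen_placement_ok(label: str) -> bool:
    """Reject a leading or trailing hyphen and "--" in positions 3 and 4."""
    if label.startswith("-") or label.endswith("-"):
        return False
    return label[2:4] != "--"


def no_leading_combining_mark(label: str) -> bool:
    """Reject labels whose first code point is a combining mark."""
    if not label:
        return True
    return not unicodedata.category(label[0]).startswith("M")


if __name__ == "__main__":
    for candidate in ["ab-cd", "-abc", "abc-", "ab--cd", "\u0301abc"]:
        print(
            repr(candidate),
            hyphen_placement_ok(candidate),
            no_leading_combining_mark(candidate),
        )
```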
Actions include the default actions for LGRs as well as those needed to invalidate labels with misplaced combining marks.
Chinese-specific actions that are triggered by the LGR-specific variant types described above limit the "allocatable" variant labels to those that are either fully simplified or fully traditional labels. In addition, these actions return a disposition of "valid" for any original label, even those that are mixed between simplified and traditional (see also [RFC3743] and [RFC4713]). To account for original labels, reflexive variant mappings with an "r-" prefix are used. (See [RFC7940].)
Note: there is no action explicitly triggered by variant type "r-neither". Instead, it is implicitly handled by the "catch-all" action. Its main benefit is in explicitly documenting the status of the code point.
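The effect of these actions can be approximated as follows. This is a hedged sketch and not the LGR's actual action list: apart from "r-neither", the type names ("simp", "trad", "both", "blocked", "r-simp", "r-trad", "r-both") are assumptions made for illustration, as is the function name. The input is the collection of variant types used to produce a candidate label; code points without variant mappings contribute no type.

```python
# Assumed type names; this section names only "r-neither" explicitly.
FULLY_SIMPLIFIED = {"simp", "both", "r-simp", "r-both"}
FULLY_TRADITIONAL = {"trad", "both", "r-trad", "r-both"}
NON_REFLEXIVE = {"simp", "trad", "both"}


def disposition(variant_types):
    """Return the disposition of a label given the variant types it uses."""
    types = set(variant_types)
    if "blocked" in types:
        return "blocked"
    if types & NON_REFLEXIVE:
        # A true variant label: allocatable only if it is fully simplified
        # or fully traditional, otherwise blocked.
        if types <= FULLY_SIMPLIFIED or types <= FULLY_TRADITIONAL:
            return "allocatable"
        return "blocked"
    # Catch-all: only reflexive types (including "r-neither") remain,
    # so this is the original label.
    return "valid"


if __name__ == "__main__":
    print(disposition(["r-simp", "r-trad"]))  # valid (mixed original label)
    print(disposition(["simp", "r-simp"]))    # allocatable
    print(disposition(["simp", "trad"]))      # blocked
```

In this sketch "r-neither" never appears in an explicit test; it simply falls through to the final return, mirroring the catch-all behavior described in the note above.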
This reference LGR for Chinese for the 2nd Level has been developed by Michel Suignard and Asmus Freytag, verified in expert reviews by Lu Qin and Wil Tan, and based on multiple open public consultations.
General references for the language:
Wikipedia: "Chinese language", https://en.wikipedia.org/wiki/Chinese_language
Omniglot: "Written Chinese", http://www.omniglot.com/chinese/written.htm
Other references cited in this document:
In the listing of the repertoire by code point, references starting from [0] refer to the version of the Unicode Standard in which the corresponding code point was initially encoded. Other references (starting from [100]) document usage of code points. For more details, see the Table of References below.
]]>