This document, together with the set of element LGRs, specifies an integrated collection of Label Generation Rules for the Root Zone. For more details on the Root Zone LGRs and their development see “Root Zone Label Generation Rules (RZ LGR-5): Overview and Summary” [RZ-LGR-5-Overview].
As described in the overview, a set of label generation rules for a zone governs the set of labels that may be allocated and eventually delegated in that zone. The Root Zone LGR (RZ-LGR) provides this determination with respect to IDN labels for the Root. Logically, any LGR contains four parts: the rules that define allowable Unicode code points (the repertoire), any code point variants that can be substituted to form a variant label (the variant rules), the disposition of any resulting label (whether it may be allocated, or is automatically blocked), and a set of optional whole-label evaluation rules that determine whether the output of the previous three portions is still an acceptable label in the root zone.
The Label Generation Rules are expressed using a standard format defined in “Representing Label Generation Rulesets in XML” [RFC 7940]. This XML format does not separate the LGR cleanly into the four logical parts described above, but it does provide for a mechanical computation of the status of any label as valid or invalid, and if a valid variant, as to whether that variant is allowed to be allocated, or is instead automatically blocked. Because the Root Zone caters to many scripts, each of which will have script-specific rules, the RZ-LGR is divided into a Common LGR, needed to manage interaction of labels across scripts (such as blocked cross-script variants), plus a set of script-specific Element LGRs.
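As a rough illustration of how such a file can be read mechanically, the following Python sketch extracts variant mappings from a tiny LGR fragment. The element and attribute names (lgr, data, char, var, cp, type) and the namespace follow [RFC 7940]; the sample data and the helper name read_variants are invented for this example and are not taken from the Root Zone LGR.

```python
import xml.etree.ElementTree as ET

# Namespace defined for the LGR XML format in RFC 7940.
NS = {"lgr": "urn:ietf:params:xml:ns:lgr-1.0"}

# Invented two-entry sample; not actual Root Zone data.
SAMPLE = """<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
  <data>
    <char cp="0061">
      <var cp="0430" type="blocked"/>
    </char>
    <char cp="0430">
      <var cp="0061" type="blocked"/>
    </char>
  </data>
</lgr>"""


def read_variants(xml_text):
    """Return {source cp: [(target cp, variant type), ...]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.findall("lgr:data/lgr:char", NS):
        variants = char.findall("lgr:var", NS)
        table[char.get("cp")] = [(v.get("cp"), v.get("type")) for v in variants]
    return table


variants = read_variants(SAMPLE)
```

The sketch ignores ranges, context rules, classes and actions, all of which a conforming processor would also need to handle.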
This file specifies the Common LGR, which is one of a set of LGR files that together form an integrated LGR for the DNS Root Zone [RZ-LGR-5].
For readability, a non-normative HTML presentation is available for each of the XML files. That presentation not only provides a formatted view, but is augmented with annotations derived from the Unicode Character Database, as well as with summary data and common explanatory text.
The Root Zone Label Generation Rules (RZ LGR-5) are integrated from the following set of script-specific element LGRs:
Each element LGR provides the complete specification for determining both the validity of a label in the given script and whether a proposed label is an allocatable variant for a given original label. As the set of blocked labels may contain mixed-script or cross-repertoire labels, it cannot be computed from the element LGR. The same is true for index variants computed in collision testing. See Section 5.4, “Steps in Processing a Label” in [RZ-LGR-5-Overview].
Each element LGR represents in full the underlying proposal for the script-based LGR, except for changes required by the integration process or for uniformity of presentation. See Section 3, “Integration and Contents of LGR-5” in [RZ-LGR-5-Overview].
This merged, or common, LGR has been machine generated by combining the data from all element LGRs plus a common preamble containing this description and the list of references used in this merged LGR. The merged LGR contains the union of the repertoire, variant mappings and Whole Label Evaluation (WLE) and context rules of the element LGRs as described in the following sections. Items that are necessarily script-dependent, such as the types for variant mappings, have been removed or replaced by default values, such as “blocked”, or, where appropriate, have their names prefixed with the script ID.
The XML version of this file is the normative specification; the HTML version is provided for ease of reading. That format contains additional explanatory text and other information such as summary counts or attributes looked up in the Unicode Character Database.
When processing an applied-for label, this merged LGR presents the complete data and specification needed for conflict checking with any existing label, independent of script. This is in contrast to the script-specific element LGRs, each of which presents the complete data and specification to determine the validity of any label when applied for under that script as well as the verification of any proposed allocatable variant for the label. In particular, the merged LGR contains a complete set of all cross-script and cross-repertoire variants defined for the Root Zone. This can be used to generate an index variant for each label: two labels with the same index variant mutually block each other, unless one is an allocatable variant of the other (which would be determined using the selected Element LGR). See also Section 5, “Using the LGR” in [RZ-LGR-5-Overview].
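The index-variant comparison described above can be sketched in Python as follows. This is a simplified illustration, not the normative algorithm: it assumes variants between single code points only (no sequences or context rules), takes the smallest code point of each variant set as its representative, and uses invented variant groups. The names build_variant_sets, index_variant and collide are hypothetical.

```python
def build_variant_sets(groups):
    """Map every code point to the full variant set that contains it."""
    table = {}
    for group in groups:
        members = frozenset(group)
        for ch in members:
            table[ch] = members
    return table


def index_variant(label, variant_sets):
    """Replace each code point by the smallest member of its variant set."""
    return "".join(min(variant_sets.get(ch, frozenset()) | {ch}) for ch in label)


def collide(label_a, label_b, variant_sets):
    """Two labels block each other when their index variants are equal."""
    return index_variant(label_a, variant_sets) == index_variant(label_b, variant_sets)


# Illustrative groups only: Latin/Cyrillic "a" and Latin/Cyrillic/Greek "o".
sets = build_variant_sets([{"a", "\u0430"}, {"o", "\u043e", "\u03bf"}])
latin_label = "oa"
cyrillic_label = "\u043e\u0430"
```

Whether one of two colliding labels is an allocatable variant of the other is not decided here; as stated above, that determination uses the selected Element LGR.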
The repertoire of the integrated Root Zone LGR is the cumulative repertoire of all the element LGRs that have been integrated into this version. Those repertoires in turn were developed based on [MSR-5], which is a subset of [Unicode 11.0].
As the repertoires are merged, any code point that has a reflexive “out-of-repertoire-var” mapping in every element LGR containing that code point is not considered part of the merged repertoire and is not included in it. In contrast, in the Element LGRs, all code points and mappings defined are always retained, independent of whether they are part of the merged repertoire.
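The merging rule can be illustrated with a short Python sketch. The data model is an assumption made for this example: each element LGR is reduced to a mapping from code point to a flag saying whether that code point carries a reflexive “out-of-repertoire-var” mapping in that LGR. The script tags and code points are invented sample data, and merged_repertoire is a hypothetical name.

```python
def merged_repertoire(element_lgrs):
    """Keep a code point if at least one element LGR lists it as in-repertoire.

    element_lgrs: {script: {code point: has_out_of_repertoire_var}}
    """
    merged = set()
    for code_points in element_lgrs.values():
        for cp, out_of_repertoire in code_points.items():
            if not out_of_repertoire:
                merged.add(cp)
    return merged


# Invented sample: U+0430 is out-of-repertoire in "Latn" but a regular
# member of "Cyrl", so it stays; U+03BF is out-of-repertoire wherever it
# appears, so it is dropped from the merged repertoire.
sample = {
    "Latn": {0x0061: False, 0x0430: True},
    "Cyrl": {0x0430: False, 0x03BF: True},
}
result = merged_repertoire(sample)
```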
As a Root Zone LGR, the repertoire includes neither decimal digits nor the HYPHEN-MINUS.
For further details, see Section 3.2.1, “Repertoire” in [RZ-LGR-5-Overview].
Repertoire Listing: Each code point or range is tagged with the script or scripts with which these repertoire elements are used. Each cites the version of the Unicode Standard in which the code points were first encoded. All repertoire elements, including sequences of code points, cite each Element LGR containing that repertoire element; see “References” below.
Some code points or ranges are also tagged with further classifications specific to a given script. Such tags have been prefixed with the Unicode script identifier in the merged LGR.
Sequences as Part of the Repertoire: Element LGRs may define sequences as members of the repertoire. In some instances these sequences may contain code points not individually listed. These code points are implicitly part of the repertoire, even though they may only appear in the context of that sequence. Sometimes sequences are defined as out-of-repertoire targets for cross-repertoire variant mappings, even though they are otherwise redundant. Such sequences may be included in the merged repertoire, or as an “imposed” variant due to cross-script variant transitivity. (Being redundant, they do not affect the set of available labels under that LGR.)
Comments provide additional information for some of the code points, but also for the definition of variants, classes, rules and actions. Where comments were merged from multiple sources, duplicates are omitted, but non-matching comments are both shown, separated by a vertical bar. A comment marked with ⍟ indicates an item explicitly shared among two or more Element LGRs, or an item required by integration.
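The comment-merging convention can be pictured with a minimal Python sketch. It assumes that comments are compared after trimming whitespace and that the vertical bar is written with a space on each side; neither detail is specified above, and merge_comments is a hypothetical name.

```python
def merge_comments(comments):
    """Drop duplicate comments and join the remaining ones with a vertical bar."""
    unique = []
    for comment in comments:
        comment = comment.strip()
        if comment and comment not in unique:
            unique.append(comment)
    return " | ".join(unique)


merged = merge_comments(["used in sequences", "used in sequences", "shared item"])
```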
The variant mappings in this LGR are based on the union of the non-reflexive variant mappings from all the element LGRs that have been integrated into this version of the Root Zone LGR. After computing the union, the variants are mechanically augmented, if needed, to ensure transitivity of the set. Any variants added during this process are identified in comments.
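The augmentation step can be sketched in Python as computing connected groups of code points and emitting every pairwise mapping within a group. This is an illustration under simplifying assumptions (mappings are treated as undirected pairs of single code points, so missing reverse mappings are also reported as added), not the tool used to generate this file; close_variants is a hypothetical name.

```python
from itertools import permutations


def close_variants(mappings):
    """Return (closed, added): the symmetric, transitive closure and the new pairs."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, target in mappings:
        parent[find(source)] = find(target)

    groups = {}
    for cp in list(parent):
        groups.setdefault(find(cp), set()).add(cp)

    closed = {pair for members in groups.values() for pair in permutations(members, 2)}
    added = closed - set(mappings)
    return closed, added


# Invented sample: a~b and b~c are given, so a~c must be added.
given = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}
closed, added = close_variants(given)
```

In the merged file, the pairs corresponding to `added` are the ones that would be identified in comments.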
Variant Dispositions: Because the disposition of variant labels, for example, as “allocatable”, is specific to each script, information related to such dispositions cannot be expressed in the script-neutral context of this merged file. Instead, all merged variant mappings are labeled as “blocked” in this file as required for conflict checking. See also Section 3.2.2, “Variants” in [RZ-LGR-5-Overview].
Reflexive Variants: In the merged file, any reflexive variant mappings, including “out-of-repertoire-var” mappings, have been removed from any repertoire element in the merged repertoire, as these mappings do not contribute to the index variant calculation needed for collision checking. In contrast, each element LGR retains the full details of variant mappings as needed to determine variant disposition.
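The treatment of variant dispositions and reflexive variants in the merged file can be summarized in a small Python sketch. It assumes mappings are given as (source, target, type) triples; the sample triples are invented, and to_merged is a hypothetical name.

```python
def to_merged(mappings):
    """Drop reflexive mappings and relabel all remaining ones as "blocked"."""
    merged = {(src, tgt, "blocked") for src, tgt, _type in mappings if src != tgt}
    return sorted(merged)


element_mappings = [
    ("a", "a", "out-of-repertoire-var"),  # reflexive: removed
    ("a", "\u0430", "allocatable"),       # script-specific type: becomes "blocked"
    ("a", "\u0430", "blocked"),           # same pair from another element LGR
]
merged_mappings = to_merged(element_mappings)
```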
Implicit Cross-script Variants: Not all element LGRs explicitly specify all variants to other scripts and repertoires, choosing instead to inherit those defined in other script LGRs implicitly. Nevertheless, as long as cross-script variants are defined in at least one LGR, they will be part of this merged LGR and are considered in effect for all LGRs that contain the relevant code points or sequences. However, in all cases, each element LGR always lists all of the applicable in-script variants.
Variant Label Processing: Only the merged LGR contains the full information needed for computing the index variant for a label needed for collision checking. The merged LGR therefore must be used to determine conflicts (two labels blocking each other) while the element LGRs are used to determine whether a given label is an allocatable variant for a particular original label under that LGR. See also Section 5, “Using the LGR” in [RZ-LGR-5-Overview].
Large Variant Sets: Especially for longer labels, some variant definitions may lead to variant label sets that can be very large, in some cases so large as to not be enumerable in real time. In other cases, while the actual number of variant labels is limited, the enumeration may in the worst case require the evaluation of a large number of different ways of partitioning a label into sequences. However, given two labels, the Common LGR is designed to allow efficient determination of whether the two labels collide (block each other), while the element LGRs allow determination of whether one label is an allocatable variant if the other is an original applied-for label.
Context Rules for Variants: Some of the variants defined in this LGR are “effective null variants”, that is, some code points in the source map to “nothing” in the target with all other code points unchanged. (Because mappings are symmetric, it does not matter whether it is the forward or reverse mapping that maps to “null”.) Such variants require a context rule to keep the variant set well behaved. Symmetry requires the same context rule for both forward and reverse mappings.
In other cases, the sequences or code points making up source and target are constrained by explicit context rules on the code points (or by implicit context rules defined for the adjacent code points). In such a case, any variants may require context rules that match the intersection between the effective contexts for both source and target; otherwise, a sequence might be considered valid in some variant label when it would not be valid in an equivalent context in an original label. See Section 6.4, “Code Point Sequences” in [RZ-LGR-5-Overview].
Overlapping Variants: Some sequences may overlap, that is, they share a common part with another sequence or code point, so that in partitioning a label into code points and sequences, more than one partition is possible. In these cases, variants have to be computed for all possible partitions. In some cases, context rules on sequences or variants are defined to curtail any unwanted side effects of such multiple partitioning, such as having each partition being part of a different variant label set, or generating a different index variant. For further discussion, see Section 6.6, “Overlapped Variants” in [RZ-LGR-5-Overview].
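The multiple partitions mentioned above can be enumerated with a short recursive Python sketch. It assumes the repertoire is available as a set of strings (single code points and sequences) and ignores context rules; the sample repertoire is invented and partitions is a hypothetical name.

```python
def partitions(label, repertoire):
    """Return every way of splitting label into repertoire elements."""
    if not label:
        return [[]]
    result = []
    for end in range(1, len(label) + 1):
        head = label[:end]
        if head in repertoire:
            for rest in partitions(label[end:], repertoire):
                result.append([head] + rest)
    return result


# Invented repertoire in which the sequences "ab" and "bc" overlap on "b".
sample_repertoire = {"a", "b", "c", "ab", "bc"}
splits = partitions("abc", sample_repertoire)
```

Variants would then be computed for each of the returned partitions. The number of partitions can grow quickly with label length, which is the worst case referred to under Large Variant Sets.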
All variant mappings in the merged LGR cite each element LGR that explicitly lists the variant.
The specification of variants in the Root Zone LGR follows the guidelines in [RFC 8228].
This merged LGR includes the cumulative set of character classes from all the element LGRs that have been integrated into this version of the Root Zone LGR. See Section 3.2.3, “Character Classes” in [RZ-LGR-5-Overview]. The names for any script-specific character classes have been prefixed with the Unicode script identifier in this file.
This merged LGR includes the cumulative set of WLE and context rules and actions from all the element LGRs that have been integrated into this version of the Root Zone LGR. See Section 3.2.4, “Whole Label Evaluation Rules (WLE)” in [RZ-LGR-5-Overview]. See also the comments given for each rule or action.
The integrated LGR includes the set of required default WLE rules and actions applicable to the Root Zone and defined in [MSR-5]. They are marked with ⍟. These default rules include the restrictions defined in [RFC 5891] on placement of combining marks. (Note that the Bidi Rule of [RFC 5893] is implicitly satisfied in the Root Zone, due to restricting the repertoire to letters.)
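As an illustration of the combining-mark restriction, the following Python sketch checks whether a label begins with a combining mark, using the General_Category values (Mn, Mc, Me) from the Unicode data shipped with Python. It is a simplified stand-in for the corresponding default WLE rule, not a transcription of it, and starts_with_combining_mark is a hypothetical name.

```python
import unicodedata


def starts_with_combining_mark(label):
    """True if the first code point has General_Category Mn, Mc or Me."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


leading_mark = "\u0301a"   # COMBINING ACUTE ACCENT followed by "a"
trailing_mark = "a\u0301"  # "a" followed by COMBINING ACUTE ACCENT
```

A label for which the function returns True would be invalid under the restriction described above.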
The names for any script-specific rules have been prefixed with the Unicode script identifier in this file.
Some actions are triggered by script-specific variant type values. While such actions are collected in the Common LGR for reference, they are inoperative in the context of the merged LGR because in the merged LGR all variant type values have been mapped to “blocked”, or, in the case of reflexive variants, removed.
The Root Zone Label Generation Rules - LGR-5 were integrated by the Integration Panel [IP] from a set of proposals for script-specific root zone LGRs developed by community-based Generation Panels [GPs] in an open process with multiple public consultations defined in [Procedure] and [Guidelines]. For more information on the methodology and contributors, see [RZ-LGR-5-Overview], in particular Section 2, “Process of Integration” and Section 8, “Contributors”.
According to the “Procedure to Develop and Maintain the Label Generation Rules for the Root Zone in Respect of IDNA Labels” [Procedure], the Integration Panel is tasked with reviewing each script LGR proposal for the Root Zone and delivering an integrated Root Zone LGR after accepting them. Its members consist of experts in the areas of Unicode, Linguistics and Writing Systems, Domain Name System (DNS) and IDNA.
The Integration Panel was constituted on September 6, 2013 with the following members:
In the listing of the repertoire, references starting at [0] refer to Unicode Standard versions in which the corresponding code points were initially encoded. References [100] and above correspond to the script-specific LGRs that include the repertoire item. Repertoire items may have more than one reference.
In addition the following references are cited in this document:
For details on references [0] and up and [100] and up, please refer to the Table of References below, as well as to [RZ-LGR-5-Overview].
]]>