
IDN Character Validation Guidance

The IDNA protocol standard is currently in the last step of review and revision in the IETF. An informal expert panel, working as what the IETF calls a "design team," evaluated the experience gained from implementing IDNA since its introduction in 2003 and identified several key areas for future work. These were described in RFC 4690, which triggered a formal revision of the IDNA protocol. The core components of the revision effort are: a definition of valid IDN labels; an inclusion-based model (the current model is exclusion-based) that reflects how well the implications of Unicode's handling of the various scripts are understood for use in IDNs; elimination of confusing and non-reversible character mappings; a fix for a right-to-left error in Stringprep; and elimination of Unicode version dependencies, which permits more scripts to be used in IDNs now and in the future. RFC 4690 also discusses the issues with the current IDN model that led to the revision work.

The latest versions of the IDNA revision proposals are available through the IETF or at Patrik Fältström's site: http://stupid.domain.name/idnabis/

One of the core principles of the revision is that valid code points are defined by a procedure rather than by a table, so that the algorithm can determine the set of code points independently of the version of Unicode in use. However, as guidance for IDN ccTLD Fast Track participants, and until the revised version of the IDNA protocol has been implemented, the following tables of IDNA-valid characters, produced by running the protocol procedure on Unicode 5.2, are released:

  1. Characters that are valid under both IDNA2003 and IDNA2008 [TXT, 3.39 MB]
  2. Characters that are valid under IDNA2003 but not under IDNA2008 [TXT, 164 KB]
  3. Characters that are valid under IDNA2008 but not under IDNA2003 [TXT, 4 KB]

Note: Category (3) does not include the code points that were unassigned in Unicode 3.2 and are PVALID in IDNA2008.
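The relationship between the three tables can be pictured as plain set operations. The following is a minimal sketch only: the function name `categorize`, its parameters, and the toy integer "code points" are all hypothetical, and it does not read the published table files, whose format is not described here.

```python
def categorize(valid_2003, valid_2008, unassigned_in_3_2):
    """Split code points into the three table categories (illustrative only).

    All three arguments are sets of integer code points supplied by the
    caller; how they are obtained is outside the scope of this sketch.
    """
    # (1) valid under both IDNA2003 and IDNA2008
    both = valid_2003 & valid_2008
    # (2) valid under IDNA2003 but not under IDNA2008
    only_2003 = valid_2003 - valid_2008
    # (3) valid under IDNA2008 but not under IDNA2003, leaving out code
    #     points that were unassigned in Unicode 3.2 (see the note above)
    only_2008 = (valid_2008 - valid_2003) - unassigned_in_3_2
    return both, only_2003, only_2008


# Toy placeholder values, not real code point classifications.
both, only_2003, only_2008 = categorize({1, 2, 3}, {2, 3, 4, 5}, {5})
print(sorted(both), sorted(only_2003), sorted(only_2008))
```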

Warning Notes: The content of these tables has only been verified against the idnabis-table document. The tables do not include any bidi verification. Further, no confusability checking has been done between code points in these tables, so the tables provide only basic string validation. This means that if a string contains code points that are not valid according to these tables, further manual checks can be done, but it most likely implies a show-stopper for use of that string. In addition, even if every code point is valid per the tables, further manual checks must still be done to ensure that the entire string poses no stability issues for the DNS.
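The basic string validation described above amounts to checking each code point of a candidate string against a set of valid code points. The sketch below assumes such a set is already in memory; the function name `invalid_code_points` and the small stand-in set (lowercase ASCII letters, digits, and the hyphen) are illustrative assumptions, not the content of the published tables. As the warning notes say, an empty result is not sufficient: bidi and confusability checks are not covered.

```python
import string


def invalid_code_points(label, valid):
    """Return the code points in `label` that are not in the set `valid`.

    `valid` is a set of integer code points. This is only the basic
    per-code-point check; it performs no bidi or confusability checking.
    """
    return [f"U+{ord(ch):04X}" for ch in label if ord(ch) not in valid]


# Stand-in for a real table: lowercase ASCII letters, digits, hyphen.
toy_valid = {ord(ch) for ch in string.ascii_lowercase + string.digits + "-"}

print(invalid_code_points("example", toy_valid))   # no offending code points
print(invalid_code_points("ex_ample", toy_valid))  # underscore is reported
```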

Domain Name System
Internationalized Domain Name (IDN): IDNs are domain names that include characters used in the local representation of languages that are not written with the twenty-six letters of the basic Latin alphabet "a-z". An IDN can contain Latin letters with diacritical marks, as required by many European languages, or may consist of characters from non-Latin scripts such as Arabic or Chinese. Many languages also use digits other than the European "0-9". The basic Latin alphabet together with the European-Arabic digits are, for the purpose of domain names, termed "ASCII characters" (ASCII = American Standard Code for Information Interchange). These are also included in the broader range of "Unicode characters" that provides the basis for IDNs. The "hostname rule" requires that all domain names of the type under consideration here are stored in the DNS using only the ASCII characters listed above, with the one further addition of the hyphen "-". The Unicode form of an IDN therefore requires special encoding before it is entered into the DNS.

The following terminology is used when distinguishing between these forms. A domain name consists of a series of "labels" (separated by "dots"). The ASCII form of an IDN label is termed an "A-label". All operations defined in the DNS protocol use A-labels exclusively. The Unicode form, which a user expects to be displayed, is termed a "U-label". The difference may be illustrated with the Hindi word for "test", परीक्षा, appearing here as a U-label would (in the Devanagari script). A special form of "ASCII compatible encoding" (abbreviated ACE) is applied to this to produce the corresponding A-label: xn--11b5bs1di. A label that includes only ASCII letters, digits, and hyphens is termed an "LDH label". Although the definitions of A-labels and LDH labels overlap, a name consisting exclusively of LDH labels, such as "icann.org", is not an IDN.