
Celebrating a Major Community-Driven Milestone in Enabling Multilingual Top-Level Domains


On 2 March 2016, the first version of Root Zone Label Generation Rules (LGR-1) became available. LGR-1 supports the Arabic script, and future versions will support additional scripts. This achievement follows the Root Zone LGR proposals that the Arabic and Armenian script communities submitted in November 2015.

The establishment of the first Root Zone Label Generation Rules is a significant step forward in developing a multilingual Internet. These rules provide an open and transparent method for determining the validity and variants of top-level domain (TLD) names, or labels, in the world's various scripts and writing systems. LGR-1 is the result of the hard work of the community-based Arabic Script Generation Panel, the Integration Panel and many other contributors. The work benefits current and future Internet users who use the Arabic script by making it easier to navigate the web, and helps address confusion and security issues in using the domain name system – specifically top-level domains.

I would like to congratulate everyone who contributed to achieving this monumental first step!

Journey to LGR-1: Arabic Script Generation Panel

The community-based Task Force on Arabic Script Internationalized Domain Names (TF-AIDN) started organizing in the second half of 2013 and formally began working as the Arabic Script Generation Panel in February 2014. Arabic is one of the more complex scripts to examine because it is used to write several different languages across Asia and Africa, resulting in many variations in the shapes of its letters and in how the script is used. Letters that differ only slightly can look alike to users, even within the Arabic script community, who are not familiar with the breadth of Arabic script usage, and this causes confusion in labels. For example, Arabic script users can confuse کتاب (kitab, "book") with ڪتاب, interpreting the latter as a stylistic variation of the former. These variations add to the inherent complexity of the script, which already contains context-dependent cursive shapes of letters and many combining marks to indicate both consonantal and vocalic content.
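To make the kitab example concrete, here is a minimal Python sketch that lists the Unicode code points behind the two spellings. It assumes the two words are encoded as they appear in this post, with U+06A9 as the first letter of one and U+06AA as the first letter of the other; the helper names `describe` and `differing_positions` are made up for illustration.

```python
import unicodedata

# The two spellings from the example, written as escapes so the
# code points are unambiguous (assumed to match the text of this post).
KITAB_KEHEH = "\u06A9\u062A\u0627\u0628"      # first letter: U+06A9
KITAB_SWASH_KAF = "\u06AA\u062A\u0627\u0628"  # first letter: U+06AA


def describe(label):
    """Return (code point, Unicode character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in label]


def differing_positions(a, b):
    """Return the indexes at which two equal-length labels differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    for label in (KITAB_KEHEH, KITAB_SWASH_KAF):
        print(describe(label))
    print(differing_positions(KITAB_KEHEH, KITAB_SWASH_KAF))
```

The two labels differ only in their first code point, which is why they can look like the same word to a reader while being distinct strings to the DNS.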

Since the use of Arabic script is so geographically and linguistically diverse, the first challenge for the initial members of the Arabic Script Generation Panel was to recruit representative experts from a variety of disciplines, including linguistics, technology and policy, as well as from the end-user community. This group worked with ICANN's Global Stakeholder Engagement team to recruit a total of 33 members, representing 21 countries – an impressive variety of Arabic script users!

What followed was an eighteen-month development process involving thousands of emails, scores of online meetings, plenty of lively discussions, some tough linguistic compromises and much hard work. In its progress toward the first LGR, the Arabic Script Generation Panel accomplished three key tasks:

  1. Analyzing Unicode code points for inclusion.

    As a first task, the group had to determine which code points should be allowed for use when forming labels. As a starting point, the Integration Panel prepared a short list, which still contained more than 200 code points. This effort involved finding and documenting sources to verify that each code point was used in contemporary and active language, and if not, excluding it. This was a difficult task, especially in cases of communities whose use of the script is undocumented because their countries formally use other scripts, such as Cyrillic or Latin. Examples of such cases were found in both Asia and Africa.

  2. Defining variants of the code points.

    It was challenging to determine what would be a variant in Arabic script, because there were many ways the script community could consider two code points equivalent, including homoglyphs, stylistic variations, and placement and orientation of dots and other marks. In addition, some variants are semantically related but have graphically unrelated forms, motivated by cultural contexts and phonological considerations. While being liberal in variant definition to manage end-user confusion, the Arabic Script Generation Panel also had to minimize the "allocatable" variant labels generated by these variant code points due to the conservatism of the Root Zone. This was a real challenge, as it meant that different communities had to compromise on their linguistic expressions.

  3. Determining the whole-label evaluation rules that allow only valid labels.

    A new challenge arose while creating label generation rules. It was not evident how to define linguistic rules for script-level label verification. Spelling rules and other criteria are generally based on languages and not scripts, and may not apply to domain labels because labels are not limited to real words in a language. The Arabic Script Generation Panel addressed this challenge by using the usability of the labels as the limiting criterion. For example, it developed rules that invalidated labels that required switching between keyboards of different languages while using Arabic script.
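The three tasks above can be pictured as three checks applied to a candidate label. The Python sketch below is a toy illustration only: the repertoire, the variant set, the keyboard layouts and all function names are assumptions invented for this example, not the actual LGR-1 data or its rule format. It also ignores the distinction between allocatable and blocked variants.

```python
from itertools import product

# Illustrative data only: NOT the actual LGR-1 repertoire, variants or rules.
REPERTOIRE = {0x0627, 0x0628, 0x062A, 0x0643, 0x06A9, 0x06AA}
VARIANT_SETS = [{0x0643, 0x06A9, 0x06AA}]  # hypothetical set of kaf-like letters
KEYBOARDS = {  # hypothetical per-language keyboard layouts
    "keyboard_a": {0x0627, 0x0628, 0x062A, 0x0643},
    "keyboard_b": {0x0627, 0x0628, 0x062A, 0x06A9},
    "keyboard_c": {0x0627, 0x0628, 0x062A, 0x06AA},
}


def in_repertoire(label):
    """Task 1: every code point must be in the allowed repertoire."""
    return all(ord(ch) in REPERTOIRE for ch in label)


def variants_of(cp):
    """Task 2: return all code points treated as equivalent to cp."""
    for variant_set in VARIANT_SETS:
        if cp in variant_set:
            return sorted(variant_set)
    return [cp]


def variant_labels(label):
    """Generate every variant label, excluding the original label."""
    choices = [variants_of(ord(ch)) for ch in label]
    combos = {"".join(map(chr, combo)) for combo in product(*choices)}
    return combos - {label}


def typable_on_one_keyboard(label):
    """Task 3 (toy whole-label rule): one layout must cover the label."""
    code_points = {ord(ch) for ch in label}
    return any(code_points <= layout for layout in KEYBOARDS.values())


def evaluate(label):
    """Apply the three checks and report validity plus variant labels."""
    if not label or not in_repertoire(label):
        return {"valid": False, "variants": set()}
    if not typable_on_one_keyboard(label):
        return {"valid": False, "variants": set()}
    return {"valid": True, "variants": variant_labels(label)}
```

With this toy data, `evaluate("\u06A9\u062A\u0627\u0628")` reports a valid label with two variant labels, while a label that mixes U+06A9 and U+06AA is rejected because no single hypothetical keyboard covers both letters.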

What's next? Allocatable variants of Arabic script top-level domains can now be determined. The community must agree on how such TLDs will be implemented and delegated. This work is in progress and ICANN will soon be seeking community input on the mechanisms that will be used to manage variant TLDs.

The Label Generation Rules Journey Continues

Community volunteers for other scripts and writing systems are diligently working to complete LGR proposals for their scripts. Here's a short summary:

  • The Armenian Script Generation Panel also completed its work and submitted its proposal in November 2015, finishing in a record six months. However, due to homoglyph variants with Cyrillic, Greek and Latin scripts, the Integration Panel postponed the integration process – the work of the other generation panels will help to better understand the effect of these interactions.
  • The Chinese, Japanese and Korean communities use a mix of scripts to write their languages. The generation panels are analyzing these languages separately, and are also coordinating efforts to ensure a common solution to integrate Han script, which all of them share.
  • The Khmer, Lao and Thai script communities have shown great progress. Khmer and Lao generation panels are having rigorous discussions with the Integration Panel to finalize the complex script-based, whole-label evaluation rules. This feature is shared with the scripts derived from the complex Brahmi writing system.
  • The Cyrillic, Greek and Latin communities are at various stages of their analyses. Once they finish their internal work, they will start coordinating to finalize cross-script variants among themselves and with the Armenian script.
  • Ethiopic and Neo-Brahmi generation panels have formed. The communities have started their work and are learning process requirements. The Neo-Brahmi Generation Panel has a complex task at hand, as it is simultaneously working on nine different scripts of the region.
  • ICANN staff is reaching out to Georgian, Hebrew, Sinhala and Thaana script communities to encourage them to organize and start work on their respective LGR proposals.

Status of Work on Root Zone LGR by the Generation Panels (in March 2016)

[Figure: bar graph showing the status of each Generation Panel's work on the Root Zone LGR, March 2016]

As these script communities finalize their proposals, they will be incrementally integrated into subsequent releases of the LGR, allowing the relevant communities to determine the validity and variants of labels in these scripts.

I am excited by the progress to date and look forward to the completion of more LGR proposals. I am particularly grateful to all of the volunteers who understand the significance of this undertaking and have dedicated such hard work toward making a multilingual Internet a reality.

For more information on Label Generation Rules, please read the earlier blog posts on the Root Zone LGR: the introduction to Root Zone LGR, the collaboration required and the challenges faced. For more information on the work of the IDN Program at ICANN, visit icann.org/idn or email IDNProgram@icann.org.

Comments

    Nathalie Messie  05:51 UTC on 03 March 2016

    Congratulation for this progress. I hope everything will continue that well.
