
Linguistic Diversity in the Internet Root: The Case of the Arabic Script and Jawi

Representing Multiple Languages and Scripts in the Internet Root

The Internet Corporation for Assigned Names and Numbers (ICANN) manages the root zone for the global Internet. The root zone contains the authoritative list and record of all Top Level Domains (TLDs). Since the beginning of the Internet, only a subset of Latin script-based TLDs was allowed in the root zone. This is due to the origin and design legacy of the Domain Name System (DNS). The DNS was originally designed to handle Latin script in ASCII format.1 In 2010 a technical solution was implemented to enable the introduction of domain names in multiple scripts and languages at the top level of the DNS without destabilizing the Internet.

There are four types of TLDs that are relevant to Internet users: country code Top Level Domains (ccTLDs), generic Top Level Domains (gTLDs), IDN country code Top Level Domains (IDN ccTLDs) and generic IDN Top Level Domains (IDN gTLDs). TLDs that fall under the category of country codes represent official names of countries and territories as determined by the International Standard ISO 3166-1.2 Generic TLDs pertain to names or representation of names other than those associated with official names of countries and territories. Internationalized Domain Names (IDNs) refer to domain names in multiple scripts and languages that go beyond the original ASCII character set.3 See Box 1 for examples of Top Level Domains by category.

Box 1 – Domain Name Labels & Examples of Top Level Domains by Category

What are Domain Name Labels?

Domain names are arranged according to a hierarchy of labels from the lowest level to the highest level in the Domain Name System (DNS). Example: en.wikipedia.org. Here, "org" is a domain name label at the highest or top level, "wikipedia" is a domain name label at the second level, and "en" is a domain name label at the third and lowest level.
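As a small illustration of the hierarchy described above, the following Python sketch splits a domain name into its labels and lists them from the top level down. The function name `split_labels` is made up for this example and is not part of any DNS library.

```python
def split_labels(domain):
    """Return the labels of a domain name, ordered from the top level down."""
    # Drop an optional trailing dot (which denotes the root), then split on dots.
    labels = domain.rstrip(".").split(".")
    # The rightmost label is the top level, so reverse the order.
    return labels[::-1]


for level, label in enumerate(split_labels("en.wikipedia.org"), start=1):
    print(f"level {level}: {label}")
```

For "en.wikipedia.org" the first item is the top-level label "org" and the last is the third-level label "en".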

Examples of Top Level Domains (TLDs) by category

ccTLD

.de (Germany) | .et (Ethiopia) | .my (Malaysia) | .pr (Puerto Rico)

IDN ccTLD

مصر (Egypt) | 中国 (China) | भारत (India) | ไทย (Thailand)

gTLD

.com | .org | .ngo | .guru | .bank | .email | .organic | .photography

IDN gTLD

みんな (everyone) | дети (kids) | संगठन (organization) | 世界 (world) | بازار (bazaar) | 삼성 (Samsung) | vermögensberatung (financial advice) | คอม (com)
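The article does not describe how non-ASCII labels such as those in Box 1 are carried in a DNS that was designed for ASCII. In practice this is done with the IDNA standards, which store each label in an ASCII-compatible form made of the prefix "xn--" followed by its Punycode encoding (RFC 3492). The sketch below shows only that encoding step using Python's built-in `punycode` codec. The helper names `to_ascii_label` and `to_unicode_label` are assumptions for this example, and the sketch deliberately skips the validation and normalization that full IDNA2008 processing requires.

```python
def to_ascii_label(label):
    """Encode one Unicode label into its ASCII-compatible form."""
    if label.isascii():
        return label  # plain ASCII labels are stored unchanged
    # Punycode-encode the label and add the ACE prefix.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_unicode_label(label):
    """Decode an ASCII-compatible label back to Unicode."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


for tld in ["مصر", "中国", "org"]:
    print(tld, "->", to_ascii_label(tld))
```

A production system would normally rely on a maintained IDNA library rather than this bare encoding, since the codec alone does not check whether a label is permitted.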

IDN country code Top Level Domains

It was not until 2010 that it was possible for countries to have their country code TLDs represented in scripts other than Latin. ICANN established the IDN ccTLD Fast Track process to accommodate the need for countries to have their country code and country name TLDs in local scripts that reflect their linguistic base.

The first set of IDN ccTLDs inserted into the root zone was in Arabic.4 By mid-2015, ICANN had approved 47 IDN ccTLDs that cover 15 scripts and 24 languages.5 Two-thirds of these scripts are relevant for the Asia Pacific region. The highest demand for IDN ccTLDs was for the Arabic script (38.3%), followed by Cyrillic (14.9%) and Han (14.9%), Tamil (6.4%), and Bangla/Bengali (4.3%) – see Figure 1 for a breakdown of all IDN ccTLD applications. India has the distinction of having the largest number of scripts (seven in total) representing its country name, which reflects its national linguistic diversity.

Figure 1 - Breakdown of IDN ccTLDs Applications

Script Requested | Number of Applications
Arabic | 18
Cyrillic | 7
Han (Chinese) | 7
Tamil | 3
Bangla/Bengali | 2
Armenian | 1
Devanagari | 1
Georgian | 1
Greek | 1
Gujarati | 1
Gurmukhi | 1
Hangul | 1
Sinhala | 1
Telugu | 1
Thai | 1
Total | 47
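The percentage shares quoted in the text can be checked against the counts in Figure 1. The short Python sketch below recomputes them. The counts are copied from the figure, while the variable names are chosen only for this example.

```python
# Number of IDN ccTLD applications per script, as listed in Figure 1.
applications = {
    "Arabic": 18, "Cyrillic": 7, "Han (Chinese)": 7, "Tamil": 3,
    "Bangla/Bengali": 2, "Armenian": 1, "Devanagari": 1, "Georgian": 1,
    "Greek": 1, "Gujarati": 1, "Gurmukhi": 1, "Hangul": 1,
    "Sinhala": 1, "Telugu": 1, "Thai": 1,
}

total = sum(applications.values())

# Share of each script as a percentage of all applications, to one decimal.
shares = {script: round(100 * count / total, 1)
          for script, count in applications.items()}

print(f"{len(applications)} scripts, {total} applications")
for script, share in shares.items():
    print(f"{script}: {share}%")
```

The Arabic share works out to 18 of 47, or about 38.3%, and the five scripts named in the text together account for 37 of the 47 applications.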

IDN generic TLDs

The deployment of IDN generic TLDs is still in progress and requires the involvement of script communities worldwide. Each language community that wants to have its script effectively represented in efforts that determine TLDs that would be allowed in the root zone should engage in the work of script Generation Panels. Engagement can take various forms such as forming, supporting and facilitating the Panels, becoming Panel members, or providing input or feedback to the Panels during ICANN's call for Public Comments. Script Generation Panels function according to the bottom-up multistakeholder collaboration model that exemplifies ICANN and other Internet organizations. For effectiveness, Panels must comprise experts in DNS, Unicode, IDNs, linguistics and domain name operations and policy. Where expertise is not available, assistance can be requested from ICANN.

Script Generation Panels are responsible for developing proposals that determine script-specific Label Generation Rules (LGR) for the root zone. These proposals are developed based on each script community's expertise and requirements for the use of a particular script in IDN TLD labels. Among other tasks, the work involves going through all the characters of a script and identifying which characters would be permitted for use in TLD labels, which would not be allowed, and what rules would apply to determine valid labels and their variants (if any). As Generation Panels have to cover entire repertoires of script characters together with their corresponding Unicode code points, the work requires considerable voluntary effort from script communities.
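To make the idea of Label Generation Rules more concrete, the Python sketch below models the two ingredients mentioned above: a repertoire of permitted code points and a table of variants. It is a toy under stated assumptions. `TOY_REPERTOIRE` and `TOY_VARIANTS` are invented for illustration and are not taken from any Generation Panel proposal, and real Label Generation Rules also contain whole-label rules that this sketch omits.

```python
from itertools import product

# Hypothetical data, NOT taken from any real Label Generation Rules proposal.
# Code points: MEEM, SAD, REH, KAF, KEHEH.
TOY_REPERTOIRE = {0x0645, 0x0635, 0x0631, 0x0643, 0x06A9}
# Assumed variant relation between KAF and KEHEH, for illustration only.
TOY_VARIANTS = {0x0643: {0x06A9}, 0x06A9: {0x0643}}


def is_valid_label(label):
    """A label is valid here if it is non-empty and uses only permitted code points."""
    return len(label) > 0 and all(ord(ch) in TOY_REPERTOIRE for ch in label)


def variant_labels(label):
    """Return every other label obtained by swapping code points for their variants."""
    choices = [sorted({ord(ch)} | TOY_VARIANTS.get(ord(ch), set())) for ch in label]
    combos = {"".join(chr(cp) for cp in combo) for combo in product(*choices)}
    return combos - {label}


print(is_valid_label("\u0645\u0635\u0631"))   # label built only from the repertoire
print(is_valid_label("\u0645\u06F2"))         # contains a digit outside the repertoire
print(variant_labels("\u0643\u0631"))         # KAF swapped for its assumed variant
```

The point of the sketch is the shape of the work: every code point of a script has to be classified, and every variant relation has to be stated explicitly, which is why the task is labor intensive for large scripts.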

Currently, nearly 20 script communities are actively working on developing their Label Generation Rules for the root zone. The range of scripts includes Arabic, Armenian, Bengali, Chinese, Cyrillic, Devanagari, Gujarati, Gurmukhi, Japanese, Kannada, Khmer, Korean, Latin, Malayalam, Oriya, Tamil and Telugu. The bulk of the work is concentrated on scripts of the Asia Pacific region. This is not surprising considering that nearly half of the existing 3 billion Internet users are in that region. The next billion Internet users are expected to come online by the year 2020, and the majority of them will also come from the Asia Pacific region. The demand for using the Internet in local scripts and languages exists. In Asia, governments have been instrumental in initiating and facilitating the launch of script Generation Panels. These governments understand the importance of Internet accessibility and usability in local scripts for their populations.

Generation Panels that focus on a script that is shared across many languages require more time to complete their work than Panels that deal with a single language. For example, the Armenian Generation Panel required only six months to complete its proposal on a script used by the Armenian language. The Arabic Generation Panel, in contrast, took approximately 20 months to complete its work. The longer period was necessary because the Arabic script is used by more than 50 languages across Africa, the Middle East, and Asia (specifically West Asia, South Asia and South East Asia). The Arabic Generation Panel was a pioneer in the Root Zone Label Generation Rules project in two ways. It was the first to organize itself for the work, and its experience resulted in the methodology and templates that are used to guide the work of subsequent Generation Panels.

Proposals from both the Arabic and Armenian Generation Panels are now complete and have been published by ICANN for public review and comments – see Box 2 for links to calls for Public Comments. Affected language communities are strongly encouraged to respond and provide feedback to ensure that script repertoires for the root zone address user language needs on the Internet.

Box 2 – ICANN Call for Public Comments on Script Generation Panel LGR Proposals

Arabic Script Root Zone Label Generation Rules Proposal - https://www.icann.org/public-comments/arabic-lgr-proposal-2015-08-24-en.
(Closing Date: 16 October 2015)

Armenian Script Root Zone Label Generation Rules Proposal - https://www.icann.org/public-comments/proposal-armenian-lgr-2015-07-22-en.
(Closing Date: 31 August 2015)

Figure 2 provides an overview of all the characters that the Arabic Generation Panel is proposing for forming IDN TLDs in Arabic Script. It identifies what characters are proposed for inclusion and exclusion, which would be applicable for all languages that use the Arabic script.

Figure 2 – Relevant Unicode Tables with Arabic Characters Proposed by the Arabic Generation Panel (GP) for the Root Zone Label Generation Rules (LGR)

Yellow – Proposed for the Root Zone LGR by the Arabic GP
Blue – Excluded by Arabic GP
Pink – Excluded from Maximal Starting Repertoire (MSR)
White – Disallowed by IDNA2008 by IETF


Box 3 highlights the case of a localized Arabic script in South East Asia (i.e., Jawi). The case identifies Jawi characters that are proposed for exclusion from IDN TLDs.

Box 3 – The Case of Jawi

Jawi (جاوي) is the localized name of the Arabic script used in South East Asian languages. These languages include Acehnese, Banjarese, Malay, Minangkabau and Tausug. Countries with records of Jawi usage include Brunei, Indonesia, Malaysia, Singapore and Thailand. Variants of Jawi can also be found in other countries in the subregion. Jawi was once a dominant script of South East Asia. Its usage was impacted by the widespread adoption of the Latin alphabet. Today Jawi retains formal status in Brunei and Malaysia. Brunei adopted Jawi as one of its two official scripts while Malaysia uses it as an alternative script that is generally reserved for religious, cultural, academic and administrative purposes. Malaysia's successful application for its IDN country code TLD in Arabic script (.مليسيا) is indicative of Jawi's formal status in the country. The Arabic Generation Panel factored Jawi and the Malay language in its script review to determine Label Generation Rules for the root zone. Key documentation related to Jawi from Malaysia reveals close to 50 characters (and corresponding Unicode code points) of interest in the Arabic script.6 Nearly all of those characters have been included in the Arabic Generation Panel's Label Generation Rules proposal for the root zone. Three characters were proposed for exclusion (see Table below).

Table of Jawi characters proposed for exclusion in IDN TLDs for the Arabic Script

Character | Unicode Code Point | Code Point Name and Properties | Jawi Coded Character Name7 | [Excluded by] - Rationale
۲ | 06F2 | EXTENDED ARABIC-INDIC DIGIT TWO | EXTENDED ARABIC-INDIC DIGIT TWO | [IETF]8 – Digits are not permitted in TLD labels.
ڬ | 06AC | ARABIC LETTER KAF WITH DOT ABOVE | GAF | [Integration Panel]9 - Obsolete Malay-Jawi. Use ݢ (U+0762) instead.
ء | NONE | NONE | ARABIC HAMZAH THREE QUARTER | [Arabic Generation Panel] - No Unicode encoding and thus not eligible for consideration.
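The code points in the table above can be inspected with Python's standard `unicodedata` module, as in the following sketch. It prints the Unicode name and general category of the two encoded characters and of the suggested replacement U+0762. The helper `is_decimal_digit` is a name chosen for this example. It reflects only the "digits are not permitted" rationale in the table and is not the panels' actual exclusion procedure.

```python
import unicodedata

# Code points mentioned in the table: the excluded digit, the excluded
# letter, and the replacement letter suggested for it.
code_points = [0x06F2, 0x06AC, 0x0762]

for cp in code_points:
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


def is_decimal_digit(ch):
    """True if Unicode classifies the character as a decimal digit (category Nd)."""
    return unicodedata.category(ch) == "Nd"


# Digits are not permitted in TLD labels, so U+06F2 is ruled out on that basis.
print(is_decimal_digit(chr(0x06F2)))
print(is_decimal_digit(chr(0x06AC)))
```

The third row of the table cannot be checked this way, since the Jawi three-quarter hamzah has no Unicode code point of its own.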

Jawi script and associated language communities in South East Asia are urged to review the proposal and provide feedback through the ICANN Public Comments process, which is open until 16 October 2015 (https://www.icann.org/public-comments/arabic-lgr-proposal-2015-08-24-en).

Technical Restrictions to Linguistic Diversity in the Root

The work of the script Generation Panels is encouraging and moves the world closer to the ICANN vision of "One World, One Internet." There is broad recognition at ICANN that IDNs will increase Internet use by the majority of the world's population and the ICANN community strongly supported the deployment of IDNs. This led to the establishment of the IDN Program, which supported the IDN ccTLD Fast Track Process that enabled IDN ccTLDs and the Root Zone Label Generation Rules Project that enabled generic IDN TLDs.

In a world with more than seven billion people, more than 7000 living languages, and numerous writing systems or scripts, an Internet that serves the world should be linguistically diverse.10 Because the root zone is a shared global space, TLD additions are restricted according to the principles for zone operation proposed by the Internet Engineering Task Force and ICANN's policies for root security and stability. The procedure adopted by ICANN for developing and maintaining Label Generation Rules for the root zone restricts additions to the root zone to "those writing systems where there is a clear interest."11 Language communities that are active on the Internet and have a clear interest in having their script enter the root zone are strongly encouraged to engage in their script's Label Generation Rules process.

The Challenge of Universal Acceptance

Despite all of the efforts made by ICANN and its community of stakeholders, one major obstacle stands in the way of achieving a multilingual Internet: Universal Acceptance. Top Level Domains have evolved since they were introduced to the world and they will continue to evolve with further expansion of ICANN's new generic TLD program.12 Some Internet services and software applications have not kept up with that evolution. As a result, some TLDs are unusable in those services and applications, which essentially blocks user access to websites, email and other applications – see Box 4 for how domain names are relevant to Internet users.

Challenges of "Universal Acceptance" include Internet services and software applications not accepting TLDs written in non-ASCII scripts, not accepting TLD names that are longer than three characters, and not supporting IDNs or other non-ASCII names in email. According to the Universal Acceptance Steering Group, "Software and service providers have historically been unaware of these problems or had little market or regulatory incentive to invest in solutions that would bring true interoperability to platforms or applications."13

Solving the problem of "Universal Acceptance" requires getting the providers of Internet services and software developers to support the principle that all domain names and email addresses must be accepted, stored, processed and displayed in a consistent and effective manner. To support Internet users worldwide, TLDs need to be made usable in applications regardless of their script, length or newness. If this challenge of "Universal Acceptance" can be overcome, and with more support for local content worldwide, we would be able to have a truly multilingual Internet.
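The kind of software limitation described above can be shown with a short Python sketch. It contrasts a legacy-style check, which assumes ASCII names ending in a two- or three-letter TLD, with a more permissive check that looks only at label structure. Both functions and the pattern are hypothetical examples written for this article, not code from any particular product, and the permissive check is a minimal sketch rather than a complete validator.

```python
import re

# Legacy-style assumption: ASCII labels only, TLD of two or three letters.
LEGACY_PATTERN = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,3}$")


def legacy_accepts(domain):
    """Mimic an outdated validator that rejects long and non-ASCII TLDs."""
    return LEGACY_PATTERN.match(domain.lower()) is not None


def permissive_accepts(domain):
    """Accept any dotted name whose labels fit the 63-octet DNS limit once encoded."""
    labels = domain.rstrip(".").split(".")
    if len(labels) < 2:
        return False
    for label in labels:
        if not label:
            return False  # empty label, for example "example..com"
        # Non-ASCII labels are measured in their "xn--" Punycode form.
        encoded = label if label.isascii() else "xn--" + label.encode("punycode").decode("ascii")
        if len(encoded) > 63:
            return False
    return True


for name in ["example.com", "example.photography", "example.世界"]:
    print(name, legacy_accepts(name), permissive_accepts(name))
```

The legacy check accepts only "example.com", while the permissive check accepts all three names, including the long gTLD and the IDN gTLD taken from Box 1.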

Box 4 – Relevance of Domain Names to Users

How are Domain Names relevant to Internet Users?

Internet resources are numerically addressed. Domain names make it easier for people to access those resources without having to memorize numbers. Most users would not be able to access and use the Internet, its services and applications, without domain names. These applications include the World Wide Web and email. It is worth remembering that email addresses contain domain names after the "@" symbol. Internet end users typically use domain names when using web browsers, email, and mobile apps. They also use domain names when they set up online accounts for services on the Internet. Most end users use domain names to access content published by others. Some of them also register a domain name to publish their own information through websites.

 

Rinalia Abdul Rahim is a member of the Arabic Generation Panel for Root Zone Label Generation Rules and the Task Force on Arabic IDNs. Formerly, she was Co-Chair of the At-Large IDN Working Group that focuses on IDN issues of interest to individual Internet users worldwide. She is also a member of the ICANN Board of Directors and the ICANN Board Working Group on IDN and Variants.


1 ASCII stands for American Standard Code for Information Interchange.

2 http://www.iso.org/iso/countrycodes/countrycodes

3 The original ASCII character set allowed in domain names included the letters a-z, digits and hyphen. Domain names at the top level have a special restriction in that they are only allowed to contain letters and not digits or hyphen.

4 https://www.icann.org/news/announcement-2010-05-05-en

5 https://www.icann.org/resources/pages/string-evaluation-completion-2014-02-19-en.

6 Dewan Bahasa dan Pustaka, Daftar Kata Bahasa Melayu-Rumi-Sebutan Jawi (2001); MYNIC/.MY Domain Registry, Jawi Language Table Submitted to the IANA Repository (2009); MYNIC/.MY Domain Registry, Report for Malaysia's Internationalized Domain Name: Jawi Language Issues, Version 1.0 (2009); Standards Malaysia, Malaysian Standard on IT-Jawi Coded Character Set for Information Interchange (2012).

7 Standards Malaysia, Malaysian Standard on IT-Jawi Coded Character Set for Information Interchange (2012).

8 Internet Engineering Task Force (IETF) RFC1123 and RFC6912

9 Root Zone LGR Integration Panel, MSR-1-Annotated-non-CJK-Tables-20140606 pages 32-38 (https://www.icann.org/en/system/files/files/msr-non-cjk-06jun14-en.pdf [PDF, 1.86 MB]).

10 https://www.ethnologue.com/enterprise-faq/how-many-languages-world-are-unwritten

11 https://www.icann.org/en/system/files/files/draft-lgr-procedure-20mar13-en.pdf [PDF, 1.39 MB]

12 See http://newgtlds.icann.org/en/program-status/delegated-strings for a rolling list of generic TLDs that are being delegated into the root.

13 https://www.icann.org/resources/pages/universal-acceptance-2012-02-25-en
