Alternate Path to Delegation Report for .education

Eligibility for Alternate Path to Delegation

TLD "education" is eligible for the Alternate Path to Delegation as described in the ICANN New gTLD Collision Occurrence Management plan. [1]

Second Level Domains (SLDs)

A total of 21577 unique applicable SLDs were detected in the eight DNS-OARC "Day In The Life" ("DITL") datasets [2] collected from 2006 to 2013 and in the 2010 DNSSEC rollout datasets (hereafter, the "input datasets"). Pursuant to the ICANN New gTLD Collision Occurrence Management plan, if the Registry Operator chooses to block [3] all of these strings, its proposed TLD may proceed to delegation in advance of the forthcoming Collision Occurrence Assessment.

The list of SLD Strings that must be blocked is available here.

Strings appearing in the input datasets that are not valid hostnames as defined in RFC 1123 ("LDH Rule") and are not valid A-labels as defined in RFC 5890 were not included in the block list. Furthermore, the contractually required SLD "nic" will not appear in the block list.
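As an illustration of this filter, the following is a minimal sketch and not the code used to produce the block list. It assumes the RFC 1123 "LDH Rule" can be approximated per label as 1 to 63 letters, digits, or hyphens with no leading or trailing hyphen. The function names are hypothetical, and full RFC 5890 A-label validation is not attempted.

```python
import re

# Assumed per-label form of the LDH rule: letters, digits, hyphens,
# 1-63 characters, no hyphen at either end.
_LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def is_ldh_label(label: str) -> bool:
    """Return True if the label is syntactically an LDH label."""
    return _LDH_LABEL.fullmatch(label) is not None

def is_block_list_candidate(label: str) -> bool:
    """Hypothetical filter: keep LDH labels, drop the required SLD "nic"."""
    label = label.lower()
    return is_ldh_label(label) and label != "nic"
```

Under these assumptions, a label such as "mail" is kept, while "bad_label", "-lead", and "nic" are dropped.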

The block list was determined as follows:

1) List all unique strings at the SLD position in DNS requests where the applied-for string is in the TLD position in all input datasets;
2) Filter the SLD query position as described above;
3) Remove the "Chrome 10" strings at the leftmost query position on a best effort basis (see Methodology section below).

The remaining SLD strings comprise the block list.
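The steps above can be sketched roughly as follows. This is an illustrative outline only: the function names, the `is_valid` callback, and the precomputed `chrome_labels` set are assumptions made for the example, not part of the published source code.

```python
from typing import Callable, Iterable, List, Optional, Set

def sld_of(qname: str, tld: str) -> Optional[str]:
    """Return the label immediately left of the TLD, or None if the TLD differs."""
    labels = qname.rstrip(".").lower().split(".")
    if len(labels) >= 2 and labels[-1] == tld:
        return labels[-2]
    return None

def build_block_list(
    qnames: Iterable[str],
    tld: str,
    is_valid: Callable[[str], bool],
    chrome_labels: Set[str],
) -> List[str]:
    """Collect unique SLDs, filter invalid ones, remove flagged Chrome labels."""
    # Step 1: unique strings at the SLD position for the applied-for TLD.
    unique = {s for s in (sld_of(q, tld) for q in qnames) if s is not None}
    # Step 2: keep only strings that pass the validity filter.
    kept = {s for s in unique if is_valid(s)}
    # Step 3: remove labels already identified as "Chrome 10" strings.
    return sorted(kept - chrome_labels)
```

A call would pass every query name seen across the input datasets, the string "education", a validity predicate, and the set of labels flagged as Chrome-generated.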

Methodology

Data and Source Code

DNS-OARC data was re-processed from the raw PCAP [4] files provided by participating DNS Root Server Operators as part of the "Day In The Life" and 2010 DNSSEC rollout data collection programs. Source code and procedures to process the raw files are available on GitHub. [5] Because processing these files is resource intensive, DNS-OARC members are invited to use the intermediary files located here (/home/kwhite/jas/gtld/jas) [6] for their own analysis and research.

SLD Strings Excluded from the Block List

A significant proportion of the queries appear to be randomly generated 10-character alphabetic [7] strings used by the Google Chrome browser to detect certain aspects of DNS resolver behavior. [8] While there appear to be numerous varieties and sources of random or algorithmically generated strings in the input datasets, the 10-character Chrome queries appear to present minimal risk if filtered from the block lists.

The "Chrome 10" strings come in triplets as described in the Chromium source code. Only strings that are seen coming in the triplet sequence from the same IP are eliminated.
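The triplet rule might be sketched as below. The exact definition of a "triplet sequence" is not given here, so this example assumes it means three consecutive queries from one source IP whose leftmost labels are three distinct strings of exactly 10 letters. The function name and input shape are hypothetical.

```python
import re
from collections import defaultdict
from typing import Iterable, Set, Tuple

# Assumed Chrome format: exactly 10 letters a-z (compared in lowercase).
CHROME_FORMAT = re.compile(r"[a-z]{10}")

def chrome_triplet_labels(queries: Iterable[Tuple[str, str]]) -> Set[str]:
    """queries: (source_ip, leftmost_label) pairs in capture order."""
    per_ip = defaultdict(list)
    for ip, label in queries:
        per_ip[ip].append(label.lower())

    flagged: Set[str] = set()
    for labels in per_ip.values():
        i = 0
        while i + 3 <= len(labels):
            window = labels[i:i + 3]
            is_triplet = (
                all(CHROME_FORMAT.fullmatch(l) for l in window)
                and len(set(window)) == 3
            )
            if is_triplet:
                flagged.update(window)
                i += 3  # consume the whole triplet
            else:
                i += 1  # slide forward by one query
    return flagged
```

Under this assumption, a 10-letter label such as "mailserver" that matches the format but does not arrive as part of a same-IP triplet is not flagged and stays in the block list.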

While "randomness" is relatively easy for humans to detect, it is remarkably hard for machines. However, because the datasets are so dominated by these labels, for which blocking adds no practical value, significant effort has been made to detect and exclude them.

We engaged expert data scientists to develop a robust mechanism to detect these random strings. The following is a high-level description of the algorithm they developed.

Only 10-character labels consistent with the format described in the Chromium source code are subjected to "random detection." [9]

Parameters were selected and algorithms were tuned with an English dictionary. [10] "Randomness" of each label is determined only after analyzing the entire dataset and performing a statistical analysis of the labels and multiple substrings depending on the individual characteristics of the label.

To validate and tune the algorithm, we ran 84 individual experiments for a total of 851 CPU hours on the DITL 2013 and 2012 datasets. The quality of the random recognizer was confirmed with the following tests:

1) Test #1: An English dictionary was used to count the number of false positives detected. The ratio of "RANDOM_YES predictions that hold an English word" to "the total # of RANDOM_YES predictions" was calculated. This test verifies that a RANDOM_YES will not have English words embedded in it. Manual inspection of borderline strings very often reveals English words like "host," "mail," and "server" embedded in strings, so this test verified the random recognizer's performance in those situations. Less than 0.2% error rate was observed following manual inspection.

2) Test #2: Results of the random detector were compared to a simplistic detection of the Chrome random strings. Less than 0.8% error rate was observed following manual inspection.
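The ratio used in Test #1 could be computed along the following lines. This is a sketch under stated assumptions: the function name, the substring-matching approach, and the minimum word length of 4 are all chosen for illustration and are not taken from the actual test harness.

```python
from typing import Iterable, List

def embedded_word_ratio(
    random_yes: List[str],
    words: Iterable[str],
    min_len: int = 4,  # assumed cutoff; very short words match by chance
) -> float:
    """Share of RANDOM_YES labels that contain a dictionary word."""
    vocab = {w.lower() for w in words if len(w) >= min_len}

    def holds_word(label: str) -> bool:
        label = label.lower()
        n = len(label)
        # Check every substring of at least min_len characters.
        return any(
            label[i:j] in vocab
            for i in range(n)
            for j in range(i + min_len, n + 1)
        )

    if not random_yes:
        return 0.0
    hits = sum(1 for label in random_yes if holds_word(label))
    return hits / len(random_yes)
```

A lower ratio suggests fewer labels with embedded words such as "host" or "mail" were marked as random. The labels counted as hits would still need the manual inspection described above.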

[1] http://www.icann.org/en/groups/board/documents/resolutions-new-gtld-annex-1-07oct13-en.pdf

[2] https://www.dns-oarc.net/oarc/data/ditl

[3] http://www.icann.org/en/topics/idn/idn-vip-integrated-issues-final-clean-20feb12-en.pdf, Section 5.

[4] PCAP is a binary network packet capture file format

[5] https://github.com/JASdevteam/dns-oarc

[6] DNS-OARC may move these intermediary files to a permanent location at some future point.

[7] The length is hard-coded in Chromium source (IntranetRedirectDetector::kNumCharsInHostnames = 10) here

[8] See comments in source code here

[9] Chromium Source

[10] Of course there are non-English strings in the labels, but many are English or English-derived strings like "proxy" and "host" that are common internationally. An English dictionary was a good validation tool, but in the end "randomness" is not determined by existence in an English dictionary.
