Overview

This file contains Label Generation Rules (LGR) for the Bengali (Bangla) script for the Root Zone. This LGR covers Assamese, Bengali, Manipuri and a number of other languages written with the Bengali script. For details and additional background on the script, see "Proposal for a Bangla (Bengali) Script Root Zone Label Generation Rule-Set (LGR)" [Proposal-Bengali]. This file is one of a set of LGR files that together form an integrated LGR for the DNS Root Zone [RZ-LGR-4]. The format of this file follows [RFC 7940].

Repertoire

The repertoire contains 61 code points for letters, as well as 9 code point sequences, for a total of 70 repertoire elements. Out of the nine sequences: two sequences override a WLE constraint; four sequences were defined for in-script variants; and the other three sequences were defined to restrict U+09BC NUKTA from appearing in any context other than these sequences. Accordingly, while U+09BC is not listed by itself, it brings the total of distinct code points to 62. For more detail, see Section 5, "Repertoire" in [Proposal-Bengali].

The repertoire is based on [MSR-4], which is a subset of Unicode 6.3 [Unicode 6.3].

As part of the Root Zone, this LGR includes neither digits nor the HYPHEN-MINUS.

Each code point or range is tagged with the script or scripts that the code point is used with, one or more tag values denoting character category, and one or more references documenting sufficient justification for inclusion in the repertoire; see "References" below. For code points that are part of the repertoire, comments identify the languages using the code point.

Code points outside the Bengali script that are listed in this file are targets for out-of-script variants and are identified by a reflexive (identity) variant of type "out-of-repertoire-var". They do not form part of the repertoire.

Variants

This LGR defines in-script variants and cross-script variants as described in Section 6, "Variants", in [Proposal-Bengali]. There are three in-script variant sets: two sequence sets and one set for the variation of RA. See Section 6.1 of [Proposal-Bengali]. There are four cross-script variant sets: two with Gurmukhi and two with Devanagari. See Section 6.2 of [Proposal-Bengali].

Variant Disposition: The in-script variant pair U+09B0 / U+09F0 is of type "allocatable", thus allowing access to either user community. All other variants are of type "blocked", making labels that differ only by these variants mutually exclusive: whichever of these labels is chosen first, every equivalent variant label should be blocked. There is no preference among these variants.
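The effect of the allocatable RA pair can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, only the U+09B0 / U+09F0 pair is modeled, and the sketch assumes that variant labels mixing both code points are discarded because the Bengali-specific rules state that U+09B0 and U+09F0 cannot be mixed in the same label. The normative variant definitions are those in the LGR itself.

```python
from itertools import product

# The in-script RA pair described above (U+09B0 / U+09F0).
RA_PAIR = ("\u09B0", "\u09F0")


def ra_variant_labels(label):
    """Return the unmixed variant labels produced by the RA pair (illustrative)."""
    # Each RA code point may map to either member of the pair;
    # every other code point maps only to itself.
    choices = [RA_PAIR if ch in RA_PAIR else (ch,) for ch in label]
    candidates = {"".join(combo) for combo in product(*choices)}
    candidates.discard(label)  # the original label is not its own variant here
    # ASSUMPTION: labels mixing U+09B0 and U+09F0 are invalid and are dropped.
    return sorted(v for v in candidates if not all(ra in v for ra in RA_PAIR))
```

For a label containing two occurrences of U+09B0, the sketch yields a single variant label in which both occurrences are replaced by U+09F0.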

The specification of variants in the Root Zone LGR follows the guidelines in [RFC 8228].

Character Classes

Consonants: All consonants contain an implicit vowel. More details in Section 3.3.1, "The Consonants" of [Proposal-Bengali].

Hasanta: A special sign is needed whenever the implicit vowel in the preceding consonant is stripped off. This symbol is also known as the Halant or "Virama". More details in Section 3.3.2, "The Implicit Vowel Killer: Hasanta" of [Proposal-Bengali].

Vowels: Separate symbols exist for all "Swara" or Vowels in Bengali, which are pronounced independently either at the beginning of the word or after another vowel or consonant sound. To indicate a Vowel sound other than the implicit one, a Vowel sign (Matra) is attached to the consonant. More details in Section 3.3.3, "Vowels" of [Proposal-Bengali].

Anusvara: The Anusvara represents a homorganic nasal. It replaces a conjunct group of a Nasal Consonant+Halant+Consonant belonging to that particular barga or set. Before a non-barga consonant, the anusvara represents a nasal sound. More details in Section 3.3.4, "The Anusvara" of [Proposal-Bengali].

Candrabindu: Candrabindu denotes nasalization of the preceding vowel as in চাঁদ /cãd/ "moon" (U+099A U+09BE U+0981 U+09A6). This sign with a dot inside the half-moon mark is used as nasalization marker in many Indian scripts. More details in Section 3.3.5, "Nasalization: Candrabindu" of [Proposal-Bengali].
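The code point sequence given for the example word can be inspected with Python's standard unicodedata module. The helper name below is hypothetical, and the word is written with escapes so that the exact code points are unambiguous.

```python
import unicodedata

# The example word from the paragraph above: U+099A U+09BE U+0981 U+09A6.
WORD = "\u099A\u09BE\u0981\u09A6"


def describe(label):
    """List each code point of a label with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]


for line in describe(WORD):
    print(line)
```

The third entry is the Candrabindu, which follows the vowel sign it nasalizes.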

Visarga and Avagraha: The Visarga U+0983 is frequently used in Bengali loanwords borrowed from Sanskrit and represents a sound very close to /h/. More details in Section 3.3.7, "Visarga and Avagraha" of [Proposal-Bengali].

Ya-phala: There are two instances in Bangla where a Hasanta is preceded by a full vowel (U+0985 BENGALI LETTER A and U+098F BENGALI LETTER E). More details in Section 3.3.9, "Use of Ya-phala" of [Proposal-Bengali].

Ra-phala and Ref Sequences: RA+Hasanta (Repha or Ra-phala sequences). More details in Section 3.3.10, "Ra-phala and Ref Sequences" of [Proposal-Bengali].

Nukta: Nukta is not listed by itself in the repertoire; it is only included in three sequences. More details in Section 3.3.6, "Nukta" of [Proposal-Bengali].
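As an illustration, the following sketch shows how Unicode normalization produces sequences ending in U+09BC. It assumes that the three Nukta sequences correspond to the normalized forms of U+09DC, U+09DD and U+09DF, the letters listed under the Bengali-specific rules; the helper name is hypothetical and the authoritative list of sequences is the one in the repertoire.

```python
import unicodedata

# Precomposed letters RRA, RHA and YYA. Unicode excludes them from
# composition, so NFC leaves them as base letter + U+09BC NUKTA.
PRECOMPOSED = ["\u09DC", "\u09DD", "\u09DF"]


def nfc_code_points(text):
    """Return the code points of the NFC form of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


for letter in PRECOMPOSED:
    print(f"U+{ord(letter):04X} ->", " ".join(nfc_code_points(letter)))
```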

Zero Width Non-joiner (ZWNJ) and Zero Width Joiner (ZWJ): These are not included in the repertoire. More details in Section 3.3.8, "Zero Width Non-joiner (U+200C) and Zero Width Joiner (U+200D)" of [Proposal-Bengali].

Whole Label Evaluation (WLE) and Context Rules

Default Whole Label Evaluation Rules and Actions

The LGR includes the set of required default WLE rules and actions applicable to the Root Zone and defined in [MSR-4]. They are marked with ⍟. The default prohibition on leading combining marks is equivalent to ensuring that a label only starts with a consonant or vowel.
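A minimal sketch of the leading combining mark check is shown below. It assumes that "combining mark" can be approximated by the Unicode general categories Mn, Mc and Me; the function name is hypothetical and the normative rule is the one defined in the LGR.

```python
import unicodedata


def starts_with_combining_mark(label):
    """True if the first code point has a Mark general category (Mn, Mc, Me)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")
```

A label beginning with a vowel sign such as U+09BE, or with U+0981, would be rejected by this check, while a label beginning with a consonant such as U+0995 would pass.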

Bengali-specific Rules

These rules have been formulated as context rules suitable for adoption into an LGR specification.

The following symbols are used in the WLE rules:

C → Consonant
M → Kar (Matra)
V → Vowel
B → Onushshar (Anusvara)
X → Bisarga (Visarga)
D → Candrabindu
H → Hasanta (Halant)
Z → KhandaTa
P → Ra-Hasanta
S → (a/e) Ya-phala

The rules are:

1. C: C is a set of C and CN where CN is the set of normalized forms of {ড়,ঢ়,য়}
2. H: must be preceded by C
3. M: must be preceded by C
4. D: must be preceded by any of V, C, M
5. X: must be preceded by any of V, C, M, D
6. B: must be preceded by any of V, C, M, D
7. Z: must be preceded by any of V, C, M, D, B, X, P
8. V: CANNOT be preceded by H
9. S: CANNOT be preceded by H
10. U+09B0 and U+09F0 CANNOT be mixed in the same label

More details in Section 7, "Whole Label Evaluation Rules (WLE)" of [Proposal-Bengali].
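A subset of the rules above can be sketched as a simple left-to-right check. This is an illustration under stated assumptions, not the LGR's implementation: the function names are hypothetical, class membership is approximated from the layout of the Unicode Bengali block rather than taken from the repertoire, and only rules 2 to 6, 8 and 10 are covered. The Nukta sequences and the classes Z, P and S are left out.

```python
# ASSUMPTION: class membership approximated from the Unicode Bengali block;
# the normative classes are those defined in the LGR itself.
CONSONANTS = {chr(cp) for cp in range(0x0995, 0x09BA)} | {"\u09F0", "\u09F1"}
VOWELS = {chr(cp) for cp in range(0x0985, 0x0995)}
MATRAS = {chr(cp) for cp in range(0x09BE, 0x09CD)}
SIGNS = {"\u0982": "B", "\u0983": "X", "\u0981": "D", "\u09CD": "H"}

# Rules 2-6: the class on the left must be preceded by one of the listed classes.
MUST_FOLLOW = {
    "H": {"C"},
    "M": {"C"},
    "D": {"V", "C", "M"},
    "X": {"V", "C", "M", "D"},
    "B": {"V", "C", "M", "D"},
}


def classify(ch):
    """Map a code point to its class symbol, or None if outside this sketch."""
    if ch in CONSONANTS:
        return "C"
    if ch in VOWELS:
        return "V"
    if ch in MATRAS:
        return "M"
    return SIGNS.get(ch)


def check_label(label):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if "\u09B0" in label and "\u09F0" in label:
        problems.append("rule 10: U+09B0 and U+09F0 are mixed")
    prev = None  # class of the preceding code point; None at the label start
    for pos, ch in enumerate(label):
        cls = classify(ch)
        if cls is None:
            problems.append(f"position {pos}: U+{ord(ch):04X} is outside this sketch")
        elif cls in MUST_FOLLOW and prev not in MUST_FOLLOW[cls]:
            allowed = ", ".join(sorted(MUST_FOLLOW[cls]))
            problems.append(f"position {pos}: {cls} must be preceded by one of {allowed}")
        elif cls == "V" and prev == "H":
            problems.append(f"position {pos}: rule 8, V is preceded by H")
        prev = cls
    return problems
```

Under these assumptions, the sequence U+099A U+09BE U+0981 U+09A6 passes, while a label that starts with a Matra or that places a Vowel directly after a Hasanta is reported.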

Methodology and Contributors

The Root Zone LGR for the Bengali script was developed by the Neo-Brahmi Generation Panel (NBGP), the members of which have experience in linguistics and computational linguistics in a wide variety of languages written with Neo-Brahmi scripts. The Neo-Brahmi Generation Panel covers nine scripts, each belonging to a separate Unicode block. Each of these scripts has been assigned a separate LGR, with the Neo-Brahmi GP ensuring that the fundamental philosophy behind building each LGR is in sync with all other Brahmi-derived scripts. For further details on methodology and contributors, see Sections 4 and 8 in [Proposal-Bengali], as well as [RZ-LGR-4-Overview].

References

The following general references are cited in this document:

[MSR-4]: Integration Panel, "Maximal Starting Repertoire — MSR-4 Overview and Rationale", 7 February 2019 https://www.icann.org/en/system/files/files/msr-4-overview-25jan19-en.pdf
[Proposal-Bengali]: Neo-Brahmi Generation Panel, "Proposal for a Bangla (Bengali) Script Root Zone Label Generation Rule-Set (LGR)", 20 May 2020, https://www.icann.org/en/system/files/files/proposal-bangla-lgr-20may20-en.pdf
[RFC 7940]: Davies, K. and A. Freytag, "Representing Label Generation Rulesets Using XML", RFC 7940, August 2016, http://www.rfc-editor.org/info/rfc7940
[RFC 8228]: Freytag, A., "Guidance on Designing Label Generation Rulesets (LGRs) Supporting Variant Labels", RFC 8228, August 2017, https://www.rfc-editor.org/info/rfc8228
[RZ-LGR-4]: Integration Panel, "Label Generation Rules for the Root Zone — LGR-4", 05 November 2020 (XML), https://www.icann.org/sites/default/files/lgr/lgr-4-common-05nov20-en.xml
non-normative HTML presentation: https://www.icann.org/sites/default/files/lgr/lgr-4-common-05nov20-en.html
[RZ-LGR-4-Overview]: Integration Panel, "Root Zone Label Generation Rules - LGR-4: Overview and Summary", 05 November 2020 (PDF), https://www.icann.org/sites/default/files/lgr/lgr-4-overview-05nov20-en.pdf
[Unicode 6.3]: The Unicode Consortium. The Unicode Standard, Version 6.3.0, (Mountain View, CA: The Unicode Consortium, 2013. ISBN 978-1-936213-08-5) http://www.unicode.org/versions/Unicode6.3.0/

For references consulted particularly in designing the repertoire for the Bengali script for the Root Zone please see details in the Table of References below. References [0] and [7] refer to the Unicode Standard versions in which the corresponding code points were initially encoded. References [101] and above correspond to sources given in [Proposal-Bengali] justifying the inclusion of the corresponding code points. Entries in the table may have multiple source reference values.

[0] The Unicode Standard 1.1
[7] The Unicode Standard 4.1
[101] Wikipedia, Bengali alphabet, accessed on 2017-11-25 https://en.wikipedia.org/wiki/Bengali_alphabet
[102] Bengali alphabet for Manipuri, found in Omniglot, Manipuri (Meeteilon/ Meithei), accessed on 2019-10-20 https://www.omniglot.com/writing/manipuri.htm
[103] Omniglot, Assamese (অসমীয়া), accessed on 2020-04-28 https://www.omniglot.com/writing/assamese.htm