Unicode Identifier Dilemma
July 24th, 2016 by john.warden@gmail.com

Abstract: In this post I summarize the Unicode Consortium’s Recommendations for Identifier Syntax and Security (TR31 and TR39), and introduce some of the decisions that a language designer needs to make when implementing these.

I also recommend an exact syntax that conforms to the Unicode Consortium’s recommendations for Unicode identifiers while balancing inclusiveness, confusability, and backwards and forwards compatibility. In a series of followup posts, I plan to discuss the reasons for these recommendations in detail.

The Decision

Many languages now allow Unicode characters in identifiers:

Julia

∑(x) = sum(x)

Haskell

type ℚ = Ratio ℤ

Scala

for (n ← 1 to 10) ...

These languages all have different specs, however, for which characters are allowed in identifiers.

There is some debate on whether this is a good idea at all, if only because some coders will have trouble typing non-ASCII characters.

But there is also the issue of confusability, or homographs: different identifiers that look similar or identical, which can cause frustration and bugs at best and security problems at worst. Examples include micro sign µ and Greek mu μ, scope in Latin and ѕсоре in Cyrillic, and, worst of all, invisible control characters.

There is also the question of treatment of superscripts and subscripts, comparison, and backwards compatibility and immutability (so the set of valid identifiers doesn’t change as Unicode evolves).

On the other hand, not only is xᵦ² just pretty neat, it can be more readable than x[beta]^2 for some applications.

But most importantly, Unicode identifiers allow people to code in their preferred language. Certainly there’s some benefit to most open-source code being written and documented in English, the lingua franca of software development, and a coder working in another language will certainly be at a disadvantage. But if the choice is between coding in another language and not coding at all, it may be worth it.

Unicode’s Recommendations for Identifiers

Fortunately, the Unicode folks have thought a lot about what non-ASCII characters make sense in identifiers, and have issued Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax (TR31, which I’ll call the “Identifier Spec”), which includes a syntax for identifiers:

<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

The idea is that there are characters that can start an identifier (letters), a larger set that can continue an identifier (numbers, modifiers), and a few medial characters that can only occur in the middle (hyphens and middle dots).
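To make the shape of this grammar concrete, here is a minimal Perl sketch. The three character classes are ASCII stand-ins chosen purely for illustration (they are assumptions, not the Unicode definitions), and is_identifier is a hypothetical helper name.

```perl
use strict;
use warnings;

# ASCII stand-ins for the three character classes. These are assumptions
# for illustration only; a real profile would use Unicode properties.
my $START    = qr/[A-Za-z_]/;
my $CONTINUE = qr/[A-Za-z0-9_]/;
my $MEDIAL   = qr/-/;

# <Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
my $IDENTIFIER = qr/\A $START $CONTINUE* (?: $MEDIAL $CONTINUE+ )* \z/x;

# Hypothetical helper: 1 if the whole string is an identifier, else 0.
sub is_identifier {
    my ($candidate) = @_;
    return $candidate =~ $IDENTIFIER ? 1 : 0;
}

for my $candidate (qw(total total-count x9 9x -total total- total--count)) {
    printf "%-14s %s\n", $candidate,
        is_identifier($candidate) ? 'matches' : 'rejected';
}
```

Because a Medial character must always be followed by at least one Continue character, an identifier can neither end with a hyphen nor contain two hyphens in a row.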

Unicode defines character properties XID_START and XID_CONTINUE in the Unicode Character Database, which it recommends for use as <Start> and <Continue>, but the spec lets you define these however you like. For example, ‘_’ (underscore) is not in XID_START, but the spec suggests you might want to include it anyway.

XID_START is the international equivalent of /[a-zA-Z]/, and is made up of:

  1. Letters (Lu, Ll, Lt, Lm, Lo) and letter numbers (Nl), such as Roman numerals
  2. A handful of exceptions, such as U+212E ( ℮ ) ESTIMATED SYMBOL, ‘grandfathered’ for backwards compatibility.

XID_CONTINUE is a superset of XID_START, and adds:

  1. Marks (Mn, Mc): accents, etc. that combine with the characters before them
  2. Decimal digits (Nd): the international equivalent of /[0-9]/
  3. Connector Punctuation (Pc): the international equivalent of /_/
  4. About a dozen miscellaneous characters, such as U+00B7 MIDDLE DOT
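Perl exposes both properties in regular expressions, so it is easy to check where a given character falls. The following is only a sketch: xid_class is a hypothetical helper, the sample characters are my own picks, and the answers depend on the Unicode version bundled with your Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a single character under the default
# XID_Start / XID_Continue properties.
sub xid_class {
    my ($char) = @_;
    return 'start'    if $char =~ /\A\p{XID_Start}\z/;
    return 'continue' if $char =~ /\A\p{XID_Continue}\z/;
    return 'neither';
}

my @samples = (
    [ "a",        'LATIN SMALL LETTER A'     ],
    [ "\x{03BB}", 'GREEK SMALL LETTER LAMDA' ],
    [ "\x{212E}", 'ESTIMATED SYMBOL'         ],
    [ "9",        'DIGIT NINE'               ],
    [ "_",        'LOW LINE'                 ],
    [ "\x{0301}", 'COMBINING ACUTE ACCENT'   ],
    [ "\x{00B7}", 'MIDDLE DOT'               ],
    [ "-",        'HYPHEN-MINUS'             ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    printf "U+%04X %-26s %s\n", ord($char), $name, xid_class($char);
}
```

Note that 'continue' here means "continue only": every XID_Start character is also in XID_Continue, so the helper reports the most permissive position a character may take.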

The identifier spec also suggests a small set of additional optional punctuation characters to use as <Medial>, including a variety of hyphens, middle-dots, apostrophes, and format control characters, which it exhorts you to use unless you have a good reason not to (though you probably have a good reason not to use some of them).

Normalization and Security

Unicode defines several normalization forms; the two relevant here are NFC and NFKC, which can help address the issue of confusable identifiers (that look the same but aren’t). But normalization intentionally does not eliminate homoglyphs: distinct characters that look the same (for example Latin ‘o’ and Cyrillic ‘о’).
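A small Perl sketch shows the difference, using the core Unicode::Normalize module. The helper name equal_under and the three sample pairs are mine, chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: true if the two strings become identical under the
# given normalization function.
sub equal_under {
    my ($normalize, $left, $right) = @_;
    return $normalize->($left) eq $normalize->($right) ? 1 : 0;
}

my @pairs = (
    [ 'e-acute vs. e + combining acute', "\x{00E9}", "e\x{0301}" ],
    [ 'micro sign vs. Greek mu',         "\x{00B5}", "\x{03BC}"  ],
    [ 'Latin o vs. Cyrillic o',          "o",        "\x{043E}"  ],
);

for my $pair (@pairs) {
    my ($label, $left, $right) = @$pair;
    printf "%-32s NFC: %-9s NFKC: %s\n", $label,
        (equal_under(\&NFC,  $left, $right) ? 'same' : 'different'),
        (equal_under(\&NFKC, $left, $right) ? 'same' : 'different');
}
```

NFC unifies only the composed and decomposed spellings of é. NFKC additionally folds the micro sign into Greek mu. Neither form touches the Latin/Cyrillic pair, which is why normalization alone is not enough.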

So to further reduce the possibility of identifier confusion, the identifier spec recommends excluding certain obsolete or limited use scripts.

Then there is a completely separate document, Unicode Technical Standard #39, Unicode Security Mechanisms (TR39, which I’ll call the “Security Spec”), which provides recommendations for further restricting certain characters and mixed-script identifiers.

Implementing these recommendations drastically reduces and sanitizes identifier space, removing most of the weirdness and confusability that comes from invisible characters and homoglyphs.

Other Considerations

The identifier spec also includes recommendations and alternatives for addressing:
– stability (backwards compatibility)
– immutability (forward compatibility)
– comparison/equality (case sensitive and insensitive)
– use of format control characters ZWJ and ZWNJ
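As one illustration of the comparison question, here is a simplified sketch of case-insensitive identifier equality in Perl. It assumes locale-independent full case folding via the built-in fc (Perl 5.16 or later) combined with NFC; caseless_equal is a hypothetical name, and this is not the exact procedure the Identifier Spec prescribes.

```perl
use v5.16;      # enables fc and unicode_strings (and strict)
use warnings;
use Unicode::Normalize 'NFC';

# Hypothetical helper: normalize, case-fold, then normalize again, since
# folding can produce sequences that are no longer in NFC.
sub caseless_equal {
    my ($left, $right) = @_;
    return NFC(fc(NFC($left))) eq NFC(fc(NFC($right))) ? 1 : 0;
}

my @pairs = (
    [ 'Total vs. TOTAL',               "Total",         "TOTAL"    ],
    [ 'sharp s vs. SS',                "Stra\x{00DF}e", "STRASSE"  ],
    [ 'capital sigma vs. final sigma', "\x{03A3}",      "\x{03C2}" ],
    [ 'total vs. totals',              "total",         "totals"   ],
);

for my $pair (@pairs) {
    my ($label, $left, $right) = @$pair;
    printf "%-32s %s\n", $label,
        caseless_equal($left, $right) ? 'equal' : 'different';
}
```

Folding rather than lowercasing is what makes the sharp s and final sigma cases compare equal.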

So this leaves you with a lot of decisions to make. Do you restrict some or all characters recommended in the security spec? Do you allow hyphens, apostrophes, and middle dots in identifiers? Do you allow format control characters in some situations? Do you NFC or NFKC normalize? How do you define identifier equality? Do you make your identifiers “immutable” so that parsers built against different versions of Unicode don’t disagree on what an identifier is?

Some of these issues are tricky, and some of the Unicode recommendations are flawed. I will discuss these issues in detail in followup posts.

But jumping straight to conclusions, here is a specific set of recommendations for a balanced, reasonable definition of a Unicode identifier.

IPIC9: An Immutable Profile for Identifiers in Code

1. Base Syntax

  • <Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

Define <Start> as:

  • XID_START and U+005F LOW LINE

Define <Continue> as:

  • XID_CONTINUE, and U+05F3 HEBREW PUNCTUATION GERESH
  • But not U+00B7 MIDDLE DOT or U+0387 GREEK ANO TELEIA (these are Medial instead; NFC normalization maps U+0387 to U+00B7)

Define <Medial> as:

  • U+002D, U+2010, U+00B7, U+2027, U+30FB, U+30A0, U+058A, U+05F4, U+0F0B

These are: HYPHEN-MINUS, HYPHEN, MIDDLE DOT, HYPHENATION POINT, KATAKANA MIDDLE DOT, KATAKANA DOUBLE HYPHEN, ARMENIAN HYPHEN, HEBREW PUNCTUATION GERSHAYIM, and TIBETAN MARK INTERSYLLABIC TSHEG.

2. Normalization

Then normalize with the following steps:

  1. NFC normalize
  2. Convert U+2010 HYPHEN, U+30A0 KATAKANA DOUBLE HYPHEN, and U+058A ARMENIAN HYPHEN to U+002D HYPHEN-MINUS
  3. Convert U+2027 HYPHENATION POINT, U+30FB KATAKANA MIDDLE DOT, and U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG to U+00B7 MIDDLE DOT
  4. Convert U+002D HYPHEN-MINUS following …
    1. … an Armenian character to U+058A ARMENIAN HYPHEN
    2. … a Hiragana or Katakana character to U+30A0 KATAKANA DOUBLE HYPHEN
  5. Convert U+00B7 MIDDLE DOT following …
    1. … a Tibetan character to U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
    2. … a Hiragana or Katakana character to U+30FB KATAKANA MIDDLE DOT

3. Restrictions

After normalizing, reject any identifier that:

  1. contains characters not labeled as “Allowed” in the identifier character whitelist from the Security Spec.
  2. contains characters from more than one script

And don’t ever update the whitelist, even as new versions of Unicode are released. This way the definition of an identifier is immutable: it will never change even as Unicode evolves.

Don’t worry about losing out on important Unicode updates. First, Unicode guarantees backwards compatibility of identifiers: characters will only be added to this list. Second, Unicode 9.0.0 is so extremely comprehensive that characters are added to the whitelist very infrequently. Between Unicode 7.0.0 and 9.0.0, only 14 characters were added to Unicode that made it onto this list!

Sample Code

The following sample Perl code provides an implementation of this specification:

Regexes for Matching Identifiers

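# <Start>: XID_START plus underscore
my $START = qr/[\p{XID_START}_]/;
# <Medial>: the nine characters listed in the spec above
my $MEDIAL = qr/[\x{002d}\x{2010}\x{00b7}\x{2027}\x{30FB}\x{30A0}\x{058a}\x{05f4}\x{0f0b}]/;
# <Continue>: XID_CONTINUE plus Hebrew geresh
my $CONTINUE = qr/[\p{XID_CONTINUE}\x{05F3}]/;
# XID_CONTINUE still contains the two middle dots, so forbid them at the end
my $NONEND = qr/[\x{0387}\x{00B7}]/;
my $WORD_IDENTIFIER_REGEX = qr/$START$CONTINUE*(?:$MEDIAL$CONTINUE+)*(?<!$NONEND)/;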

Normalization

use Carp 'croak';
use Unicode::Normalize 'NFC';

sub normalize_identifier {
  my $string = shift;

  croak "Expected a string argument"
    if not defined $string;

  $string = NFC($string);

  # Hyphen, Armenian Hyphen, and Katakana Double Hyphen to hyphen-minus
  $string =~ s/[\x{2010}\x{058A}\x{30A0}]/\x{002D}/g;

  # Hyphenation Point, Katakana middle dot, and Tibetan tsheg to middle dot
  $string =~ s/[\x{2027}\x{30FB}\x{0F0B}]/\x{00B7}/g;

  ### Context-specific normalizations (each looks at the preceding character, as the spec says "following")
  $string =~ s/(?<=\p{Tibetan})\x{00B7}/\x{0F0B}/g; # middle dot to Tibetan tsheg
  $string =~ s/(?<=[\p{Katakana}\p{Hiragana}])\x{00B7}/\x{30FB}/g; # middle dot to Hiragana/Katakana middle dot
  $string =~ s/(?<=\p{Armenian})\x{002D}/\x{058A}/g; # hyphen to Armenian hyphen
  $string =~ s/(?<=[\p{Katakana}\p{Hiragana}])\x{002D}/\x{30A0}/g; # hyphen to Hiragana/Katakana double hyphen

  return $string;
}

Restriction

use Unicode::Normalize 'NFKC';
use Unicode::Security 'mixed_script';

sub is_restricted_identifier {
  my $identifier = shift;

  return 'mixed-script'
    if mixed_script($identifier);

  return 'disallowed'
    if $identifier =~ /\P{InTR39AllowedCharacters}/;

  return 0;
}

sub InTR39AllowedCharacters {
  # List copied from: http://www.unicode.org/Public/security/9.0.0/IdentifierStatus.txt
  return q{0027
002D 002E
0030 003A
...etc...
  };
}

Other Recommendations

Here are some other things to consider when implementing a Unicode-based language:

  • The Unicode identifier spec covers word-like identifiers only. Create a separate lexical class of math-and-punctuation based identifiers (i.e. operators) comprising sequences of math symbols (Sm) and any ASCII punctuation not reserved for other purposes in your language.

  • This spec disallows non-NFKC characters, such as superscripts and subscripts. But that doesn’t mean you can’t allow superscripts in your source: just give superscripts and subscripts semantic value. For example, you can make xⱼⁿ⁺⁹ syntactic sugar for exp(nth(x, j), n+9).

  • Create a separate class of identifiers for the mathematical alphanumeric symbols such as 𝘽 and for non-NFKC font variants of math symbols. These were designed to be visually non-confusable.

  • Emit syntax errors with specific reasons why identifiers are rejected (non-NFKC, mixed script, contains restricted characters per TR39, starts with a Medial character, etc.)

  • If your identifiers are case insensitive, do case-folded comparisons with the locale explicitly set to US English or “empty” (root) for consistent behavior.

  • Require space around mathematical minus sign U+2212 to avoid confusion with hyphen.
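The suggestion above about emitting specific rejection reasons could be sketched roughly as follows. Everything here is an illustrative assumption: rejection_reason is a hypothetical name, the Medial set is cut down to two characters, and the mixed-script test is a simplification that uses Unicode::UCD's charscript and ignores Script_Extensions, so it is not the full TR39 algorithm.

```perl
use strict;
use warnings;
use Unicode::Normalize 'NFKC';
use Unicode::UCD 'charscript';

# Assumed, cut-down Medial set for illustration: hyphen-minus and middle dot.
my $MEDIAL = qr/[\x{002D}\x{00B7}]/;

# Hypothetical helper: return a human-readable reason for rejecting
# $candidate, or undef if none of these simplified checks fail.
sub rejection_reason {
    my ($candidate) = @_;

    return 'is empty'                       if !length $candidate;
    return 'starts with a Medial character' if $candidate =~ /\A$MEDIAL/;
    return 'ends with a Medial character'   if $candidate =~ /$MEDIAL\z/;
    return 'contains non-NFKC characters'   if NFKC($candidate) ne $candidate;

    # Simplified mixed-script test: ignore Common and Inherited characters,
    # then require that at most one script remains.
    my %scripts;
    for my $char (split //, $candidate) {
        my $script = charscript(ord $char) // 'Unknown';
        $scripts{$script} = 1 unless $script =~ /\A(?:Common|Inherited)\z/;
    }
    return 'mixes scripts: ' . join(', ', sort keys %scripts)
        if keys(%scripts) > 1;

    return;
}

my @samples = (
    [ 'Latin "scope"',             "scope"        ],
    [ 'Latin with one Cyrillic o', "sc\x{043E}pe" ],
    [ 'x with superscript two',    "x\x{00B2}"    ],
    [ 'leading hyphen',            "-scope"       ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    my $reason = rejection_reason($text);
    printf "%-28s %s\n", $label, defined $reason ? $reason : 'accepted';
}
```

A check like this only catches mixed-script identifiers. An identifier written entirely in Cyrillic look-alikes is single-script and would still pass, which is a separate problem from the one this sketch addresses.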

Summary

If you implement the above spec, you can’t go too far wrong. It conforms to both Unicode’s identifier spec (TR31) and security spec (TR39). It is immutable (both backwards and forwards compatible), and minimizes confusability. It is highly inclusive, allowing virtually all letters, accents, and medial punctuation in current use in all living languages. It disallows just a few of Unicode’s recommended optional medial punctuation characters that are truly problematic (period, colon, apostrophe, and usually-invisible control characters). It restricts potentially confusable punctuation characters to their appropriate scripts, and reserves non-NFKC variants of characters (superscript, subscript, small, full-width, bold, double-struck, etc.) for you to assign specific semantics appropriate to your language.

Followup

I’ve written followup posts with my reasoning for these recommendations:
