Immutability...it's like, woah
August 18th, 2016 by john.warden@gmail.com

In a previous post I introduced the Unicode Consortium’s Identifier Spec (TR 31) and Security Spec (TR 39), and explain my own recommendations for implementing Unicode Identifiers in conformance with the Unicode recommendations, which is to use a fixed whitelist of allowed identifier characters.

The Problem

As new versions of Unicode are released, what qualifies as an identifier could change.

For this reason, TR31 includes recommendations for immutable identifiers. The idea behind these is both backwards and forwards compatibility (or upward compatibility): make it so parsers/specs written against one version of Unicode recognize the same identifiers as parsers/specs written against future versions of Unicode. This approach is used in the XML 1.1 Recommendation.

The Unicode Consortium’s Recommendation

The Unicode recommended approach (UAX31-R2) allows all characters in identifiers (whether or not they are assigned in any specific version of Unicode), except known punctuation/control/whitespace/etc. Because this approach is so inclusive of potentially inappropriate characters (emojis, ideograms, etc), it suggests that alternatively you can start with valid identifier characters (from XID_START and XID_CONTINUE) from a specific version of Unicode, and then allow any character that aren’t assigned in that version.

This approach essentially decouples your spec/parser from the Unicode Character Database. But the result is that you lack information about the properties of parsed identifiers, such as what script they belong to, and their NFC/NFKC/case-folded forms.

This makes it impossible to implement the recommended restrictions from the Security Spec (restricted scripts, non-NFKC, etc.). And most importantly, it makes it impossible to compare identifiers for equality, which requires NFC normalization at the very least. Indeed UAX31-R2 ends with the recommendation that identifiers not use any unassigned characters.

I emailed Mark Davis of the Unicode Consortium with an earlier version of this post, and he explained:

The immutable identifiers are designed for those cases (like XML) that can’t update across versions of Unicode. They are not particularly recommended. Any system that *can* update across versions of Unicode (as your proposal requires) are instead recommended to use the default identifiers.

So in other words, decoupling your identifier spec from the UCD like this makes sense for XML because it doesn’t concern itself with what you do with the tags/attributes in the parsed document, it just needs to decide “identifier/not-identifier”. But for a programming language, you can and probably should disallow unassigned characters so you can use the UCD to understand the properties of identifiers, most importantly so you can NFC/NFKC normalize identifiers and compare them for equality.

My Recommendation: Fixed Whitelist of Allowed Characters

Now, immutability is still a neat idea for a programming language. As a language designer you will be swimming in complexity and change: it would be nice if there was one aspect that was set in stone once and for all. The definition of an identifier could be one of those things.

Another way to achieve immutability is to simply use a fixed white list of allowed Unicode characters. I recommend using the latest list of allowed characters from the Security Spec as your identifier character whitelist, and then never upgrading this list.

Your gut reaction to forever pegging your language spec to a specific version of Unicode is probably “no“! Software developers are used to upgrading to the latest versions of everything.

But Unicode is a standard, not code, and as of Unicode 9.0, the Unicode character set is incredibly complete. We’ve gone beyond encoding all the world’s currently used languages to encoding obscure academic scripts and emoji. If there’s a potentially legal identifier character in some language that somebody might actually program in, it’s almost certainly already in Unicode.

The suggested identifier character whitelist data file contains the version of Unicode in which each character was added. A quick search shows that there are only 11 characters characters added to Unicode between 8.0 and 9.0 that made it onto the whitelist, and another 9 previously existing characters were added to the whitelist (from the tamil script). In other words, evolution of this list has ground to a near halt.

Summary

You will still probably want to use the default syntax for identifiers to make sure your identifier doesn’t, say, start with a number, and to allow optional medial punctuation (e.g. hyphens). This syntax relies on the XID_START and XID_CONTINUE character properties from the Unicode Character Database (plus any optional characters you choose to add). You are probably using the version of the Unicode Character Database that is bundled with whatever language you are using to build your parser, which is probably less than the latest version (9.0 at the time of this writing). This means that the set of allowed identifier characters will be a subset of the characters in the identifier character whitelist, which is based on Unicode 9. So, as your UCD is upgraded, you will “grow in” to the full list of allowed characters in the whitelist.

If you follow this recommendation, you still have some choices to make about optional medial characters, restricting mixed-script identifiers, and NFC/NFKC normalization. I have defined an identifier profile called IPIC9 (Immutable Profile for Identifiers in Code), which uses the recommendation above and also specifies exactly which optional characters to allow, with specific rules for normalizing and restricting identifiers.

Using this spec will give you an extremely exhaustive, very clean set of legal identifier characters that will grow to a certain point and then never change.

Posted in Programming Language Design Tagged with: , , ,

punctuation
August 12th, 2016 by john.warden@gmail.com

In a previous post I introduced the Unicode Consortium’s specs for Unicode Identifier Syntax and Security (TRs 31 and 39), and summarized my own recommendations, in cases where the TRs leave you with options.

The Unicode Identifier Spec recommends a handful of optional punctuation characters to allow in unicode identifiers. In this post I make specific recommendations for which of these optional characters you really should allow.

Hyphenation

The identifier spec recommends allowing both the hyphen (U+2010) and the ASCII hyphen-minus (U+002D) in the middle of identifiers (<Medial> characters).

They are also included in the security spec’s identifier character whitelist. However, although it is not clear from the spec, these characters should only be allowed as <Medial> characters. An identifier should not start or end with a hyphen.

Unfortunately, allowing both hyphen and hyphen-minus can create confusion, since these are not the same under either NFC or NFKC normalization. For example, the following would be valid, but distinct, identifiers.

first-rate
first-rate

This possibility is mentioned in TR36 Single Script Spoofing. However as of Unicode 9.0 the security spec doesn’t make recommendations for dealing with this kind of single-script spoofing.

Recommendation: Allow Hyphen-Minus in Identifiers, Make Hyphen a Non-Canonical Alternative

If you follow the recommendation in the identifier spec to allow hyphen (U+2010) in medial position, then make the ASCII hyphen-minus the canonical form.

Even though the hyphen was added to Unicode as the ‘unambiguous’ hyphen, a hyphen-minus sandwiched between two words in an identifier such as pretty-print has unambiguous hyphen semantics, whereas as “pretty – print” obviously means “pretty minus print” (or a formatting mistake)

Plus the ASCII hyphen-minus has a tradition of use inside identifiers in many languages, and it’s probably best to make sure the set of valid Unicode identifiers is a superset of valid ASCII identifiers.

Canonicalizing the proper hyphen to ASCII hyphen-minus will make life easier for coders in some cases, such as when documentation formatters convert a hyphen-minus in identifiers to a hyphen for display. A coder who unwittingly copies-and-pastes non-ASCII hyphens into their code should be none the wiser.

Recommendation: Require Proper Spacing around Mathematical Minus and Hyphen

Furthermore, I recommend that the mathematical minus sign − (U+2212) be disallowed in places where it looks like a hyphen:

> first−rate

ERROR: Mathematical minus sign (U+2212) used as hyphen. Please use the ‘-‘ character (U+002D hyphen-minus) instead, or place space around the minus sign.

And the hyphen should (U+2010) be disallowed in places where it looks like a minus:

> first ‐ rate

ERROR: Hyphen (U+2212) cannot be used by itself as symbol. If you meant minus, use ‘-‘ (U+002D hyphen-minus) instead.

Don’t parse ‘first-rate’ differently depending on whether a mathematical minus or a hyphen is used. Code that looks the same should work the same. The proper hyphen and mathematical minus were introduced to Unicode to allow clear semantic distinctions between hyphen and minus, but a hidden semantic distinction doesn’t justify visual confusion. Your code should be flexible enough to distinguish between hyphen and minus based on context, but strict enough to reject semantically inappropriate use of either of them.

Furthermore, I recommend considering mathematical minus and hyphen-minus to be canonical equivalent when used as part of non-word identifiers. For example this is clearly a minus:

a - b

Apostrophes

The identifier spec also recommends apostrophe in optional medial characters.

Recommendation: Disallow Apostrophe

Unlike the hyphen, the apostrophe does not have a tradition of inclusion in identifiers in many computer languages. Plus, the syntax around the apostrophe doubling as a single quote character would make both visual and actual parsing quite difficult. Programmers who speak languages that have apostrophes in words have been dealing with omitting them from identifiers for a long time, and I don’t sense the demand for apostrophes in languages as there is for hyphens. Solving this problem is probably not worth the problems it can create.

If you follow this recommendation, do not allow the right single quotation mark either.

Disallow:

U+0027 ' APOSTROPHE
U+2019 ’ RIGHT SINGLE QUOTATION MARK

Middle Dots

Recommendation: Allow Middle Dot

The Catalans are a passionate people fiercely proud of their language, many of whom will be happy to use Catalan words in their code, which can have middle dots. Variants of the middle dot are also used in other languages including Katakana and Chinese, so it’s best to allow them.

00B7  · MIDDLE DOT
0387  · GREEK ANO TELEIA

U+0387 GREEK ANO TELEIA is a non-NFC equivalent to middle dot, so should be treated the same as the middle dot.

Recommendation: Normalize Hyphenation Point

The Hyphenation Point is confusable with U+00B7 middle dot. So allow it, but normalize to middle dot.

2027  ‧ HYPHENATION POINT

Recommendation: Disallow Middle Dots at End of an Identifier

The middle dot is a recommended medial character, but is also in XID_CONTINUE, which allows it to appear at the end of an identifier too. I recommend you disallow middle dot specifically at the end of identifiers.

Non-ASCII Punctuation

The identifier spec recommends several other optional characters from non-latin scripts. Some of these are confusable with the hyphen and middle dot.

For example, U+30FB KATAKANA MIDDLE DOT looks a lot like a regular middle dot, and U+058A ARMENIAN HYPHEN looks like a hyphen-minus if you don’t look at it closely.

The confusability issue is partly solved if your language disallows mixed-script identifiers. It would prevent for example an Armenian hyphen being used anywhere but between Armenian characters. But because hyphen-minus is part of the Common script, mixed-script detection does not prevent a regular hyphen-minus from being placed between Armenian characters!

Recommendation: Normalize Non-Ascii Hyphens and Middle Dots Based on Context

The following three medial characters belong to specific scripts, and should be used in place of hyphens and middle dots in those scripts.

  1. Hiragana and Katakana: 30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
  2. Armenian: 058A ֊ ARMENIAN HYPHEN

Middle Dots:

  1. Hiragana and Katakana: 30FB ・ KATAKANA MIDDLE DOT
  2. Tibetan: 0F0B ་ TIBETAN MARK INTERSYLLABIC TSHEG

To normalize these characters, use the following rules:

  1. If any above characters follow a character from a different script, they should be normalized to hyphen minus or latin middle dot, respectively.
  2. If any a hyphen or middle dot follows a character from any of these scripts, they should be converted to the appropriate character from that script.

Examples:

  1. first֊rate (with Armenian hyphen) => first-rate
  2. il་lusio (with Tibetan tsheg) => il·lusio
  3. first・rate (with Hiragana/Katakana middle dot) => first·rate
  4. ウォルドルフ·アストリア (with Latin middle dot) => ウォルドルフ・アストリア (with Hiragana/Katakana middle dot)
  5. ウォルドルフ֊アストリア (with Armenian hyphen) => ウォルドルフ゠アストリア (with Hiragana/Katakana double hyphen)
  6. ཙ·ཚ (with latin middle dot) => ཙ་ཚ (with Tibetan tsheg)
  7. հայերեն֊հայերեն (with hyphen-minus) => հայերեն֊հայերեն (with Armenian hyphen)
  8. il・lusio (with Hiragana/Katakana middle dot) => il·lusio (with Latin middle dot)

Recommendation: Allow Recommended Hebrew Punctuation Characters

If your language dis-allows mixed-script identifiers as recommended in my last post (and in the Unicode the security spec), the following characters can only be used after Hebrew characters. Furthermore, although they could be confused with apostrophes and double-quotes, these characters are not allowed in identifiers.

Medial:

05F4  ״ HEBREW PUNCTUATION GERSHAYIM

Continue:

05F3    ׳   HEBREW PUNCTUATION GERESH

Miscellaneous ASCII Punctuation

The identifier spec recommends also allowing the following characters unless you have a compelling reason not to.

Start:

0024    $   DOLLAR SIGN
005F    _   LOW LINE

Medial:

002E    .   FULL STOP
003A    :   COLON

Recommendation: Allow _ but not $

A lot of mainstream languages allow the underscore (or low line) anywhere in identifiers, so there is no compelling reason to disallow it.

The same is not true of the dollar sign. The dollar sign, like other miscellaneous ASCII characters such as @, %, ~, is often allowed as part of non-word identifiers such as >=, ||, ->, ++, <$>. So I recommend allowing these in a separate class of non-word identifiers. Scala is an example of a language that takes this approach, having word identifiers, and ‘operator identifiers’ comprising misc. ASCII characters and Unicode math symbols.

Recommendation: Disallow . and :

‘.’ and ‘:’, like ‘/’, are often used in languages for separating the parts of ‘paths’ or ‘fully qualified’ names.

But, in many languages, it is best to think of these as just syntax for expressing a composite identifier with multiple parts, and not part of the content of the identifier itself. So I recommend disallowing these characters ‘identifiers’, and only allowing them in ‘paths’ or ‘namespaces’ or ‘fully qualified names’.

Format Control Characters

The ZWJ (U+200D ZERO WIDTH JOINER) and ZWNJ (U+200C ZERO WIDTH NON-JOINER) characters are invisible, except when they affect the appearance of certain pairs of characters when placed between them. Although initially intended solely for formatting, ZWJ and ZWNJ now can actually change the meaning of some words.

The identifier spec provides a regex for identifying places in text when ZWJ/ZWNJ might cause an actual visual distinction. However, there are still many places this doesn’t catch, and many terminals and text editors don’t know how to render these correctly anyway.

Recommendation: Elide ZWJ/ZWNJ in Case-Insensitive Identifiers

200C    ZERO WIDTH NON-JOINER*
200D    ZERO WIDTH JOINER*

Now, case-folding elides ZWJ/ZWNJ, so if your identifiers are case-insensitive (meaning you are case folding them before comparing them), allowing them will not create confusability issues, since two otherwise equal identifiers will still be equal if one has a ZWJ/ZWNJ character and the other doesn’t. So for purposes of improved readability, I recommend allowing but eliding ZWJ/ZWNJ characters for case-insensitive identifiers.

International domain names (IDN) also allow but elide ZWJ/ZWNJ characters based on the same logic.

Recommendation: Disallow ZWJ/ZWNJ in Case-Sensitive identifiers

If identifiers in your language are case-sensitive, then I recommend that you simply disallow these characters for now.

None of the Unicode normalization forms knows how to handle ZWJ/ZWNJ correctly, by removing them when they are invisible or adding them when it’s more correct. I think it might be possible to create a normalization algorithm in the future that can do this. But if in the meantime these characters were allowed, you couldn’t incorporate a proper ZWJ/ZWNJ normalization in the future without breaking backwards compatibility.

So disallowing ZWJ/ZWNJ will mean certain words can’t be used as identifiers in their proper spelling, but programmers have been dealing with this forever (e.g. I can’t use can't as an identifier in most languages but it’s ok). And it leaves open the possibility of a proper implementation of ZWNJ identifiers in the future.

Summary

I have summarized all my recommendations for identifiers in a spec I call IPIC9 (immutable profile for identifiers in code).

Posted in Programming Language Design Tagged with: , , , , , , ,