punctuation
August 12th, 2016 by john.warden@gmail.com

In a previous post I introduced the Unicode Consortium’s specs for Unicode Identifier Syntax and Security (TRs 31 and 39), and summarized my own recommendations, in cases where the TRs leave you with options.

The Unicode Identifier Spec recommends a handful of optional punctuation characters to allow in Unicode identifiers. In this post I make specific recommendations for which of these optional characters you really should allow.

Hyphenation

The identifier spec recommends allowing both the hyphen (U+2010) and the ASCII hyphen-minus (U+002D) in the middle of identifiers (<Medial> characters).

They are also included in the security spec’s identifier character whitelist. However, although it is not clear from the spec, these characters should only be allowed as <Medial> characters. An identifier should not start or end with a hyphen.

Unfortunately, allowing both hyphen and hyphen-minus can create confusion, since these are not the same under either NFC or NFKC normalization. For example, the following would be valid, but distinct, identifiers.

first‐rate (with U+2010 hyphen)
first-rate (with U+002D hyphen-minus)

This possibility is mentioned in TR36 under Single Script Spoofing. However, as of Unicode 9.0, the security spec doesn’t make recommendations for dealing with this kind of single-script spoofing.

Recommendation: Allow Hyphen-Minus in Identifiers, Make Hyphen a Non-Canonical Alternative

If you follow the recommendation in the identifier spec to allow hyphen (U+2010) in medial position, then make the ASCII hyphen-minus the canonical form.

Even though the hyphen was added to Unicode as the ‘unambiguous’ hyphen, a hyphen-minus sandwiched between two words in an identifier such as pretty-print has unambiguous hyphen semantics, whereas “pretty - print” obviously means “pretty minus print” (or a formatting mistake).

Plus the ASCII hyphen-minus has a tradition of use inside identifiers in many languages, and it’s probably best to make sure the set of valid Unicode identifiers is a superset of valid ASCII identifiers.

Canonicalizing the proper hyphen to ASCII hyphen-minus will make life easier for coders in some cases, such as when documentation formatters convert a hyphen-minus in identifiers to a hyphen for display. A coder who unwittingly copies-and-pastes non-ASCII hyphens into their code should be none the wiser.
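A minimal Scala sketch of this canonicalization, assuming identifiers arrive as plain strings and that position checks (medial only) happen elsewhere. The object and method names are made up for illustration.

```scala
object HyphenCanonicalizer {
  private val Hyphen = '\u2010' // HYPHEN, the non-canonical alternative

  /** Replace every U+2010 with U+002D so both spellings compare equal. */
  def canonicalize(identifier: String): String =
    identifier.replace(Hyphen, '-')
}
```

With this in place, an identifier pasted from formatted documentation and one typed on a keyboard end up as the same string before comparison.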

Recommendation: Require Proper Spacing around Mathematical Minus and Hyphen

Furthermore, I recommend that the mathematical minus sign − (U+2212) be disallowed in places where it looks like a hyphen:

> first−rate

ERROR: Mathematical minus sign (U+2212) used as hyphen. Please use the ‘-’ character (U+002D hyphen-minus) instead, or place space around the minus sign.

And the hyphen (U+2010) should be disallowed in places where it looks like a minus:

> first ‐ rate

ERROR: Hyphen (U+2010) cannot be used by itself as a symbol. If you meant minus, use ‘-’ (U+002D hyphen-minus) instead.

Don’t parse ‘first-rate’ differently depending on whether a mathematical minus or a hyphen is used. Code that looks the same should work the same. The proper hyphen and mathematical minus were introduced to Unicode to allow clear semantic distinctions between hyphen and minus, but a hidden semantic distinction doesn’t justify visual confusion. Your code should be flexible enough to distinguish between hyphen and minus based on context, but strict enough to reject semantically inappropriate use of either of them.
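The two rejection rules above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: "looks like a hyphen" is taken to mean a letter or digit directly on both sides, and "looks like a minus" is taken to mean whitespace (or the edge of the input) on at least one side. The names are hypothetical.

```scala
object DashCheck {
  private val Minus = 0x2212  // MINUS SIGN
  private val Hyphen = 0x2010 // HYPHEN

  /** Returns an error message for the first misused dash, if any. */
  def firstError(source: String): Option[String] = {
    def wordCharAt(i: Int): Boolean =
      i >= 0 && i < source.length && Character.isLetterOrDigit(source.charAt(i))
    def spaceOrEdgeAt(i: Int): Boolean =
      i < 0 || i >= source.length || Character.isWhitespace(source.charAt(i))

    source.indices.collectFirst {
      // A minus squeezed between two word characters reads as a hyphen.
      case i if source.charAt(i) == Minus && wordCharAt(i - 1) && wordCharAt(i + 1) =>
        "Mathematical minus sign (U+2212) used as hyphen"
      // A hyphen standing apart from words reads as a minus.
      case i if source.charAt(i) == Hyphen && (spaceOrEdgeAt(i - 1) || spaceOrEdgeAt(i + 1)) =>
        "Hyphen (U+2010) cannot be used by itself as a symbol"
    }
  }
}
```

A spaced mathematical minus and a hyphen between two words both pass, which matches the idea that context, not the code point alone, decides what is acceptable.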

Furthermore, I recommend considering mathematical minus and hyphen-minus to be canonically equivalent when used as part of non-word identifiers. For example, this is clearly a minus:

a - b

Apostrophes

The identifier spec also includes the apostrophe among its optional medial characters.

Recommendation: Disallow Apostrophe

Unlike the hyphen, the apostrophe does not have a tradition of inclusion in identifiers in many computer languages. Plus, because the apostrophe doubles as a single quote character, allowing it would make both visual and actual parsing quite difficult. Programmers who speak languages that have apostrophes in words have been omitting them from identifiers for a long time, and I don’t sense the same demand for apostrophes as there is for hyphens. Solving this problem is probably not worth the problems it can create.

If you follow this recommendation, do not allow the right single quotation mark either.

Disallow:

U+0027 ' APOSTROPHE
U+2019 ’ RIGHT SINGLE QUOTATION MARK

Middle Dots

Recommendation: Allow Middle Dot

The Catalans are a passionate people fiercely proud of their language, many of whom will be happy to use Catalan words in their code, which can have middle dots. Variants of the middle dot are also used in other languages, including Japanese (written in Katakana) and Chinese, so it’s best to allow them.

00B7  · MIDDLE DOT
0387  · GREEK ANO TELEIA

U+0387 GREEK ANO TELEIA is canonically equivalent to the middle dot (NFC normalization replaces it with U+00B7), so it should be treated the same as the middle dot.

Recommendation: Normalize Hyphenation Point

The Hyphenation Point is confusable with U+00B7 middle dot. So allow it, but normalize to middle dot.

2027  ‧ HYPHENATION POINT
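As a rough sketch of the two middle-dot recommendations, the following Scala uses the JDK's java.text.Normalizer for NFC (which maps U+0387 to U+00B7) and then folds the hyphenation point by hand, since no standard normalization form does that. The object and method names are assumptions for illustration.

```scala
import java.text.Normalizer

object MiddleDots {
  /** NFC turns GREEK ANO TELEIA into MIDDLE DOT; U+2027 is mapped explicitly. */
  def normalize(identifier: String): String =
    Normalizer
      .normalize(identifier, Normalizer.Form.NFC)
      .replace('\u2027', '\u00B7')
}
```

After this step, all three dot spellings of the same word compare equal.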

Recommendation: Disallow Middle Dots at End of an Identifier

The middle dot is a recommended medial character, but is also in XID_CONTINUE, which allows it to appear at the end of an identifier too. I recommend you disallow middle dot specifically at the end of identifiers.
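One way to enforce medial-only placement is a simple positional check. The sketch below assumes the medial set is just hyphen-minus and middle dot and that the identifier has already been normalized. Both the set and the names are illustrative.

```scala
object MedialCheck {
  // Assumed set of medial-only punctuation for this sketch.
  private val medial = Set('-', '\u00B7')

  /** True when no medial character sits at the start or end of the identifier. */
  def medialPlacementOk(identifier: String): Boolean =
    identifier.indices.forall { i =>
      !medial(identifier.charAt(i)) || (i > 0 && i < identifier.length - 1)
    }
}
```

This rejects a trailing middle dot even though XID_CONTINUE alone would allow it.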

Non-ASCII Punctuation

The identifier spec recommends several other optional characters from non-Latin scripts. Some of these are confusable with the hyphen and middle dot.

For example, U+30FB KATAKANA MIDDLE DOT looks a lot like a regular middle dot, and U+058A ARMENIAN HYPHEN looks like a hyphen-minus if you don’t look at it closely.

The confusability issue is partly solved if your language disallows mixed-script identifiers. It would prevent for example an Armenian hyphen being used anywhere but between Armenian characters. But because hyphen-minus is part of the Common script, mixed-script detection does not prevent a regular hyphen-minus from being placed between Armenian characters!

Recommendation: Normalize Non-ASCII Hyphens and Middle Dots Based on Context

The following medial characters belong to specific scripts, and should be used in place of hyphens and middle dots in those scripts.

Hyphens:

  1. Hiragana and Katakana: 30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
  2. Armenian: 058A ֊ ARMENIAN HYPHEN

Middle Dots:

  1. Hiragana and Katakana: 30FB ・ KATAKANA MIDDLE DOT
  2. Tibetan: 0F0B ་ TIBETAN MARK INTERSYLLABIC TSHEG

To normalize these characters, use the following rules:

  1. If any of the above characters follows a character from a different script, it should be normalized to hyphen-minus or Latin middle dot, respectively.
  2. If a hyphen or middle dot follows a character from any of these scripts, it should be converted to the appropriate character from that script.

Examples:

  1. first֊rate (with Armenian hyphen) => first-rate
  2. il་lusio (with Tibetan tsheg) => il·lusio
  3. first・rate (with Hiragana/Katakana middle dot) => first·rate
  4. ウォルドルフ·アストリア (with Latin middle dot) => ウォルドルフ・アストリア (with Hiragana/Katakana middle dot)
  5. ウォルドルフ֊アストリア (with Armenian hyphen) => ウォルドルフ゠アストリア (with Hiragana/Katakana double hyphen)
  6. ཙ·ཚ (with latin middle dot) => ཙ་ཚ (with Tibetan tsheg)
  7. հայերեն-հայերեն (with hyphen-minus) => հայերեն֊հայերեն (with Armenian hyphen)
  8. il・lusio (with Hiragana/Katakana middle dot) => il·lusio (with Latin middle dot)
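The two rules and the examples above can be sketched in Scala using the JDK's Character.UnicodeScript to look up the script of the preceding character. This is an illustration only: the grouping into a "hyphen class" (including U+2010) and a "dot class" is my reading of the rules, and the names are hypothetical.

```scala
import java.lang.Character.UnicodeScript

object ContextualPunctuation {
  // Assumed equivalence classes: any member may be rewritten to another.
  private val hyphens = Set(0x002D, 0x2010, 0x058A, 0x30A0)
  private val dots = Set(0x00B7, 0x30FB, 0x0F0B)

  private def hyphenFor(script: UnicodeScript): Int = script match {
    case UnicodeScript.ARMENIAN => 0x058A
    case UnicodeScript.HIRAGANA | UnicodeScript.KATAKANA => 0x30A0
    case _ => 0x002D // hyphen-minus everywhere else
  }

  private def dotFor(script: UnicodeScript): Int = script match {
    case UnicodeScript.HIRAGANA | UnicodeScript.KATAKANA => 0x30FB
    case UnicodeScript.TIBETAN => 0x0F0B
    case _ => 0x00B7 // Latin middle dot everywhere else
  }

  /** Rewrite each hyphen or dot according to the script of the character before it. */
  def normalize(identifier: String): String = {
    val out = new java.lang.StringBuilder
    var previous: Option[UnicodeScript] = None
    var i = 0
    while (i < identifier.length) {
      val cp = identifier.codePointAt(i)
      val mapped = previous match {
        case Some(script) if hyphens(cp) => hyphenFor(script)
        case Some(script) if dots(cp) => dotFor(script)
        case _ => cp
      }
      out.appendCodePoint(mapped)
      previous = Some(UnicodeScript.of(cp))
      i += Character.charCount(cp)
    }
    out.toString
  }
}
```

Because only the preceding character is consulted, the result does not depend on which of the confusable characters the programmer happened to type.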

Recommendation: Allow Recommended Hebrew Punctuation Characters

If your language disallows mixed-script identifiers, as recommended in my last post (and in the Unicode security spec), the following characters can only be used after Hebrew characters. Furthermore, although they could be confused with apostrophes and double quotes, those ASCII characters are not allowed in identifiers, so the confusion cannot arise within an identifier.

Medial:

05F4  ״ HEBREW PUNCTUATION GERSHAYIM

Continue:

05F3    ׳   HEBREW PUNCTUATION GERESH

Miscellaneous ASCII Punctuation

The identifier spec recommends also allowing the following characters unless you have a compelling reason not to.

Start:

0024    $   DOLLAR SIGN
005F    _   LOW LINE

Medial:

002E    .   FULL STOP
003A    :   COLON

Recommendation: Allow _ but not $

A lot of mainstream languages allow the underscore (or low line) anywhere in identifiers, so there is no compelling reason to disallow it.

The same is not true of the dollar sign. The dollar sign, like other miscellaneous ASCII characters such as @, %, ~, is often allowed as part of non-word identifiers such as >=, ||, ->, ++, <$>. So I recommend allowing these in a separate class of non-word identifiers. Scala is an example of a language that takes this approach, having word identifiers, and ‘operator identifiers’ comprising misc. ASCII characters and Unicode math symbols.

Recommendation: Disallow . and :

‘.’ and ‘:’, like ‘/’, are often used in languages for separating the parts of ‘paths’ or ‘fully qualified’ names.

But, in many languages, it is best to think of these as just syntax for expressing a composite identifier with multiple parts, and not part of the content of the identifier itself. So I recommend disallowing these characters in ‘identifiers’, and only allowing them in ‘paths’ or ‘namespaces’ or ‘fully qualified names’.

Format Control Characters

The ZWJ (U+200D ZERO WIDTH JOINER) and ZWNJ (U+200C ZERO WIDTH NON-JOINER) characters are invisible, except when they affect the appearance of certain pairs of characters when placed between them. Although initially intended solely for formatting, ZWJ and ZWNJ now can actually change the meaning of some words.

The identifier spec provides a regex for identifying places in text when ZWJ/ZWNJ might cause an actual visual distinction. However, there are still many places this doesn’t catch, and many terminals and text editors don’t know how to render these correctly anyway.

Recommendation: Elide ZWJ/ZWNJ in Case-Insensitive Identifiers

200C    ZERO WIDTH NON-JOINER*
200D    ZERO WIDTH JOINER*

Now, case-folding elides ZWJ/ZWNJ, so if your identifiers are case-insensitive (meaning you are case folding them before comparing them), allowing them will not create confusability issues, since two otherwise equal identifiers will still be equal if one has a ZWJ/ZWNJ character and the other doesn’t. So for purposes of improved readability, I recommend allowing but eliding ZWJ/ZWNJ characters for case-insensitive identifiers.
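A small sketch of "allow but elide" for case-insensitive comparison. It assumes that lower-casing with Locale.ROOT is an acceptable stand-in for full Unicode case folding, which the JDK does not expose directly. The names are hypothetical.

```scala
import java.util.Locale

object CaseInsensitiveKey {
  /** Comparison key: drop ZWNJ/ZWJ, then approximate case folding. */
  def key(identifier: String): String =
    identifier
      .filterNot(c => c == '\u200C' || c == '\u200D')
      .toLowerCase(Locale.ROOT)
}
```

The identifier can keep its joiners for display, while lookups go through the key, so two spellings that differ only in ZWJ/ZWNJ resolve to the same name.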

International domain names (IDN) also allow but elide ZWJ/ZWNJ characters based on the same logic.

Recommendation: Disallow ZWJ/ZWNJ in Case-Sensitive identifiers

If identifiers in your language are case-sensitive, then I recommend that you simply disallow these characters for now.

None of the Unicode normalization forms knows how to handle ZWJ/ZWNJ correctly, by removing them when they are invisible or adding them when it’s more correct. I think it might be possible to create a normalization algorithm in the future that can do this. But if in the meantime these characters were allowed, you couldn’t incorporate a proper ZWJ/ZWNJ normalization in the future without breaking backwards compatibility.

So disallowing ZWJ/ZWNJ will mean certain words can’t be used as identifiers in their proper spelling, but programmers have been dealing with this forever (e.g. I can’t use can't as an identifier in most languages but it’s ok). And it leaves open the possibility of a proper implementation of ZWNJ identifiers in the future.

Summary

I have summarized all my recommendations for identifiers in a spec I call IPIC9 (immutable profile for identifiers in code).

Posted in Programming Language Design

duck library 2
July 19th, 2013 by john.warden@gmail.com

Most OO programmers have come across this situation: you have some types that don’t share any common supertype, but you wish they did, so you could write some generalized code that works for both types.

For example, you have a CartoonDuck and a RubberDuck class. They both quack, but you didn’t design them to implement a common Duck interface. That makes it hard to create a duck utility library that works with all kinds of Ducks.

Ill-Conceived Duck Library

The obvious solution is just to modify the original library code, and make the two duck classes implement a common Duck trait. But let’s say we can’t or don’t want to (e.g. we don’t have commit privileges for the Duck class library, or it is on a long release cycle).

The ability to extend functionality of a library without modifying its source is known as Retroactive Extension. Doing so without creating wrappers has been called Retroactive Polymorphism. There are a couple of great posts by Casual Miracles and Daniel Westheide that talk about using Type Classes in Scala to achieve retroactive polymorphism.

I would classify the particular kind of retroactive extension required for the Duck library (creating a common supertype for some classes without modifying their original code) as retroactive supertyping.

In this post I’ll explore various techniques for achieving retroactive supertyping. But for the impatient, I’ll skip to the end with a comparison of pros/cons for each:

Comparison

Let’s start with our base class library.

class CartoonDuck(saying: String) {
    def quack(): String = saying
}

class RubberDuck {
    def quack(): String = "Squeek!"
}

val donald = new CartoonDuck("What's the big idea?")
val daffy = new CartoonDuck("You're despicable!")
val rubberDucky = new RubberDuck()

/* Todo, implement describeDuckCollection and use it  */
//def describeDuckCollection(ducks: List[Duck]) { /* implement */ }
//describeDuckCollection(List(donald, daffy, rubberDucky))

But, we can’t implement describeDuckCollection(ducks: List[Duck]), because the Duck class doesn’t exist…

Solution 1: Modify Original Code

Again, often the best solution, but suppose we don’t want to do this.

Solution 2: The Adapter Pattern

Okay, so let’s create a Duck trait and create two adapter classes. This is usually a perfectly acceptable solution, especially if your code will be called from Java code that can’t use implicits, or maintained by Java programmers that don’t like implicits. The main drawback is it requires clients of your duck utility library to write extra code to wrap their ducks.

trait Duck {
    def quack(): String
}

def cartoonDuckAsDuck(d: CartoonDuck): Duck = new Duck { 
  def quack() = d.quack() 
}

def rubberDuckAsDuck(d: RubberDuck): Duck = new Duck {
  def quack() = d.quack()
}

def describeDuckCollectionUsingWrappers(ducks: List[Duck]): Unit = {
    print(
        "Here are my ducks: " +
        ducks.map(
          duck => "\tA duck that says '" + duck.quack() + "'"
        ).mkString("\n","\n","\n")
    )
}

val wrappedDonald = cartoonDuckAsDuck(donald)
val wrappedDaffy = cartoonDuckAsDuck(daffy)
val wrappedRubberDucky = rubberDuckAsDuck(rubberDucky)

describeDuckCollectionUsingWrappers(
    List(wrappedDonald, wrappedDaffy, wrappedRubberDucky)
)


Solution 3: The Rich Wrapper Pattern (or Pimp My Library Pattern)

If you don’t like explicitly creating those wrappedDonald, wrappedDaffy, etc. objects, then you can make your code more terse and at the same time more mystifying to newbie Scala developers — so they will respect you for your erudition even if they can’t work with your code 😉 — by using the Rich Wrapper pattern and implicit conversions.

/* Create a companion object to the Duck trait with the 
implicit conversion functions.  
Names of the functions don't matter, only signatures. */
object Duck {
    implicit def cartoonDuckAsDuck(d: CartoonDuck): Duck = new Duck {
      def quack() = d.quack() 
    }
    implicit def rubberDuckAsDuck(d: RubberDuck): Duck = new Duck {
       def quack() = d.quack()
    }
}

def describeDuckCollectionUsingWrappersAndImplicitConversions(ducks: List[Duck]): Unit = {
    print(
        "Here are my ducks: " +
        ducks.map(
          duck => "\tA duck that says '" + duck.quack() + "'"
        ).mkString("\n","\n","\n")
    )
}

describeDuckCollectionUsingWrappersAndImplicitConversions(
    List(donald, daffy, rubberDucky)
)

Here, the caller doesn’t have to explicitly wrap its ducks: Scala wraps them for you automatically! If the types of objects being passed to a function don’t match the required types, Scala will look for implicits, functions that can convert them to the right types. It will look for any methods marked implicit in the current scope, or in any relevant companion objects (in this case, the Duck object), and if it finds one with the right type signature, it will assume you want to use it to convert your object to the right type, and do it for you automatically.
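To make the lookup concrete, here is a cut-down sketch (one duck class, a hypothetical describe helper) showing a call that relies on the implicit conversion next to roughly what the compiler rewrites it to. The language import only silences the implicit-conversion feature warning in newer Scala versions.

```scala
import scala.language.implicitConversions

object ImplicitConversionSketch {
  class CartoonDuck(saying: String) { def quack(): String = saying }

  trait Duck { def quack(): String }

  object Duck {
    // Found through the companion object of the expected type, Duck.
    implicit def cartoonDuckAsDuck(d: CartoonDuck): Duck = new Duck {
      def quack(): String = d.quack()
    }
  }

  def describe(duck: Duck): String = "A duck that says '" + duck.quack() + "'"

  val donald = new CartoonDuck("What's the big idea?")

  // What you write: a CartoonDuck where a Duck is expected.
  val implicitCall: String = describe(donald)

  // Roughly what the compiler inserts on your behalf.
  val explicitCall: String = describe(Duck.cartoonDuckAsDuck(donald))
}
```

Both calls produce the same description. The only difference is who writes the wrapping.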

Solution 4: Structural Types

With structural types, I can retroactively create a supertype, and declare that anything that declares the quack() method of the right signature is an instance of that type.

/* If it quacks, it's a duck */
type StructuralDuck = {def quack(): String}

def describeDuckCollectionUsingStructuralType(ducks: List[StructuralDuck]): Unit = {
    print(
        "Here are my ducks: " +
        ducks.map(
          duck => "\tA duck that says '" + duck.quack() + "'"
        ).mkString("\n","\n","\n")
    )
}

describeDuckCollectionUsingStructuralType(
List[StructuralDuck](
  donald,
  daffy,
  rubberDucky
))

This one is simple, doesn’t require client code to explicitly wrap objects, and doesn’t use implicits! It seems like a perfect solution!

But, structural types only work if your classes all happen to have a quack method with the right signature.

And although structural types are considered to be type safe, they are not necessarily semantically type safe. Even though it’s probably safe to say that anything that has a quack method is a duck, in other cases, two classes sharing a method with a common name could be mere coincidence. For example, I can create a structural type {def open(): Unit}, and it would automatically be a common supertype for both Files and Doors, but it would be useless and potentially dangerous. So, just keep that in mind.
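The open() coincidence can be sketched like this, assuming Scala 2 (where scala.language.reflectiveCalls enables structural calls without a feature warning). Door and LogFile are made-up classes that share nothing but a method name.

```scala
import scala.language.reflectiveCalls

object StructuralPitfall {
  /* If it opens, it's... what, exactly? */
  type Openable = { def open(): Unit }

  class Door { var isOpen = false; def open(): Unit = { isOpen = true } }
  class LogFile { var isOpen = false; def open(): Unit = { isOpen = true } }

  // Compiles for doors and files alike, although "open" means different things.
  def openAll(things: List[Openable]): Unit = things.foreach(_.open())
}
```

The compiler is satisfied because the signatures line up. Whether opening a door and opening a file belong in the same collection is something it cannot judge.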

Solution 5: Type Class Pattern

Type classes, the go-to solution for retroactive extension in Haskell, are probably overkill for the simple problem we are trying to solve here.

Type classes have many of the same pros/cons as the Rich Wrapper pattern, but are more powerful because they allow for multiple dispatch. Our retroactive duck supertyping challenge doesn’t require multiple dispatch, but I’ll still show how the type class pattern would be used in this example.

trait DuckTypeClass[D] { def quack(d: D): String }

object DuckTypeClass {
    implicit def cartoonDuckService: DuckTypeClass[CartoonDuck] = 
      new DuckTypeClass[CartoonDuck] { def quack(d: CartoonDuck) = d.quack() }
    implicit def rubberDuckService: DuckTypeClass[RubberDuck] = 
      new DuckTypeClass[RubberDuck] { def quack(d: RubberDuck) = d.quack()}
}


def describeDuckCollectionUsingTypeClass[D: DuckTypeClass](ducks: List[D]): Unit = {
    print(
        "Here are my ducks: " +
        ducks.map(
          duck => "\tA duck that says '" + 
            implicitly[DuckTypeClass[D]].quack(duck) + "'"
        ).mkString("\n","\n","\n")
    )
}

So understand that here, unlike with the Adapter pattern, we are not creating wrapped Duck objects that implement some kind of Duck trait. Instead we are creating objects that are like services (called type class instances) that provide a static quack function, taking CartoonDucks or RubberDucks as arguments. We create just one instance of each of these services for each type of duck, and pass them implicitly to describeDuckCollectionUsingTypeClass.

Notice the : DuckTypeClass inside the type parameter. This is a context bound, which is syntactic sugar, essentially equivalent to defining the method signature as:

def describeDuckCollectionUsingTypeClass[D]
(ducks: List[D])(implicit duckService: DuckTypeClass[D])

And then

implicitly[DuckTypeClass[D]].quack(...) 

is also syntactic sugar, equivalent to writing

duckService.quack(...)

And to be precise, Scala will choose some unique name for the parameter, not necessarily duckService.

Okay, basically we create services that give us static quack and other duck-like functionality, we pass them implicitly to general purpose duck code, and use the context-bound implicit parameter to make method signature a little more terse.
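For a side-by-side view of the sugar, this self-contained sketch defines the same helper twice, once with a context bound and once with an explicit implicit parameter. The helper names quackSugared and quackExplicit are invented for the comparison.

```scala
object ContextBoundSketch {
  class RubberDuck { def quack(): String = "Squeek!" }

  trait DuckTypeClass[D] { def quack(d: D): String }

  object DuckTypeClass {
    implicit val rubberDuckService: DuckTypeClass[RubberDuck] =
      new DuckTypeClass[RubberDuck] { def quack(d: RubberDuck): String = d.quack() }
  }

  // Context bound: the implicit parameter is anonymous, so we summon it.
  def quackSugared[D: DuckTypeClass](duck: D): String =
    implicitly[DuckTypeClass[D]].quack(duck)

  // The same thing with the implicit parameter spelled out and named.
  def quackExplicit[D](duck: D)(implicit duckService: DuckTypeClass[D]): String =
    duckService.quack(duck)
}
```

Callers use both forms identically. The choice only affects how the method body gets hold of the instance.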

The Heterogeneous Collection Problem

But there’s a problem. Even though this works:

describeDuckCollectionUsingTypeClass(List(donald, daffy))

The following will give you a horrid little error that will make you wonder if it was all worth it:

describeDuckCollectionUsingTypeClass(List(donald, daffy, rubberDucky))

Output

error: could not find implicit value for evidence parameter of type this.DuckTypeClass[ScalaObject]
describeDuckCollectionUsingTypeClass(List(donald, daffy, rubberDucky))
^
So what’s going on? Well, the first function call works because the List only contains CartoonDucks. So Scala infers the type of the argument to be List[CartoonDuck]. It then looks for a typeclass instance of type DuckTypeClass[CartoonDuck], which it finds in the DuckTypeClass companion object. So all good.

In the second function call, since the list contains two different types of Duck objects, Scala infers the type to be List[ScalaObject] — the common supertype of RubberDuck and CartoonDuck. But, we can only implicitly pass one instance of DuckTypeClass[D] — either DuckTypeClass[CartoonDuck] or DuckTypeClass[RubberDuck]. Thus, failure!

Solution 6: Hybrid Type Class with Structural Type

However, ScalaObject isn’t the only common supertype of CartoonDuck and RubberDuck. We already defined the StructuralDuck type previously! So let’s just create a new type class instance for StructuralDucks in the DuckTypeClass companion object!

object DuckTypeClass {
  implicit def structuralDuckService: DuckTypeClass[StructuralDuck] = 
    new DuckTypeClass[StructuralDuck] { 
      def quack(d: StructuralDuck) = d.quack()
    }
}

Then we suddenly can deal with heterogeneous collections! Because Scala can deduce the list we are passing to be an instance of List[StructuralDuck], and it can find an implicit instance of DuckTypeClass[StructuralDuck], the call will work:

describeDuckCollectionUsingTypeClass(List(donald, daffy, rubberDucky))

Multiple Dispatch and Type Classes

But why would we want to do this? Aren’t we overcomplicating things? We had a perfectly good solution with structural types alone — why use structural types AND type classes?

The answer is, we wouldn’t. Don’t use a type class in cases like this. It’s overkill.

But let me show you when you would need a typeclass. Suppose we want our ducks to fight.

The trick is, the fight method takes two ducks as parameters, and the outcome of the fight depends on the type of duck. If we add the fight method to the type class instances cartoonDuckService and rubberDuckService, we have a problem. Only ducks of the same type could fight:

trait DuckTypeClass[D] { def quack(d: D): String; def fight(d1: D, d2: D): D }

object DuckTypeClass {
    implicit def cartoonDuckService: DuckTypeClass[CartoonDuck] = 
      new DuckTypeClass[CartoonDuck] { 
        def quack(d: CartoonDuck) = d.quack() 
        /* Body omitted here: the point is that only two CartoonDucks can fight */
        def fight(duck1: CartoonDuck, duck2: CartoonDuck): CartoonDuck = ???
    }

}

But using a type class and a structural type, any two objects with a quack method can fight.

object DuckTypeClass {
    implicit def structuralDuckService: DuckTypeClass[StructuralDuck] = 
    new DuckTypeClass[StructuralDuck] { 
      def quack(d: StructuralDuck) = d.quack()

      /* Define a fight method that works for any two structural ducks */
      import scala.util.Random
      def fight(blueCorner: StructuralDuck, redCorner: StructuralDuck): 
        StructuralDuck = (blueCorner, redCorner) match {

        /* Cartoon Ducks beat Rubber Ducks */ 
        case (rubberDuck: RubberDuck, cartoonDuck:CartoonDuck) =>
         cartoonDuck

        /* Obviously, order of parameters doesn't matter */
        case (cartoonDuck:CartoonDuck, rubberDuck: RubberDuck) => 
 
          fight(rubberDuck, cartoonDuck)

        /* For other combinations of ducks, let fate decide */
        case (duck1, duck2) => 
          if((new Random()).nextBoolean) duck1 else duck2
      }
    }
}

def describeDuckCollectionUsingTypeClassAndMultipleDispatch
[D: DuckTypeClass](ducks: List[D]): Unit = {

    /* Import quack and fight from the typeclass instance */
    val duckService = implicitly[DuckTypeClass[D]];
    import duckService._

    print(
        "Here are my ducks: " +
        ducks.map(
          duck => "\tA duck that says '" 
          + quack(duck) + "'"
        ).mkString("\n","\n","\n")
    )
    print(
        "They are fighting!  The winner says: "
        + quack(ducks.reduce(fight))
    )
}

describeDuckCollectionUsingTypeClassAndMultipleDispatch(
  List[StructuralDuck](
    donald, daffy, rubberDucky
  )
)


Now to be clear, what are the benefits of this solution? It means that now, we can create new Duck types, create DuckTypeClass instances for them, and include them in our duck collection, and all of our describeDuckCollection* code will still work. And we can do this all without touching either our original library code, or the describeDuckCollection implementations! If you are a fan of the open-closed principle, and like to extend functionality without modifying perfectly good code, that’s nice.
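As a sketch of that extensibility, suppose a RoboDuck class (hypothetical, and without even a quack method) shows up later. Placing its type class instance in its own companion object is enough for the existing generic code to pick it up. The describe function below is a simplified stand-in for the describeDuckCollection* functions.

```scala
object ExtensionSketch {
  // Stand-ins for the existing, unmodified type class and utility code.
  trait DuckTypeClass[D] { def quack(d: D): String }

  def describe[D: DuckTypeClass](ducks: List[D]): String =
    ducks
      .map(d => "A duck that says '" + implicitly[DuckTypeClass[D]].quack(d) + "'")
      .mkString("\n")

  // A new duck type added later; note it has no quack() method of its own.
  class RoboDuck(val model: String) { def beep(): String = "QUACK-" + model }

  object RoboDuck {
    // The companion of RoboDuck is searched when resolving DuckTypeClass[RoboDuck].
    implicit val roboDuckService: DuckTypeClass[RoboDuck] =
      new DuckTypeClass[RoboDuck] { def quack(d: RoboDuck): String = d.beep() }
  }
}
```

Neither the trait nor describe had to change, which is the open-closed property in miniature.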

Comparison

So each method has its pros and cons, like everything in life! I hope that this simple comparison will help you decide which solution will work best for your particular needs.

Comparison

Posted in Programming Language Design