Category: Programming Language Design

Immutability...it's like, woah
August 18th, 2016 by john.warden@gmail.com

In a previous post I introduced the Unicode Consortium’s Identifier Spec (TR 31) and Security Spec (TR 39), and explained my own recommendation for implementing Unicode identifiers in conformance with them: use a fixed whitelist of allowed identifier characters.

The Problem

As new versions of Unicode are released, what qualifies as an identifier could change.

For this reason, TR31 includes recommendations for immutable identifiers. The idea behind these is both backwards and forwards compatibility (or upward compatibility): make it so parsers/specs written against one version of Unicode recognize the same identifiers as parsers/specs written against future versions of Unicode. This approach is used in the XML 1.1 Recommendation.

The Unicode Consortium’s Recommendation

The Unicode recommended approach (UAX31-R2) allows all characters in identifiers (whether or not they are assigned in any specific version of Unicode), except known punctuation/control/whitespace/etc. Because this approach is so inclusive of potentially inappropriate characters (emoji, ideograms, etc.), the spec suggests that, alternatively, you can start with the valid identifier characters (from XID_START and XID_CONTINUE) of a specific version of Unicode, and then allow any characters that aren’t assigned in that version.

This approach essentially decouples your spec/parser from the Unicode Character Database. But the result is that you lack information about the properties of parsed identifiers, such as what script they belong to, and their NFC/NFKC/case-folded forms.

This makes it impossible to implement the recommended restrictions from the Security Spec (restricted scripts, non-NFKC, etc.). And most importantly, it makes it impossible to compare identifiers for equality, which requires NFC normalization at the very least. Indeed UAX31-R2 ends with the recommendation that identifiers not use any unassigned characters.

I emailed Mark Davis of the Unicode Consortium with an earlier version of this post, and he explained:

The immutable identifiers are designed for those cases (like XML) that can’t update across versions of Unicode. They are not particularly recommended. Any system that *can* update across versions of Unicode (as your proposal requires) are instead recommended to use the default identifiers.

So in other words, decoupling your identifier spec from the UCD like this makes sense for XML because it doesn’t concern itself with what you do with the tags/attributes in the parsed document, it just needs to decide “identifier/not-identifier”. But for a programming language, you can and probably should disallow unassigned characters so you can use the UCD to understand the properties of identifiers, most importantly so you can NFC/NFKC normalize identifiers and compare them for equality.

My Recommendation: Fixed Whitelist of Allowed Characters

Now, immutability is still a neat idea for a programming language. As a language designer you will be swimming in complexity and change: it would be nice if there were one aspect that was set in stone once and for all. The definition of an identifier could be one of those things.

Another way to achieve immutability is to simply use a fixed whitelist of allowed Unicode characters. I recommend using the latest list of allowed characters from the Security Spec as your identifier character whitelist, and then never upgrading this list.

Your gut reaction to forever pegging your language spec to a specific version of Unicode is probably “no”! Software developers are used to upgrading to the latest versions of everything.

But Unicode is a standard, not code, and as of Unicode 9.0, the Unicode character set is incredibly complete. We’ve gone beyond encoding all the world’s currently used languages to encoding obscure academic scripts and emoji. If there’s a potentially legal identifier character in some language that somebody might actually program in, it’s almost certainly already in Unicode.

The suggested identifier character whitelist data file contains the version of Unicode in which each character was added. A quick search shows that only 11 characters added to Unicode between 8.0 and 9.0 made it onto the whitelist, and another 9 previously existing characters (from the Tamil script) were added to the whitelist. In other words, evolution of this list has ground to a near halt.

Summary

You will still probably want to use the default syntax for identifiers to make sure your identifier doesn’t, say, start with a number, and to allow optional medial punctuation (e.g. hyphens). This syntax relies on the XID_START and XID_CONTINUE character properties from the Unicode Character Database (plus any optional characters you choose to add). You are probably using the version of the Unicode Character Database that is bundled with whatever language you are using to build your parser, which is probably older than the latest version (9.0 at the time of this writing). This means that the set of allowed identifier characters will be a subset of the characters in the identifier character whitelist, which is based on Unicode 9. So, as your UCD is upgraded, you will “grow in” to the full list of allowed characters in the whitelist.
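For illustration, here is a minimal Perl sketch of this intersection, in the spirit of the sample code further down this page. The property name InFixedWhitelist and the helper is_allowed_identifier are hypothetical, and the ranges shown are abbreviated placeholders rather than the real Unicode 9.0 whitelist data.

use strict;
use warnings;

# Hypothetical user-defined property: ranges copied once from the Unicode 9.0
# IdentifierStatus.txt data file and then never updated.
sub InFixedWhitelist {
  return "0030\t0039\n0041\t005A\n005F\n0061\t007A\n";   # ...abbreviated placeholder ranges...
}

# Default identifier syntax, using whatever UCD your Perl bundles...
my $identifier = qr/\A[\p{XID_Start}_]\p{XID_Continue}*\z/;

# ...intersected with the frozen whitelist.
sub is_allowed_identifier {
  my $id = shift;
  return ($id =~ $identifier) && ($id !~ /\P{InFixedWhitelist}/);
}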

If you follow this recommendation, you still have some choices to make about optional medial characters, restricting mixed-script identifiers, and NFC/NFKC normalization. I have defined an identifier profile called IPIC9 (Immutable Profile for Identifiers in Code), which uses the recommendation above and also specifies exactly which optional characters to allow, with specific rules for normalizing and restricting identifiers.

Using this spec will give you an extremely exhaustive, very clean set of legal identifier characters that will grow to a certain point and then never change.


Optional Punctuation in Unicode Identifiers
August 12th, 2016 by john.warden@gmail.com

In a previous post I introduced the Unicode Consortium’s specs for Unicode Identifier Syntax and Security (TRs 31 and 39), and summarized my own recommendations, in cases where the TRs leave you with options.

The Unicode Identifier Spec recommends a handful of optional punctuation characters to allow in Unicode identifiers. In this post I make specific recommendations for which of these optional characters you really should allow.

Hyphenation

The identifier spec recommends allowing both the hyphen (U+2010) and the ASCII hyphen-minus (U+002D) in the middle of identifiers (<Medial> characters).

They are also included in the security spec’s identifier character whitelist. However, although it is not clear from the spec, these characters should only be allowed as <Medial> characters. An identifier should not start or end with a hyphen.

Unfortunately, allowing both hyphen and hyphen-minus can create confusion, since these are not the same under either NFC or NFKC normalization. For example, the following would be valid, but distinct, identifiers.

first-rate (with U+002D HYPHEN-MINUS)
first‐rate (with U+2010 HYPHEN)

This possibility is mentioned in TR36 under Single Script Spoofing. However, as of Unicode 9.0 the security spec doesn’t make recommendations for dealing with this kind of single-script spoofing.

Recommendation: Allow Hyphen-Minus in Identifiers, Make Hyphen a Non-Canonical Alternative

If you follow the recommendation in the identifier spec to allow hyphen (U+2010) in medial position, then make the ASCII hyphen-minus the canonical form.

Even though the hyphen was added to Unicode as the ‘unambiguous’ hyphen, a hyphen-minus sandwiched between two words in an identifier such as pretty-print has unambiguous hyphen semantics, whereas “pretty – print” obviously means “pretty minus print” (or a formatting mistake).

Plus the ASCII hyphen-minus has a tradition of use inside identifiers in many languages, and it’s probably best to make sure the set of valid Unicode identifiers is a superset of valid ASCII identifiers.

Canonicalizing the proper hyphen to ASCII hyphen-minus will make life easier for coders in some cases, such as when documentation formatters convert a hyphen-minus in identifiers to a hyphen for display. A coder who unwittingly copies and pastes non-ASCII hyphens into their code should be none the wiser.
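A minimal sketch of this canonicalization in Perl, assuming $identifier already holds a matched, NFC-normalized identifier:

# Treat U+2010 HYPHEN as a non-canonical spelling of U+002D HYPHEN-MINUS.
$identifier =~ s/\x{2010}/-/g;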

Recommendation: Require Proper Spacing around Mathematical Minus and Hyphen

Furthermore, I recommend that the mathematical minus sign − (U+2212) be disallowed in places where it looks like a hyphen:

> first−rate

ERROR: Mathematical minus sign (U+2212) used as hyphen. Please use the ‘-’ character (U+002D hyphen-minus) instead, or place spaces around the minus sign.

And the hyphen (U+2010) should be disallowed in places where it looks like a minus:

> first ‐ rate

ERROR: Hyphen (U+2010) cannot be used by itself as a symbol. If you meant minus, use ‘-’ (U+002D hyphen-minus) instead.

Don’t parse ‘first-rate’ differently depending on whether a mathematical minus or a hyphen is used. Code that looks the same should work the same. The proper hyphen and mathematical minus were introduced to Unicode to allow clear semantic distinctions between hyphen and minus, but a hidden semantic distinction doesn’t justify visual confusion. Your code should be flexible enough to distinguish between hyphen and minus based on context, but strict enough to reject semantically inappropriate use of either of them.

Furthermore, I recommend considering mathematical minus and hyphen-minus to be canonically equivalent when used as part of non-word identifiers. For example, this is clearly a minus:

a - b
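A sketch of the corresponding checks in Perl; $source is assumed to hold a fragment of source text, and the exact error wording is of course up to you.

# A mathematical minus squeezed directly between identifier characters is an error.
if ($source =~ /\p{XID_Continue}\x{2212}\p{XID_Continue}/) {
  die "Mathematical minus sign (U+2212) used as hyphen. Use '-' (U+002D HYPHEN-MINUS)"
    . " instead, or place spaces around the minus sign.\n";
}

# A proper hyphen standing alone between spaces was probably meant as a minus.
if ($source =~ /(?:^|\s)\x{2010}(?:\s|$)/) {
  die "Hyphen (U+2010) cannot be used by itself as a symbol."
    . " If you meant minus, use '-' (U+002D HYPHEN-MINUS) instead.\n";
}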

Apostrophes

The identifier spec also recommends the apostrophe as an optional medial character.

Recommendation: Disallow Apostrophe

Unlike the hyphen, the apostrophe does not have a tradition of inclusion in identifiers in many computer languages. Plus, since the apostrophe doubles as a single-quote character, allowing it would make both visual and actual parsing quite difficult. Programmers who speak languages that have apostrophes in words have been dealing with omitting them from identifiers for a long time, and I don’t sense the same demand for apostrophes in identifiers as there is for hyphens. Allowing them is probably not worth the problems it can create.

If you follow this recommendation, do not allow the right single quotation mark either.

Disallow:

U+0027 ' APOSTROPHE
U+2019 ’ RIGHT SINGLE QUOTATION MARK

Middle Dots

Recommendation: Allow Middle Dot

The Catalans are a passionate people fiercely proud of their language, and many of them will be happy to use Catalan words, which can contain middle dots, in their code. Variants of the middle dot are also used in other scripts, including Katakana and Chinese, so it’s best to allow them.

00B7  · MIDDLE DOT
0387  · GREEK ANO TELEIA

U+0387 GREEK ANO TELEIA is canonically equivalent to the middle dot (NFC normalizes it to U+00B7), so it should be treated the same as the middle dot.

Recommendation: Normalize Hyphenation Point

The Hyphenation Point is confusable with U+00B7 middle dot. So allow it, but normalize to middle dot.

2027  ‧ HYPHENATION POINT
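In Perl this normalization is a one-line substitution; GREEK ANO TELEIA needs no special handling, since NFC already maps it to U+00B7.

# Fold the hyphenation point into the plain middle dot.
$identifier =~ s/\x{2027}/\x{00B7}/g;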

Recommendation: Disallow Middle Dots at End of an Identifier

The middle dot is a recommended medial character, but is also in XID_CONTINUE, which allows it to appear at the end of an identifier too. I recommend you disallow middle dot specifically at the end of identifiers.
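As a standalone check this might look as follows in Perl (the sample regex later on this page handles the same restriction with a lookbehind):

# XID_CONTINUE permits a trailing U+00B7 (and U+0387 before NFC), so reject it explicitly.
die "identifier may not end with a middle dot\n"
  if $identifier =~ /[\x{00B7}\x{0387}]\z/;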

Non-ASCII Punctuation

The identifier spec recommends several other optional characters from non-latin scripts. Some of these are confusable with the hyphen and middle dot.

For example, U+30FB KATAKANA MIDDLE DOT looks a lot like a regular middle dot, and U+058A ARMENIAN HYPHEN looks like a hyphen-minus if you don’t look at it closely.

The confusability issue is partly solved if your language disallows mixed-script identifiers. That would prevent, for example, an Armenian hyphen from being used anywhere but between Armenian characters. But because hyphen-minus is part of the Common script, mixed-script detection does not prevent a regular hyphen-minus from being placed between Armenian characters!

Recommendation: Normalize Non-Ascii Hyphens and Middle Dots Based on Context

The following medial characters belong to specific scripts, and should be used in place of hyphens and middle dots in those scripts.

Hyphens:

  1. Hiragana and Katakana: 30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
  2. Armenian: 058A ֊ ARMENIAN HYPHEN

Middle Dots:

  1. Hiragana and Katakana: 30FB ・ KATAKANA MIDDLE DOT
  2. Tibetan: 0F0B ་ TIBETAN MARK INTERSYLLABIC TSHEG

To normalize these characters, use the following rules:

  1. If any of the above characters follows a character from a different script, it should be normalized to hyphen-minus or Latin middle dot, respectively.
  2. If a hyphen or middle dot follows a character from one of these scripts, it should be converted to the appropriate character from that script.

Examples:

  1. first֊rate (with Armenian hyphen) => first-rate
  2. il་lusio (with Tibetan tsheg) => il·lusio
  3. first・rate (with Hiragana/Katakana middle dot) => first·rate
  4. ウォルドルフ·アストリア (with Latin middle dot) => ウォルドルフ・アストリア (with Hiragana/Katakana middle dot)
  5. ウォルドルフ֊アストリア (with Armenian hyphen) => ウォルドルフ゠アストリア (with Hiragana/Katakana double hyphen)
  6. ཙ·ཚ (with latin middle dot) => ཙ་ཚ (with Tibetan tsheg)
  7. հայերեն-հայերեն (with hyphen-minus) => հայերեն֊հայերեն (with Armenian hyphen)
  8. il・lusio (with Hiragana/Katakana middle dot) => il·lusio (with Latin middle dot)

Recommendation: Allow Recommended Hebrew Punctuation Characters

If your language disallows mixed-script identifiers as recommended in my last post (and in the Unicode security spec), the following characters can only be used after Hebrew characters. Furthermore, although they could be confused with the apostrophe and double quote, those characters are themselves not allowed in identifiers.

Medial:

05F4  ״ HEBREW PUNCTUATION GERSHAYIM

Continue:

05F3    ׳   HEBREW PUNCTUATION GERESH

Miscellaneous ASCII Punctuation

The identifier spec also recommends allowing the following characters unless you have a compelling reason not to.

Start:

0024    $   DOLLAR SIGN
005F    _   LOW LINE

Medial:

002E    .   FULL STOP
003A    :   COLON

Recommendation: Allow _ but not $

A lot of mainstream languages allow the underscore (or low line) anywhere in identifiers, so there is no compelling reason to disallow it.

The same is not true of the dollar sign. The dollar sign, like other miscellaneous ASCII characters such as @, %, ~, is often allowed as part of non-word identifiers such as >=, ||, ->, ++, <$>. So I recommend allowing these in a separate class of non-word identifiers. Scala is an example of a language that takes this approach, having word identifiers, and ‘operator identifiers’ comprising misc. ASCII characters and Unicode math symbols.
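A sketch of such a separate lexical class in Perl; the exact set of ASCII punctuation included is an assumption for illustration, not something the identifier spec prescribes.

# Operator identifiers: one or more math symbols (Sm) or miscellaneous ASCII punctuation.
my $operator_identifier = qr/\A[\p{Sm}!\$%&*+\-\/<=>?\@^|~]+\z/;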

Recommendation: Disallow . and :

‘.’ and ‘:’, like ‘/’, are often used in languages for separating the parts of ‘paths’ or ‘fully qualified’ names.

But, in many languages, it is best to think of these as just syntax for expressing a composite identifier with multiple parts, and not part of the content of the identifier itself. So I recommend disallowing these characters in ‘identifiers’, and only allowing them in ‘paths’ or ‘namespaces’ or ‘fully qualified names’.

Format Control Characters

The ZWJ (U+200D ZERO WIDTH JOINER) and ZWNJ (U+200C ZERO WIDTH NON-JOINER) characters are invisible, except when they affect the appearance of certain pairs of characters when placed between them. Although initially intended solely for formatting, ZWJ and ZWNJ now can actually change the meaning of some words.

The identifier spec provides a regex for identifying places in text when ZWJ/ZWNJ might cause an actual visual distinction. However, there are still many places this doesn’t catch, and many terminals and text editors don’t know how to render these correctly anyway.

Recommendation: Elide ZWJ/ZWNJ in Case-Insensitive Identifiers

200C    ZERO WIDTH NON-JOINER*
200D    ZERO WIDTH JOINER*

Now, case-folding elides ZWJ/ZWNJ, so if your identifiers are case-insensitive (meaning you are case folding them before comparing them), allowing them will not create confusability issues, since two otherwise equal identifiers will still be equal if one has a ZWJ/ZWNJ character and the other doesn’t. So for purposes of improved readability, I recommend allowing but eliding ZWJ/ZWNJ characters for case-insensitive identifiers.

International domain names (IDN) also allow but elide ZWJ/ZWNJ characters based on the same logic.
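A sketch of this approach in Perl, assuming case-insensitive identifiers are compared through a normalization key:

use feature 'fc';   # case folding, Perl 5.16+

# Strip ZWNJ/ZWJ and case-fold, so identifiers differing only by these
# invisible characters (or by case) compare equal.
sub identifier_key {
  my $id = shift;
  $id =~ s/[\x{200C}\x{200D}]//g;
  return fc($id);
}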

Recommendation: Disallow ZWJ/ZWNJ in Case-Sensitive identifiers

If identifiers in your language are case-sensitive, then I recommend that you simply disallow these characters for now.

None of the Unicode normalization forms knows how to handle ZWJ/ZWNJ correctly, by removing them when they are invisible or adding them when it’s more correct. I think it might be possible to create a normalization algorithm in the future that can do this. But if in the meantime these characters were allowed, you couldn’t incorporate a proper ZWJ/ZWNJ normalization in the future without breaking backwards compatibility.

So disallowing ZWJ/ZWNJ will mean certain words can’t be used as identifiers in their proper spelling, but programmers have been dealing with this forever (e.g. I can’t use can't as an identifier in most languages, but that’s ok). And it leaves open the possibility of proper handling of ZWJ/ZWNJ in identifiers in the future.

Summary

I have summarized all my recommendations for identifiers in a spec I call IPIC9 (immutable profile for identifiers in code).


unicode identifier dilemma
July 24th, 2016 by john.warden@gmail.com

Abstract: In this post I summarize the Unicode Consortium’s Recommendations for Identifier Syntax and Security (TR31 and TR39), and introduce some of the decisions that a language designer needs to make when implementing these.

I also recommend an exact syntax that conforms to the Unicode consortium recommendations for unicode identifiers while balancing issues of inclusiveness, confusability, and backwards and forwards compatibility. In a series of followup posts, I plan to discuss the reasons for these recommendations in detail.

The Decision

Many languages now allow Unicode characters in identifiers:

Julia

∑(x) = sum(x)

Haskell

type ℚ = Ratio ℤ

Scala

for (n ← 1 to 10) ...

These languages all have different specs for which characters are allowed in identifiers, though.

There is some debate on whether this is a good idea at all, if only because some coders will have trouble typing non-ASCII characters.

But there is also the issue of confusability, or homographs — different identifiers that look similar or identical — which can cause frustration and bugs at best, and security concerns at worst. For example, micro sign µ and Greek mu μ, scope in Latin and ѕсоре in Cyrillic, and worst of all, invisible control characters.

There is also the question of treatment of superscripts and subscripts, comparison, and backwards compatibility and immutability (so the set of valid identifiers doesn’t change as Unicode evolves).

On the other hand, not only is xᵦ² just pretty neat, it can be more readable than x[beta]^2 for some applications.

But most importantly, Unicode identifiers allow people to code in their preferred language. Certainly there’s some benefit to most open-source code being written and documented in English, the lingua franca of software development, and a coder working in another language will certainly be at a disadvantage, but if the choice is between coding in one’s own language and not coding at all, it may be worth it.

Unicode’s Recommendations for Identifiers

Fortunately, the Unicode folks have thought a lot about what non-ASCII characters make sense in identifiers, and have issued Technical Report #31 – Unicode Identifier and Pattern Syntax (which I’ll call the “Identifier Spec”) which includes a syntax for identifiers:

<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

The idea is that there are characters that can start an identifier (letters), a larger set that can continue an identifier (numbers, modifiers), and a few medial characters that can only occur in the middle (hyphens and middle dots).

Unicode defines character properties XID_START and XID_CONTINUE in the Unicode Character Database, which it recommends for use as <Start> and <Continue>, but the spec lets you define these however you like. For example, ‘_’ (underscore) is not in XID_START, but it is suggested you might want to include it anyway.

XID_START is the international equivalent of /[a-zA-Z]/, and is made up of:

  1. Letters (Lu, Ll, Lt, Lm, Lo) and letter-like numbers (Nl)
  2. A dozen exceptions, such as U+212E ( ℮ ) ESTIMATED SYMBOL, ‘grandfathered’ for backwards compatibility.

XID_CONTINUE is a superset of XID_START, and adds:

  1. Marks (Mn, Mc) — accents, etc. that combine with the characters before them
  2. Numbers (Nd) — the international equivalent of /0-9/
  3. Connector Punctuation (Pc) — the international equivalent of /_/
  4. A half-dozen miscellaneous characters, such as U+00B7 MIDDLE DOT

The identifier spec also suggests a small set of additional optional punctuation characters to use as <Medial>, including a variety of hyphens, middle-dots, apostrophes, and format control characters, which it exhorts you to use unless you have a good reason not to (though you probably have a good reason not to use some of them).

Normalization and Security

Unicode defines several normalization forms, of which NFC and NFKC are the relevant ones here; they can help address the issue of confusable identifiers (identifiers that look the same but aren’t). But normalization intentionally does not eliminate homoglyphs — two characters with distinct meanings that look the same (for example Latin ‘o’ and Cyrillic ‘о’).
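For concreteness, here is a small Perl illustration of the difference, using the same Unicode::Normalize module as the sample code below:

use Unicode::Normalize qw(NFC NFKC);

# NFC composes canonically equivalent sequences: 'e' + combining acute => 'é'.
print NFC("e\x{0301}") eq "\x{00E9}" ? "composed\n" : "unchanged\n";

# NFKC additionally folds compatibility variants: superscript two => '2'.
print NFKC("x\x{00B2}") eq "x2" ? "folded\n" : "kept\n";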

So to further reduce the possibility of identifier confusion, the identifier spec recommends excluding certain obsolete or limited use scripts.

Then there is a completely separate TR, Unicode Technical Report #39, Unicode Security Mechanisms (which I’ll call the “Security Spec”), which provides recommendations for further restricting certain characters and mixed-script identifiers.

Implementing these recommendations drastically reduces and sanitizes identifier space, removing most of the weirdness and confusability that comes from invisible characters and homoglyphs.

Other Considerations

The identifier spec also includes recommendations and alternatives for addressing:
– stability (backwards compatibility)
– immutability (forward compatibility)
– comparison/equality (case sensitive and insensitive)
– use of format control characters ZWJ and ZWNJ

So this leaves you with a lot of decisions to make. Do you restrict some or all characters recommended in the security spec? Do you allow hyphens, apostrophes, and middle dots in identifiers? Do you allow format control characters in some situations? Do you NFC or NFKC normalize? How do you define identifier equality? Do you make your identifiers “immutable” so that parsers built against different versions of Unicode don’t disagree on what an identifier is?

Some of these issues are tricky, and some of the Unicode recommendations are flawed. I will discuss these issues in detail in followup posts.

But jumping straight to conclusions, here is a specific set of recommendations for a balanced, reasonable definition of a Unicode identifier.

IPIC9: An Immutable Profile for Identifiers in Code

1. Base Syntax

  • <Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

Define <Start> as:

  • XID_START and U+005F LOW LINE

Define <Continue> as:

  • XID_CONTINUE, and U+05F3 HEBREW PUNCTUATION GERESH
  • But not U+00B7 MIDDLE DOT or U+0387 GREEK ANO TELEIA (these are Medial instead)

Define <Medial> as:

  • U+002D, U+2010, U+00B7, U+2027, U+30FB, U+30A0, U+058A, U+05F4, U+0F0B

These are: HYPHEN-MINUS, HYPHEN, MIDDLE DOT, HYPHENATION POINT, KATAKANA MIDDLE DOT, KATAKANA DOUBLE HYPHEN, ARMENIAN HYPHEN, HEBREW PUNCTUATION GERSHAYIM, and TIBETAN MARK INTERSYLLABIC TSHEG.

2. Normalization

Then Normalize with the following steps:

  1. NFC normalize
  2. Convert U+2010 HYPHEN, U+30A0 KATAKANA DOUBLE HYPHEN, and U+058A ARMENIAN HYPHEN to U+002D HYPHEN-MINUS
  3. Convert U+2027 HYPHENATION POINT, U+30FB KATAKANA MIDDLE DOT, and U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG to U+00B7 MIDDLE DOT
  4. Convert U+002D HYPHEN-MINUS following …
    1. … an Armenian character to U+058A ARMENIAN HYPHEN
    2. … a Hiragana or Katakana character to U+30A0 KATAKANA DOUBLE HYPHEN
  5. Convert U+00B7 MIDDLE DOT following …
    1. … a Tibetan character to U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
    2. … a Hiragana or Katakana character to U+30FB KATAKANA MIDDLE DOT

3. Restrictions

After normalizing, reject any identifier that:

  1. contains characters not labeled as “Allowed” in the identifier character whitelist from the Security Spec
  2. contains characters from more than one script

And don’t ever update the whitelist, even as new versions of Unicode are released. This way the definition of an identifier is immutable: it will never change even as Unicode evolves.

Don’t worry about losing out on important Unicode updates. First, Unicode guarantees backwards compatibility of identifiers: characters will only be added to this list. Second, Unicode 9.0.0 is so extremely comprehensive that characters are added to the whitelist very infrequently. Between Unicode 7.0.0 and 9.0.0, only 14 characters added to Unicode made it onto this list!

Sample Code

The following sample Perl code provides an implementation of this specification:

Regexes for Matching Identifiers

my $START = qr/[\p{XID_START}_]/;
my $MEDIAL = qr/[\x{002d}\x{2010}\x{00b7}\x{2027}\x{30FB}\x{30a0}\x{058a}\x{05f4}\x{0f0b}]/;
my $CONTINUE = qr/[\p{XID_CONTINUE}\x{05F3}]/;
my $NONEND = qr/[\x{0387}\x{00B7}]/;
my $WORD_IDENTIFIER_REGEX = qr/$START$CONTINUE*(?:$MEDIAL$CONTINUE+)*(?<!$NONEND)/;

Normalization

use Carp 'croak';
use Unicode::Normalize 'NFC';

sub normalize_identifier {
  my $string = shift;

  croak "Expected a string argument"
    if not defined $string;

  $string = NFC($string);

  # Hyphen, Armenian Hyphen, and Katakana Double Hyphen to hyphen-minus
  $string =~ s/[\x{2010}\x{058A}\x{30A0}]/\x{002D}/g;

  # Hyphenation Point, Katakana middle dot, and Tibetan tsheg to middle dot
  $string =~ s/[\x{2027}\x{30FB}\x{0F0B}]/\x{00B7}/g;

  ### Context-specific normalizations (based on the preceding character, per the steps above)
  $string =~ s/(?<=\p{Tibetan})\x{00B7}/\x{0F0B}/g; # middle dot after a Tibetan character to Tibetan tsheg
  $string =~ s/(?<=[\p{Katakana}\p{Hiragana}])\x{00B7}/\x{30FB}/g; # middle dot after Kana to Hiragana/Katakana middle dot
  $string =~ s/(?<=\p{Armenian})\x{002D}/\x{058A}/g; # hyphen-minus after an Armenian character to Armenian hyphen
  $string =~ s/(?<=[\p{Katakana}\p{Hiragana}])\x{002D}/\x{30A0}/g; # hyphen-minus after Kana to Hiragana/Katakana double hyphen

  return $string;
}

Restriction

use Unicode::Normalize 'NFKC';
use Unicode::Security 'mixed_script';

sub is_restricted_identifier {
  my $identifier = shift;

  return 'mixed-script'
    if mixed_script($identifier);

  return 'disallowed'
    if $identifier =~ /\P{InTR39AllowedCharacters}/;

  return 0;
}

sub InTR39AllowedCharacters {
  # List copied from: http://www.unicode.org/Public/security/9.0.0/IdentifierStatus.txt
  return q{0027
002D 002E
0030 003A
...etc...
  };
}

Other Recommendations

Here are some other things to consider when implementing a Unicode-based language:

  • The Unicode identifier spec covers word-like identifiers only. Create a separate lexical class of math-and-punctuation based identifiers (i.e. operators) comprising sequences of math symbols (Sm) and any ASCII punctuation not reserved for other purposes in your language.

  • This spec disallows non-NFKC characters, such as superscripts and subscripts. But that doesn’t mean you can’t allow superscripts in your source: just give superscripts and subscripts semantic value. For example, you can make xⱼⁿ⁺⁹ syntactic sugar for exp(nth(x, j), n+9).

  • Create a separate class of identifiers for the mathematical alphanumeric symbols such as 𝘽 and non-NFKC font variants of math symbols. These were designed to be visually non-confusable.

  • Emit syntax errors with specific reasons why identifiers are not matched (non-NFKC, mixed script, contains restricted characters per TR39, starts with a <Medial> character, etc.)

  • If your identifiers are case-insensitive, do case-folded comparisons with the locale explicitly set to US English or “empty” (root) for consistent behavior.

  • Require space around mathematical minus sign U+2212 to avoid confusion with hyphen.

Summary

If you implement the above spec, you can’t go too far wrong. It conforms to both Unicode’s identifier spec (TR31) and security spec (TR39). It is immutable (both backwards and forwards compatible), and minimizes confusability. It is highly inclusive, allowing virtually all letters, accents, and medial punctuation in current use in all living languages. It disallows just the few of Unicode’s recommended optional medial punctuation characters that are truly problematic (period, colon, apostrophe, and usually-invisible control characters). It restricts potentially confusable punctuation characters to their appropriate scripts, and reserves non-NFKC variants of characters (superscript, subscript, small, full-width, bold, doublestruck, etc.) for you to assign specific semantics appropriate to your language.

Followup

I’ve written followup posts with my reasoning for these recommendations.


Self-Evaluating Symbols and Lists in LISP
March 31st, 2016 by john.warden@gmail.com

The Problem

Why were keywords and vectors added to LISP, a language that was already based on symbols and lists?

The answer: because keywords and vectors are self-evaluating. Like string and number literals, they are what they are. Unlike symbols and lists, they are never treated as code that is replaced with some other value when evaluated.

You might argue that if you don’t want a symbol or a list to be treated as code, just don’t eval it! But this gets complicated if you are evaluating code that has non-code symbolic data embedded in it.

For example, it’s clear that the code below prints “2”.


> (def function 'print)
> (def arg 2)
> (eval (list function arg))
; prints 2

But what about this?


> (def function 'print)
> (def arg '(red blue))
> (eval (list function arg))
; ERROR: red is not defined

Even though we used the quote operator when we first defined arg to make sure the value was treated as a list of colors and not code, this value gets evaluated again during the evaluation of the generated code.

It can be hard to keep track of exactly when symbols and lists might be evaluated and thus need to be quoted. This is one reason LISP programmers prefer vectors and keywords for symbolic data that is not code. If data is not meant to be interpreted as code, it’s better if it can’t be interpreted as code — just as it’s better if the number 2 can’t be interpreted as anything but the number 2. You can eval vectors, keywords, and numbers all day long and they will never change.


> (def function 'print)
> (def arg [:red :blue])
> (eval (list 'print arg))
; prints [:red :blue]

There is more good discussion about why we use keywords as symbols in the Arcane Sentiment blog.

An Alternative for Self-Evaluating Data

So a way to embed self-evaluating, non-code symbolic data literals inside of LISP code is valuable. Keywords and vectors make this possible.

But another way of meeting this need would be to make symbols and lists self-evaluating by default! We could then have a way to deliberately mark expressions that are meant to be evaluated as code.

Below I explore how this might work in a variant of LISP.

: as Quote Operator

First, let’s make : the quote operator, instead of ', since LISP programmers are used to thinking of : as meaning “self-evaluating”.


> :(red blue)
(red blue)

If a single symbol is quoted, it looks like a keyword. But the : is not part of the value as it is in common LISP — it’s just the quote operator.


> :blue
blue

But since quoted data is self-evaluating by default, a symbol evaluates to itself.


> (eval :blue)
blue

' for Aliases

Second, let’s continue to use ' to quote expressions that are meant to be interpreted as code, since LISP programmers are used to that.

But instead of quoting a symbol, we can make ' part of the symbol. Let’s call symbols that start with ' aliases.


> 'x
'x

So now we have inverted the usage of : and '. In common LISP, : is a prefix attached to keywords, and ' is the quote operator. In this new LISP, : is the quote operator, and ' is a prefix attached to aliases.

Quoting Code

So let’s say aliases are evaluated in the same way as symbols are evaluated in common lisp — including rules for special forms. So to quote code, you’d need to replace every symbol in a common LISP expression with an explicit alias in this new LISP.


> (eval ('let (['x 4]) ('sqrt 'x)))
2

This is inconvenient. So let’s say placing ' in front of an expression is a syntactic shortcut for marking all symbols in the expression as aliases.


> '(let ([x 4]) (sqrt x))
'(let ([x 4]) (sqrt x))

' is actually part of the data, marking the entire expression as code.

The equivalent self-evaluating data structure would not include the '.


> :(let ([x 4]) (sqrt x))
(let ([x 4]) (sqrt x))

So the ' operator can still be used just as it is used in common LISP, for quoting symbolic data that represents code. But now we have the : operator for quoting self-evaluating symbolic data.

Self-Evaluating Lists

Now, common LISP assumes that lists represent function calls when evaluated. So the head of the list has to evaluate to a function, or we get an error.

> ("sqrt" 2)
ERROR: "sqrt" is not a function

Let’s say that in this new LISP, if the head of the list evaluates to a symbol, there is no error — the list is just left as a list.

> (:green "SUCCESS")
(green "SUCCESS")

Another way of looking at this is to say that a symbol acts like a value constructor that returns a list with itself as the head and its arguments as the tail.

Here’s another example of symbols as value constructors:


> (defun makeSquare (s) (:rect s s))
> (makeSquare 2)
(rect 2 2)

Back to Original Example

Now going back to the example at the beginning of this post, we can insert symbolic data in code and eval that code, and we get the expected result.


> (def function 'print)
> (def arg :(red blue))
> (eval (list function arg))
; prints (red blue)

So we constructed a list where the head is an alias to a function, and the argument is a list of symbols. The argument self-evaluates to a list of colors, because its head is a self evaluating symbol. But, the overall expression evaluates as a function call, because the head is an alias that evaluates to the function it is bound to.

Summary of : and ' Quotes.

  • : means quote
  • ' means mark all symbols as aliases
  • symbols are self-evaluating
  • aliases are evaluated as symbols are in common LISP
  • lists are evaluated as:
    • function-applications if the head evaluates to a function
    • lists if the head evaluates to a symbol
    • errors if the head evaluates to anything else

Lists for Trees, Vectors for Lists

It’s common for the head of a list to be a symbol that defines the semantics of the list, whether the list is used to represent code…

'(sqrt 2)

…or self-evaluating symbolic data where the head is a symbol indicating the data type.

:(rect 2 2)

This is called “tagged data” in SICP. In the case of both tagged data and code, lists function like trees, with a root that indicates the type or behavior of that node, and zero or more subtrees.

Contrast this to a list where the first item doesn’t have any special significance: it just happens to be first.

[:red :blue]

LISP programmers are more likely to use vectors in this latter case, partly because LISP will never try to eval a vector as a function-application, and partly because of the syntactic similarity of vectors to lists in other languages.

I’d argue that in LISP, lists have proved themselves most appropriate for representing tree-structured data, either code or tagged data, where the head of the list is a symbol or expression acting as the root of the tree. Vectors on the other hand have proved themselves most useful for representing, well, lists.

This code snippet (using the alternative LISP we just defined) illustrates the use of lists for code, lists for self-evaluating tagged-data, and vectors for lists of things.

(render [(:rect 3 2) (:rect 8 10)])

Nesting :-quotes and '-quotes

Naturally, you can nest '-quoted code:

> (eval '(let ([f 'print]) (f 2))) 
; prints 2

And you can nest self-evaluating data literals (:-quotes) inside of '-quoted code:

> (eval '[ (square 2), (:square 2) ])
[4, (square 2)]

And there’s no reason you shouldn’t also be able to nest code inside of self-evaluating data literals:

> (eval :[ ('square 2), (square 2) ])
[4, (square 2)]

The two expressions above produce identical results, but in the first case, we assume code and quote the part of the expression that we don’t want to be treated as code, while in the second case, we assume self-evaluating data and quote the part of the expression that we do want to be treated as code.

So just as in common LISP, any data can be evaluated. Data might be purely self-evaluating, or pure code, or code with some nested self-evaluating data literals, or self-evaluating data with nested code. When data is evaluated, aliases will be replaced with whatever they are bound to, and any lists where the head has evaluated to a function will be replaced with the result of applying that function — all with the exception of special forms.

Summary

So we have inverted the default assumptions about evaluation of symbols and lists. Instead of treating symbols as aliases for other values by default, and using : to mark exceptions, we treat symbols as self-evaluating by default, and use ' to mark exceptions. And instead of treating lists as function applications by default, we treat lists as self-evaluating if the head evaluates to a symbol, and only treat lists as function applications if the head evaluates to a function.

Code is data, but not all data is code. When LISP was created, it was intended that symbols and lists be used to represent all sorts of symbolic expressions — not just code. But the requirement for self-evaluating non-code data literals that can be included in code led to the introduction of keywords and vectors as self-evaluating supplements to symbols and lists.

With this new variant of LISP, we go back to using symbols and lists in self-evaluating data, and introduce aliases as non-self-evaluating symbols for use in code. We also allow both : and ' to be used for quoting expressions, letting programmers make higher-level decisions on whether quoted data is code or not, freeing them from the granular, case-by-case decisions about whether to use a keyword or a symbol.

Finally, making symbols act as value constructors creates a natural way to use lists either as trees representing tagged data or trees representing code, while leaving vectors to play the more natural role of listing things.


Procedures in a Pure Language – Part 3: Effects Without Side-Effects
June 11th, 2015 by john.warden@gmail.com

In my post on Procedures in a Pure Language – Part 1, I discuss how even pure functional languages can be used to create procedures that have effects, and how that is how things should be.

In Procedures in a Pure Language Part 2, I propose a little language where these impure procedures can coexist with pure functions in a way that makes the line between pure and impure very clear.

In this post, I propose adding a stricture to this language that ensures that, while procedures can have effects, they cannot have side-effects.

Here’s a quick review of a simple program in this language.

main = (console) -> procedure
  let greet = console.println("Hello, Bill")
  greet!

println and main are functions which, given a console argument, return procedures, which can be executed with the ! operator. Let’s call such procedures, which are bound to the object on which they operate, methods.

Containing Effects

Now let’s add this rule to our language: procedures cannot execute other procedures, other than methods on objects passed to them as arguments.

So procedures have no ability to directly reference or execute any other procedure: there are no built-in procedures like println, no global objects like Javascript’s window, and no ability to directly import services or objects like Java’s System.out.

Whoever runs the main procedure above can be certain it will have no effects outside of those that can be achieved by invoking methods on console. Since the caller has complete control over these methods, any effects of main are completely contained.

So these procedures can have effects, but since those effects are contained, they cannot have side-effects.

Contained Inconsistency

Along with not having effects, pure functions must be referentially transparent. In other words, they can’t be inconsistent. Let’s say procedures in our language can only be inconsistent as a result of invoking inconsistent operations on the objects that were passed to them as arguments. Their inconsistency is also contained.

“Impure” Inversion of Control

So a program can’t actually do anything unless it is provided with an object on which it can invoke methods. To OO programmers this sounds like dependency injection or inversion of control.

We can use given/inject preprocessing to achieve inversion of control in this language without the syntactic overhead of manual dependency injection.

given console
main = procedure  
  console.println!("Hello, Bill")

Procedures Inside Functions

There is no reason that a function cannot be pure, and still use procedural code in its implementation. For example:


greet = (name) ->
  mutable output = []
  output.push! "Hello, "
  output.push! name
  output.join("")

greet creates a locally-scoped mutable object, output, and manipulates it — thereby producing effects. But those effects are contained to output, which is not returned and therefore disappears when the function returns.

A function may be implemented using temporary internal stateful computations like this and still be pure if these states cannot affect the caller or the outside world.

Sandboxing

Since any effects of executing procedures are contained to objects passed to those procedures, we can sandbox their effects.

Presumably, when we run the above program in our hypothetical language, the interpreter will by default pass a real console object that will actually print to standard output. But let’s say we have the ability to create simple mutable objects that look something like this:

    mutable mockConsole = object
      output: [] # empty list
      println: (message) -> procedure
        output.push!(message)
    

Now we can pass a mock console to main:

    main = (console) -> procedure
      console.println!("Hello, Bob")
    main!(mockConsole)
    mockConsole.output // ["Hello, Bob"]

Since all services that the main function might use to have effects (network, filesystem, etc.) must be passed to it, commandline scripts written in our language are easy to test.

Conclusion

Requiring inversion of control for all dependencies on services that can be used to have effects gives the caller of the procedure complete control over its effects: effects without side-effects.


Purely Functional Environment Variables
June 11th, 2015 by john.warden@gmail.com

Global environment variables violate a core principle of functional programming. For example, this is not very acceptable in the FP world:

def hello = 
  if(locale == 'en.US') 
    "hello"
  elsif(locale == 'fr.FR')
    "bonjour"

locale shouldn’t just be there as a global. In a pure functional language it should be passed as a parameter, according to the principle of referential transparency.

def hello(locale) = 
  if(locale == 'en.US') 
    "hello"
  elsif(locale == 'fr.FR')
    "bonjour"

But this means you also have to pass locale to every function that calls hello.

def greet = (name, locale) ->
   hello(locale) ++ ", " ++ name

Which means you have to pass locale to every function that calls greet, and so on. Turtles all the way down.

Given/Inject Preprocessing

In my post on functional dependency injection and inversion of control, I discuss a solution for passing dependencies without this syntactic overhead, using given/inject statements and code pre-processing.

We usually think of “dependencies” as libraries or services. But anything your code depends on can be thought of as a dependency. In any case, we can use given to thread locale through our code without the syntactic overhead.

given locale

def greet = (name) ->
   hello ++ ", " ++ name

def hello = 
  if(locale == 'en.US') 
    "hello"
  elsif(locale == 'fr.FR')
    "bonjour"

The given statement causes:

  • greet and hello to be rewritten in the pre-processing stage to accept an invisible locale parameter
  • greet to pass the locale parameter in its call to hello.

Now other functions can inject a value for locale. The inject statement causes locale to be passed as a parameter to every subsequent function call in the code block that needs it.

def main(environment) = 
    inject locale = environment.locale
    greet("Everybody")

Now we have the convenience of what kind of looks like a global variable. But it is implemented in such a way that the core principles of FP — no side-effects and referential transparency — are respected.

Referential Transparency

You might argue that, because you don’t actually see these parameters literally being passed in your code, given/inject violates referential transparency.

But the referential transparency rule simply requires that an expression can be replaced with its value without changing the output of the program. The value of an expression involving a given simply depends on that variable being bound to a value. Once bound, the expression can be evaluated, and it will be interchangeable with its value.

Referential transparency is not about syntax: it’s about having well-defined, reproducible behavior that is determined entirely by inputs. There is nothing non-FP about using macros and pre-processors to save some keystrokes.

Clarity

Some programmers might fear that bugs and confusion may arise when programmers don’t realize that their program behavior depends on an environment variable. But if even indirect dependencies on environment variables must be declared with given statements, it should be just as clear to a programmer reading the code what is going on.

Preprocessing vs Monads

Another purely functional approach to environment variables is the use of the Reader Monad. But this requires a great deal more syntactic overhead than given/inject. The signature of every function must be modified to return a Reader, and every function call involves a bind. Code pre-processing with given/inject does this work for you.

Conclusion

given/inject preprocessing lets you use environment variables in a purely functional way without the work and complexity of threading them as parameters throughout your code.


Functional Dependency Injection and Inversion of Control
June 9th, 2015 by john.warden@gmail.com

The terms Dependency Injection and Inversion of Control tend to be used in OOP circles, though these concepts are applicable to virtually any language.

A good summary of the definitions of these concepts is the first answer, by Garret Hall, on the Stack Overflow question Inversion of Control vs Dependency Injection. Adam Warski has another good blog post on dependency injection in a post-OO world.

In this post I’ll talk about dependency injection and IoC in functional programming languages, and propose a solution for achieving IoC with given/inject preprocessing.

Dependency Injection for Dummies

Dependency Injection is when, instead of doing something like this:

import someLibrary.makeRect
def makeSquare(sideLength) = makeRect(sideLength, sideLength)

You do this:

def makeSquare(sideLength, makeRect) = makeRect(sideLength, sideLength)

I’m not using any particular programming language here. Just trying to illustrate concepts.

So makeSquare has a dependency, makeRect. We pass makeRect as a parameter to makeSquare in the second example, instead of hard-coding a reference to a specific implementation as we did in the first.

Instead of calling this little technique passing-of-dependencies-as-arguments, we call it dependency-injection. There you go.

Too Many Arguments

If you are a consistent dependency-injector, you inject all dependencies, deferring all decisions about concrete implementations to the highest level of your program — say the main function. So:

def main =
  import someLibrary.makeRect
  makeSomeSquares(makeRect)

def makeSquare(sideLength, makeRect) = makeRect(sideLength, sideLength)
def makeSomeSquares(makeRect) = [makeSquare(2, makeRect), makeSquare(4, makeRect)]

So you have littered your code with makeRects, passing it from main all the way down the call stack. This is ugly. Most people don’t actually do this.

Other Means of Inversion of Control

Inversion of Control is the more general principle we are striving for here. IoC just means not hard-coding specific implementations for dependencies, but rather just specifying what is needed (“something that makes a rectangle”), and letting something higher up make a decision about specific implementations.

Passing dependencies as parameters like this is just one way of achieving IoC. In the OO world, there are also IoC frameworks, service locators, the Template design pattern, and more. But what about the FP world?

Given and Inject

In a functional programming language where code is data, we don’t need containers or design patterns or anything like that. We can just create something that modifies our code to do the dependency injection for us.

Let’s define the given keyword, which is like import, but where you don’t hard-code a specific implementation.

given makeRect // instead of: import someLibrary.makeRect

def makeSquare(sideLength) = makeRect(sideLength, sideLength)
def makeSomeSquares = [makeSquare(2), makeSquare(4)]

So we no longer pass makeRect explicitly to makeSquare or makeSomeSquares.

Now let’s define an inject keyword for binding given dependencies to actual implementations.

def main =
  inject makeRect = someLibrary.makeRect
  makeSomeSquares

Now we have two simple, complementary keywords — given and inject — for achieving inversion of control, without the burden of “manually” injecting dependencies into every function call all the way down the stack.

given and inject Macros

given and inject can be thought of as pre-processor directives or macros, that cause your program to be transformed before it is evaluated. For example in a Lisp-like language, we could define GIVEN and INJECT macros, and write our program like this:

    (GIVEN [makeRect] 
      def makeSomeSquares ..etc...)
    
    (def main
      (INJECT [makeRect someLibrary.makeRect]
        makeSomeSquares))

After the macro evaluation stage, we’d have a program where makeRect is explicitly passed as a parameter to makeSquare and makeSomeSquares, but we’d never have to touch that awkward code.

The given and inject syntax is just an alternative syntax for achieving the same thing.

Conclusion

I don’t need to explain the merits of inversion of control — it’s one of the pillars of good OO design. However, it’s often dismissed as non-applicable in a language with higher-order functions. But as demonstrated above, manually injecting dependencies by passing functions as arguments can be cumbersome.

given/inject preprocessing allows you to inject your dependencies to achieve inversion of control without syntactic overhead or complexity, while respecting the principles of pure FP.


Procedures in a Pure Language – Part 2
May 31st, 2015 by john.warden@gmail.com

In my last post on Procedures in a Pure Language, I discussed how even a “purely functional” programming language such as Haskell actually allows you to create procedures that have side effects, and how that is the way things should be.

I also defined the following wish list for a programming language where:

  • I can write pure functions.
  • I can also write procedures.
  • I can treat procedures as values.
  • I clearly see in my code when I am defining a procedure and when I am defining a pure function.
  • I clearly see in my code when a procedure is executed.
  • I can write procedures without the syntactic overhead of the IO Monad required in Haskell.
  • I can contain procedures to only operate on specific objects, so that I can limit their effects to a sandbox.

Proposed Solution

Suppose we start with a simple untyped language with an -> operator for defining anonymous functions, and a ++ operator for concatenating strings.

    f = (name) -> "Hello, " ++ name
    f("Bill")

Since this program has no procedures, it doesn’t do anything other than produce the value “Hello, Bill” when evaluated.

Now let’s add procedures:

    main = (console) -> procedure
      console.println!("Hello, Bill")

I have defined a function, main, which takes an argument named console, and returns a procedure.

The body of a procedure is a sequence of imperatives. In this example there is a single imperative, console.println!("Hello, Bill"). An imperative is to an expression what a procedure is to a function: both produce values, but imperatives don’t have to be pure.

console.println, like main, is a function that returns a procedure. The ! operator says that this procedure should actually be executed, not just returned, at this point in the code. Otherwise, the result of evaluating main would be a procedure that, when executed, just returns another procedure.

Methods

console.println looks like what you’d call a method in OO. I’m not thinking we’re defining an OO language here, mind you. We could easily have written this as println console, but I like the . syntax here. Either way, println is a function that is somehow attached to the console value — or more specifically console is polymorphic: console itself supplies the definition of println. We don’t need to go into detail of exactly how this works (types? classes? typeclasses?). I’ll just say that functions like println that are attached to objects are called methods.

The “Apply and Execute” Operator

The ! binary operator could be thought of as “apply and execute”, because it applies a function to its arguments, and then executes the procedure that is returned.

You can also apply a function to its arguments without executing the resulting procedure:

    let greet = console.println("Hello, Bill")

The ! operator can also be used as a unary, postfix operator, which simply executes a procedure (instead of calling a function and executing the resulting procedure).

    greet!

Operations

Methods like println that return procedures are called operations.

The ! binary operator is used to invoke an operation by applying arguments to a function and then executing the procedure.

Summary

main = (console) -> procedure
  let greet = console.println("Hello, Bill")
  greet!
  console.println!("Hello, Bill") // another way of doing the above.

So main is a pure function that returns a procedure. println is an operation — a method that returns a procedure. println, like all methods, is also a pure function, because simply applying it has no side effects.

greet is a procedure, the result of applying println to its arguments in the expression console.println("Hello, Bill").

greet!, because of the presence of the ! operator, is an imperative.

console.println!("Hello, Bill") is likewise an imperative.

Summary of Definitions

  • Function: takes arguments, never has effects.
  • Procedure: takes no arguments, has effects when executed.
  • Method: functions attached to objects (enabling polymorphism).
  • Operation: method that produces a procedure.
  • Expression: has no effects, consistent.
  • Imperative: may have effects or be inconsistent.

Conclusion

We have defined a language that, like Haskell, allows us to define pure functions, which themselves can produce procedures that can be executed. The body of any function containing imperatives that execute procedures must be a procedure, just as in Haskell any function that uses a bind operation or do on an IO something must itself return an IO something. But our language has first-class procedures instead of the IO monad, and the ! operator instead of do or any of the bind operators.

Also just as in Haskell, “evaluating” the program has no side-effects. It just produces a procedure which you can then execute.

Our language doesn’t treat procedures as monadic values the way Haskell does. After a program is evaluated there is no need for something that can be bound, fmapped over, or stuck in a do block, since all you will ever do with this procedure is execute it.

Also, by treating procedures differently from monadic values, the language makes it even easier to see exactly when you are invoking procedures. This will be helpful to a programmer striving to minimize unnecessary use of impure code.

Posted in Programming Language Design

Pure
May 28th, 2015 by john.warden@gmail.com

It can be confusing that Haskell, which is supposed to be a pure language, lets you write procedures that produce side effects.

I think the key to clearing up the confusion is to understand that most Haskell programs are actually programs that produce programs, much like the C preprocessor (CPP). Conal Elliott’s post The C language is purely functional explores this idea.

The Haskell language itself is pure. When evaluated, Haskell expressions have no side effects, and are referentially transparent.

But the value returned by main in many Haskell programs is an IO something — which is a procedure that can be executed and may have side effects.

If GHC didn’t have the built-in ability to take an IO something and compile it into an executable procedure, a Haskell program could evaluate expressions all day long, but the resulting values would just disappear into the ether: the program could never output anything to STDOUT, because that would be a side effect.
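To make that concrete, here is a tiny illustration of my own (ignored is just a made-up name): an IO value that is never handed to the runtime simply never runs.

    -- Sketch: only the action returned by main is ever executed.
    ignored :: IO ()
    ignored = putStrLn "this is never printed"

    main :: IO ()
    main = putStrLn "only this line runs"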

Haskell has advantages over impure languages not because it removes the ability to write impure procedures, but because it adds the ability to write pure functions, guaranteed to have no side effects and to be referentially transparent. Most of a typical program can and should be written with pure functions, which are easier to maintain, understand, and debug. Often you only need impure procedures for a small part of the program, perhaps only the parts that actually write output to STDOUT or a database.

Unfortunately, when you need to write impure procedures in Haskell, there is a great deal of syntactic overhead that I don’t think is necessary.

The benefits of Haskell’s IO, to me, are:

  • I can write pure functions.
  • I can also write procedures.
  • I can treat procedures as values.
  • I clearly see in my code when I am defining a procedure and when I am defining a pure function.
  • I clearly see in my code when a procedure is executed.

I’d like to see a language with those benefits, but with additional benefits:

  • I can write procedures without the syntactic overhead of the IO Monad required in Haskell.
  • I can contain procedures to only operate on specific objects, so that I can limit their effects to a sandbox.

I think such a language is possible. Anyone want to help me create it?

Posted in Programming Language Design

Implicit Currying and Folded Application
May 25th, 2015 by john.warden@gmail.com

Implicit currying and folded application are language features that render moot the distinction between curried and un-curried functions, allowing functions written in curried style to be called as if they were un-curried, and vice-versa.

Quick Background: Currying

Currying is the technique of translating a function that takes multiple arguments, or a tuple of arguments (an n-ary function), into a sequence of functions, each taking a single argument. For example:

// binary function (uncurried form)
let product = (x,y) -> x*y

// curried form of same function
let product = x -> y -> x*y

The curried form of the function has to be invoked differently than the uncurried form:

(product 2) 4
// which, given left-associativity, is the same as
product 2 4

Instead of

product(2,4)

Partial Application vs Implicit Currying

Partial application is the ability to pass only the first argument to a function taking two or more arguments, and get back a function taking the remainder of the arguments. So you can write your function as an n-ary function, but call it as if it were curried.

Partial application is moot in a language like Haskell, where n-ary functions are not supported. Functions with multiple arguments have to be defined in curried form.

The advantages of languages like Haskell, where all functions take only a single argument, have been thoroughly explored elsewhere. But these advantages would not be lost in a language that allowed functions to be defined as if they were n-ary. It could be left to the compiler or interpreter to translate them automatically to curried form. So:

let product = (x,y) -> x*y

is just syntactic sugar for:

let product = x -> y -> x*y

This is different from partial application, in that n-ary functions still don’t exist. You still can’t pass multiple arguments to functions.

// won't work
product(x,y)

You have to pass arguments one at a time as you do in Haskell:

product x y

Folded Application

So now we have a language where all functions take a single argument, and multiple-argument functions are either explicitly or implicitly re-written in curried form. Now let’s also say that, as a syntactic convenience, we allow curried functions to be invoked as if they actually took multiple arguments. So:

product(2,4)
// is the same as
(product 2) 4

I call this folded application, because it involves implicitly doing a fold (or reduce) operation on the invisible apply operator, applying the function to the first argument, then applying the result to the second argument, and so on. So:

product(2,4)
// is implicitly translated to
reduce apply [product,2,4]

Curried/Uncurried Equivalence

There are languages, like Haskell, where all functions take just one argument (which has advantages), and other languages where functions can take multiple arguments (which also has advantages).

Implicit currying and folded application together allow you to mix paradigms. If you want to think of your functions as always taking single arguments, as in Haskell, you can, regardless of how the function is defined. If you prefer to think of the same functions as taking a list of arguments, you can. The distinction between a curried function and its equivalent n-ary function becomes moot. So if you define:

let product1 = x -> y -> x*y
let product2 = (x,y) -> x*y

Then all of the following are equivalent:

(product1 2) 4
(product2 2) 4
product1(2,4)
product2(2,4)
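For comparison, here is how this looks in present-day Haskell, where you have to bridge the two forms explicitly with the Prelude’s curry and uncurry. This is just a sketch of mine; with implicit currying plus folded application, all four calls would work without any conversion:

product1 :: Int -> Int -> Int
product1 x y = x * y

product2 :: (Int, Int) -> Int
product2 (x, y) = x * y

-- Today, the conversions have to be spelled out:
calls :: [Int]
calls = [ (product1 2) 4
        , (curry product2 2) 4
        , uncurry product1 (2, 4)
        , product2 (2, 4)
        ]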

Difference from Partial Application

A language that allows n-ary functions and partial application doesn’t quite give you curried/un-curried equivalence. It allows n-ary functions to be invoked as if they were curried, but it does not allow curried functions to be invoked as if they were n-ary. This asymmetry requires programmers to make a choice when defining their functions, and once the choice is made their options for using the function are limited. Curried/un-curried equivalence eliminates this dilemma.

Conclusion

Should languages mix paradigms? I could understand if your instinct is that you should choose a paradigm and stick with it. On the other hand, I think there are times when both paradigms are useful. The n-ary function paradigm can be more familiar and intuitive for some software developers and for certain problems. Indeed, I think it is more natural to think of product as a function that takes two arguments, not as a function that returns a function. On the other hand, there is great power in a language where, at a fundamental mathematical level underneath the syntactic sugar, all functions ultimately take a single argument. A language that allows both paradigms offers the best of both worlds.

Posted in Programming Language Design