Generic Syntax

generic syntax

Is it really necessary to invent a new syntax, complete with a new grammar and parser, for every new programming language? Is there a generic syntax that can conveniently express the semantics of many languages? Since all languages, whatever the syntax, are eventually parsed into an abstract syntax tree, couldn’t we just use a general language for expressing abstract syntax trees?

XML and LISP

XML was designed to be generic enough to express data and concepts from virtually any domain. So the AST of, say, a Java program could be represented in XML:

Java

public Integer sum(Integer x, Integer y) {
    return x+y
}

XML

<method name='sum' returnType='Integer' public='true'>
  <parameters>
    <parameter name='x' type='Integer'/>
    <parameter name='y' type='Integer'/>
  </parameters>
  <body>
    <return><plus arg1='x' arg2='y'/></return>
  </body>
</method>

Obviously, nobody would actually want to write their Java programs in XML — but bear with me. Once you have a data structure representing your Java program, your program is now data that can be manipulated before it is finally converted to bytecode, for example via an XSLT transformation. You can now extend Java semantics with macros — functions that act on your program itself. Macros are one of the Lisp programmer’s secret weapons, and part of the reason Lisp programmers are always pointing and laughing at the rest of us.

You could also represent a Java AST in JSON, but it would be even more unwieldy than XML:

{
  head: "method",
  name: "sum",
  returnType: "Integer",
  public: true,
  parameters: [
    {
      head: "parameter",
      name: "x",
      type: "Integer"
    },
    {
      head: "parameter",
      name: "y",
      type: "Integer"
    }
  ],
  body: [
    {
      head: "return",
      body: {
        head: "+",
        arguments: ["x","y"]
      }
    }
  ]
}


You could also represent your Java AST in Lisp, since Lisp is actually built on a generic syntax for expressing nested data structures, which is what an AST is. This is why there are so many domain-specific languages in the Lisp world. Your Java program, if represented as a Lisp expression, would look something like this:

'(method Integer sum :public true [(Integer arg1) (Integer arg2)] (
  (return (+ arg1 arg2))
))

Since this is quoted data, there are no pre-defined semantics. It can represent anything — in our case a Java program.

Improving on LISP

The above Lisp code is a lot more terse than the XML or JSON. But even so, but I imagine few people will want to write their Java programs in Lisp.

Cleaner, More Generic

But what if we could further simplify the syntax and take Lisp out of the equation? We could:

  • First, eliminate the quote operator, by dropping the assumption that every expression represents a functional program by default. This makes the syntax truly generic, more like XML.
  • Second, eliminate unnecessary parentheses. Use whitespace indent/outdent to indicate open/close parentheses.
  • Third, use infix notation for symbols like +

Our syntax now looks like this:

method Integer sum :public true [Integer x, Integer y]
return (x+y)

This is pretty terse and readable. In fact, I think it compares in brevity and clarity with the original Java:

public Integer sum(Integer x, Integer y) {
  return x+y
}

The Parsed Data Structure

Here’s how a generic syntax expression might be parsed into JSON:

  • Comma-separated lists are converted to arrays.
  • Other sequence are converted to maps containing:
    • a root (the first element)
    • any key-value pairs (public in our example)
    • a body (containing the rest of the list)

So this expression:

method Integer sum :public true [Integer x, Integer y]
  return (x+y)

Would be parsed into this JSON structure:

{
  "head": "method"
  "public": true
  "body": [
    "Integer",
    "sum"
    [
      {
        "head": "String"
        "body": ["x"]
      },
      {
        "head": "String"
        "body": ["y"]
      }
    ],
    [
      {
        "head": "return",
        "body": [
          {
            "head": "+",
            "body": [
              "x",
              "y"
            ]
          }

        ]
      }
    ]
  ]
}

Your compiler/interpreter could take this object and interpret it in a way that makes sense. For example, it would be quite straightforward to write an interpreter that applied some transformations or some macros and then converted this data structure into Java.

Other Language Examples

Before showing examples of using generic syntax to represent the ASTs of programs in other languages, I’ll add one more element to the syntax. Sandwiching a token between colons like :this: makes that token an infix operator. And I’ll say all infix operators are right-associative with equal precedence.

Notice how natural the below programs look using generic syntax.

Generic C

include "iostream"

function int main []
  cout << "Hello, World!"
  return 0

Generic Haskell

quicksort :: (Ord a) => (List a) -> (List a)
(quicksort List) = List
quicksort (p : xs) = 
  (quicksort lesser) ++ (List p) ++ (quicksort greater) 
  :where:
        lesser  = filter (< p) xs
        greater = filter (>= p) xs


Conclusion

And a bit more work needs to be done for this language to be complete and practical of course.

Many generic data interchange languages exist, such as XML, YAML, and JSON. But none have been defined to have the versatile terseness required for a programming language. Such a syntax could have interesting uses:

  • It could make it easy to create a new domain-specific languages or new programming language without defining a parser.
  • You could write code from your favorite language, say Java or PHP, using this new syntax, and then define macros that transform your program before it is finally converted to Java or PHP.

In another post, I hope to develop this generic syntax more thoroughly, show some examples of how you might replace your current syntax with generic syntax, so that you can use Macros to transform your code.

June 28th, 2013 by